
September 21, 2021 // Upgrade 2021: PHI LAB Speakers

The Future of Deep Learning: Why Optics?

Ryan Hamerly, Senior Scientist, NTT Physics and Informatics Lab

Transcript of the presentation The Future of Deep Learning: Why Optics?, given at the NTT Upgrade 2021 Research Summit, September 21, 2021.

Ryan Hamerly: Today I’m going to be talking about the future of AI, and why I believe that optics plays a pivotal role in that future. The outline of the talk is as follows: first, I’m going to discuss the important interplay between developments in AI and developments in computer hardware. Second, I’ll go into how developments in computer hardware necessitate the use of optics in future AI. And finally, I’ll talk about some of our current work that gives examples of this.

So, AI and hardware. AI, or deep learning, is a relatively recent discipline that combines trends in three different areas. The first is data, coming from many different disciplines, such as image and speech classification, or auto-generated data from applications like reinforcement learning and game playing.

The second is a class of algorithms called deep neural networks, which are just sequences of layers, each layer consisting of a matrix-vector multiplication (the linear step) followed by a nonlinear activation function. The third important factor is hardware. As deep neural networks get bigger, they get more powerful, but they also become more compute-hungry. The combination of these three factors (data, algorithms, and hardware) is really what has driven deep learning forward over the last decade and a half, and is what has enabled the amazing breakthroughs people have been seeing in recent years. An important trend to recognize is that with neural networks, it is almost without exception true that bigger is better. A great illustration of this is one of the earliest deep convolutional neural networks: LeNet.
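To make that structure concrete, here is a minimal NumPy sketch of such a network; the layer sizes and random weights are purely illustrative.

```python
import numpy as np

def relu(x):
    # Elementwise nonlinear activation
    return np.maximum(0.0, x)

def layer(x, W, b):
    # Linear step (matrix-vector multiplication) followed by a nonlinearity.
    # The matrix-vector product is what dominates the compute cost.
    return relu(W @ x + b)

# Illustrative sizes only: a 3-layer network mapping 784 inputs to 10 outputs.
rng = np.random.default_rng(0)
sizes = [784, 256, 64, 10]
weights = [rng.normal(0, 0.1, (n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=784)
for W, b in zip(weights, biases):
    x = layer(x, W, b)
print(x.shape)  # (10,)
```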

I would say this is the first convolutional neural network that was applied to an interesting problem. It was developed by Yann LeCun back in 1998, it was also one of the first neural networks to be trained by gradient descent via backpropagation, and it was very good at digit and handwriting classification, what we would now consider the very simple problem of MNIST digit classification. From there we go to AlexNet in 2012, a much larger and deeper neural network that was applied to a different image classification problem. Superficially these two classification problems look very similar, but in reality the ImageNet classification problem is much harder: it has a thousand different classes of images, and even if you or I looked at many of these images, it often wouldn’t be easy to solve the problem.

So AlexNet was the first neural network to perform reasonably well on this ImageNet problem, and after that, all of the winners of the ImageNet competition were based on deep neural networks. Then, around 2014 or 2015, the performance of the neural networks surpassed that of a trained human. So going from ’98 to 2012, what changed? Well, there are some very important things on the algorithm side, such as a change in the loss function from mean squared error to something much better statistically motivated for this problem, categorical cross-entropy, and a change in the activation nonlinearity from tanh(x) to ReLU(x), which avoids vanishing-gradient problems.
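For reference, a small NumPy sketch of those two algorithmic changes (illustrative numbers only): the cross-entropy loss that replaces mean squared error for classification, and the ReLU gradient staying finite where the tanh gradient vanishes.

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean squared error: the loss used in early networks like LeNet.
    return np.mean((y_pred - y_true) ** 2)

def categorical_cross_entropy(logits, y_true_onehot):
    # Softmax + cross-entropy: statistically better motivated for classification.
    z = logits - logits.max()            # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.sum(y_true_onehot * np.log(p + 1e-12))

def tanh_grad(x):
    # tanh saturates: its gradient vanishes for large |x|.
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # ReLU keeps a gradient of 1 for all positive inputs.
    return (x > 0).astype(float)

x = np.array([-5.0, -0.5, 0.5, 5.0])
print(tanh_grad(x))   # ~[0.0002, 0.79, 0.79, 0.0002] -> vanishing at the tails
print(relu_grad(x))   # [0, 0, 1, 1]
```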

But the main difference between the two is that AlexNet is much bigger. How much bigger? To quantify the size of a neural network, we need to look at two things. One is the amount of memory the network takes up, and that’s almost exclusively the total number of weights. The second is the amount of compute required, and for deep neural networks it’s the matrix-vector multiplication that is the bottleneck, so the important figure of merit is the number of multiply-accumulates, or MACs. For a typical neural network consisting of convolutional layers and fully connected layers, the number of weights and the number of MACs in each layer are products of quantities like the kernel size, the number of channels, and the number of neurons at the input and output, as in the sketch below.
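As a rough sketch of those per-layer counts (standard formulas; the dimensions below are purely illustrative, not the actual LeNet or AlexNet configurations):

```python
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    # A k x k convolution: each output pixel of each output channel
    # needs k*k*c_in multiply-accumulates (biases ignored for simplicity).
    weights = k * k * c_in * c_out
    macs = weights * h_out * w_out
    return weights, macs

def fc_layer_cost(n_in, n_out):
    # A fully connected layer is a plain matrix-vector product.
    weights = n_in * n_out
    macs = weights
    return weights, macs

# Illustrative numbers only:
print(conv_layer_cost(k=5, c_in=3, c_out=16, h_out=28, w_out=28))  # (1200, 940800)
print(fc_layer_cost(n_in=4096, n_out=1000))                        # (4096000, 4096000)
```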

So let’s compare LeNet to AlexNet. LeNet has about 400,000 MACs and around 60,000 weights. For AlexNet, you sum up the layers and find that it has about 600 million MACs and around 60 million weights. So going from LeNet to AlexNet entailed a roughly 1,600x increase in the number of MACs and a roughly 1,100x increase in the number of weights. And that’s only the start. Deep neural networks have been getting deeper and bigger ever since; going from AlexNet to ResNet was an additional roughly 15x increase. So far that has been supported by improvements in digital hardware, going from single-core CPUs to today’s many-core GPUs and special-purpose processors.

But of course the exponential trends have continued, both in the amount of memory and in the amount of compute, for training and for inference. So the real question is: are developments in hardware going to be sufficient to keep up with these exponential trends in the demand for neural networks? Up until now, the demand has always been met with improved electronic performance driven primarily by Moore’s Law. Moore’s Law says that in each “generation” of a microprocessor, which is roughly 18 to 24 months, the components on the microprocessor shrink (in particular, the transistor gate length shrinks by a factor of root two), and this doubles the number of transistors on the chip. And despite what we’ve been hearing over the last couple of decades about Moore’s Law being dead, it isn’t dead as far as this metric is concerned.
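For reference, the arithmetic behind that doubling (a standard back-of-the-envelope, not a figure from the talk):

```latex
% Gate length shrinks by sqrt(2) each generation, so area per transistor halves
% and the number of transistors per unit chip area doubles:
L \;\to\; \frac{L}{\sqrt{2}}
\quad\Longrightarrow\quad
A \propto L^{2} \;\to\; \frac{A}{2}
\quad\Longrightarrow\quad
\text{transistor density} \;\to\; 2\times
```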

But at the same time, gains in single-threaded performance have stalled. This is due to another scaling trend that used to hold but no longer does, called Dennard scaling. Dennard scaling says that if you scale down a transistor and apply a constant-E-field rule, you get a corresponding voltage scaling; that voltage scaling leads to a scaling of the gate delay; and the delay scaling leads to this rather remarkable power density that is independent of the size of the transistor or the number of transistors. That is great, because it means your transistors get faster and you don’t run out of power on your chip. But what has happened in practice is that Dennard scaling held until about 2006 and then flat-lined.
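For reference, the textbook constant-field (Dennard) scaling rules with scale factor κ > 1, which give the constant power density mentioned here:

```latex
% Constant-field scaling with scale factor \kappa > 1:
L \to L/\kappa, \qquad V \to V/\kappa \;\;(\text{constant } E = V/L), \qquad
C \to C/\kappa, \qquad I \to I/\kappa
% Gate delay and clock frequency:
\tau \sim \frac{CV}{I} \to \tau/\kappa \quad(\text{so } f \to \kappa f)
% Power per transistor and transistor density:
P_{\mathrm{transistor}} \sim C V^{2} f \to P/\kappa^{2}, \qquad
\text{density} \to \kappa^{2}\times
\quad\Longrightarrow\quad
\text{power density} \;\approx\; \text{constant}
```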

The reason it flat-lined is a combination of gate tunneling, which means certain features can no longer be made smaller, and thermodynamics, which means that the turn-on voltage of your transistor gates is no longer limited by the dimensions of the transistor but by thermodynamic factors you have no control over, unless you want to cool down your chip. This has led to a lower limit on CMOS voltages of around 0.5 volts, and with it the end of Dennard scaling and, correspondingly, the end of many of the scaling factors that used to drive microprocessor performance. So since 2006 we’ve gotten no more megahertz, since around 2010 no more watts per chip, and single-threaded performance has largely stalled. All of the subsequent improvements in chip performance have come from parallelization.
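The thermodynamic factor referred to here is, presumably, the Boltzmann-limited subthreshold swing of a transistor; as a reference point:

```latex
% Boltzmann-limited subthreshold swing at temperature T:
S \;=\; \ln(10)\,\frac{k_{B}T}{q} \;\approx\; 60\ \mathrm{mV/decade} \quad (T = 300\ \mathrm{K})
% Several decades of on/off current ratio therefore require a few hundred
% millivolts of gate swing, no matter how small the transistor is made.
```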

So how does this motivate optics? Let’s look back at our problem. We want to run deep neural networks, and the majority of that is matrix multiplication. And as we found, thanks to the breakdown of Dennard scaling, we’re now limited by energy consumption on the chip. So the goal of any neural network hardware is going to be to minimize energy consumption normalized to performance. Moreover, thanks to the breakdown of these scaling laws, chip performance is increasingly limited not by the energy consumed in the processing itself, but by the energy consumed moving data over interconnects. This has been recognized for some years now; I’m highlighting two papers here that I would call seminal in recognizing it.

One is a conference paper by Mark Horowitz that really laid out computing’s energy problem and discussed ways to solve it, and the other is a more recent paper that focuses on designing neural networks subject to these energy constraints. So how does optics help us? There are two possible ways. One is that we keep doing all of our logic and digital processing but use optics just to transfer the data. That would be optical interconnects, a field with a pretty long history, and also a history of many promises that were not met, or not met on time. But I think that now, with the increased demand for deep neural networks and massively parallel computing, it may be necessary to revisit these ideas of on-chip optical interconnects.

But that’s not what I want to talk about. I want to talk about whether optics is useful for deep learning not only as a replacement for interconnects, but also for doing the computing itself. And this is different from just building an optical computer. There has been a lot of work in the past on building general-purpose digital optical computers by making optical transistors, and it turns out that this is very hard, because many of the things that are easy in electronics, such as nonlinearity and memory, are hard to do in photonics, at least at the required energy scales. On the other hand, as we’ve seen with the progress in optical interconnects, things that are easy in electronics but carry an energy cost, such as communication and fan-out, are essentially free in optics once you get the data into the optical domain. So interconnects are still believed to be a promising application of optics.

But what I want to talk about is not interconnects; it’s the fact that optics is also potentially very useful for linear algebra. The reason is that to do linear algebra on digital processors, there is a large overhead going from the matrix products to the individual scalar products to the bitwise operations that build up those multiplications and additions. On the other hand, if you’re willing to keep your data in the analog domain, optics is almost an ideal platform for linear algebra, simply because Maxwell’s equations are (usually) linear, and therefore optical neural networks should be very promising. Most of deep learning, computationally, is just linear algebra; if you can map it onto passive linear optics, then potentially you get neural networks that run very fast in the analog domain, and neural networks are also known to be robust against analog errors.

In addition to that, you want to be able to show that an optical system has a potential speedup over a comparable digital system. To illustrate this, let me compare it to other forms of parallelism people are familiar with. One is data parallelism, which digital processors already take advantage of. You have a von Neumann bottleneck between memory and processor, and the way you usually get around that bottleneck is to reuse your data, whenever you send it to the processor, across many processing elements. But data parallelism is always limited by the energy consumption of the processing elements, as well as by the links on the chip. Then there’s quantum parallelism, which is very useful for specific problems where you can decompose the problem you want to solve by first encoding your data in a superposition and then acting on every element of that superposition at once.

For certain problems, such as factoring and chemistry simulation, this allows you to run much faster than on a conventional computer, even exponentially so. But for optical systems, I want to talk about what I would call optical parallelism. This is the idea that an optical system will usually be a subroutine in a larger, hybrid electro-optical system. You’ll have some data that comes in, gets converted to the optical domain, is processed in the optical domain, and then is reconverted to the electronic domain. If you’re able to do a large number of operations in the optics, either passively or with ultra-low energy consumption relative to the number of conversion steps, then you get a large parallelism factor. And this parallelism factor can take many forms.
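One rough way to write down that parallelism factor (a back-of-the-envelope formulation; the symbols E_conv, E_opt, N_conv, and N_ops are my shorthand, not notation from the talk):

```latex
% A hybrid electro-optic subroutine performs N_ops operations per pass but
% needs N_conv domain conversions of energy ~E_conv each, plus optical energy E_opt:
\frac{E}{\mathrm{op}} \;\approx\; \frac{E_{\mathrm{conv}}\,N_{\mathrm{conv}} + E_{\mathrm{opt}}}{N_{\mathrm{ops}}}
% The conversion overhead is amortized when the parallelism factor is large:
N_{\mathrm{ops}}/N_{\mathrm{conv}} \;\gg\; 1
```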

One is fan-out, where I first convert the data to the optical domain and then reuse it many times, either in space or in time. Another is integration, where data points are integrated either in time or in wavelength. And a third kind of optical parallelism is depth: if I have a large number of optical signals propagating through some structure, the propagation can perform a function that requires more operations than the number of inputs and outputs. A great example here is the optical Fourier transform, where you get N log N operations with only N inputs and outputs, or propagation through a beamsplitter mesh, where you get N squared operations with only N inputs and outputs.
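For reference, the rough operation counts behind those two examples, assuming each of the N inputs and N outputs is converted once:

```latex
% Optical Fourier transform:
N_{\mathrm{ops}} \sim N \log_{2} N, \qquad N_{\mathrm{conv}} = 2N
\;\Rightarrow\; N_{\mathrm{ops}}/N_{\mathrm{conv}} \sim \tfrac{1}{2}\log_{2} N
% Beamsplitter mesh implementing an N x N matrix:
N_{\mathrm{ops}} = N^{2}, \qquad N_{\mathrm{conv}} = 2N
\;\Rightarrow\; N_{\mathrm{ops}}/N_{\mathrm{conv}} = N/2
```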

So the takeaway here is that optical processors show an advantage when the optical parallelism is maximized, and the amount of parallelism is closely tied to the system size. Scaling is therefore paramount to making any kind of optical system perform well. I’d now like to touch on some of our current work and how we hope these new approaches to machine learning might enable us to solve the problems I mentioned before.

In this work, I’m going to talk about three particular approaches we have to optically accelerated computing: one based on beamsplitter meshes, one based on coherent detection, and a third, an approach to edge computing based on optics.

I’ll start with the most conventional one, which is what I would call a weight-stationary optical neural network. Many of our colleagues also refer to it as a programmable nanophotonic processor, or PNP. The idea here is that you want to perform matrix-vector multiplication on data that’s encoded in the optical domain. The way that’s done is by first encoding that data in the coherent amplitudes of signals that enter a beamsplitter mesh. The beamsplitter mesh consists of a sequence of Mach-Zehnder interferometers, and each of these Mach-Zehnder interferometers performs a particular two-by-two unitary transformation on the data. If you cascade enough of these two-by-two unitaries, you can realize a general N-by-N unitary transformation between the inputs and the outputs of the system.
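As a rough sketch of that principle in NumPy (using one common MZI parametrization; conventions vary, and in the standard Reck or Clements decompositions roughly N(N-1)/2 such MZIs plus output phase shifters realize an arbitrary N-by-N unitary):

```python
import numpy as np

def mzi(theta, phi):
    # One common convention: two 50:50 beamsplitters with an internal phase
    # shift theta and an external phase shift phi on one input.
    bs = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)
    return bs @ np.diag([np.exp(1j * theta), 1]) @ bs @ np.diag([np.exp(1j * phi), 1])

def embed(u2, i, n):
    # Embed a 2x2 unitary acting on waveguide modes (i, i+1) into an NxN identity.
    u = np.eye(n, dtype=complex)
    u[i:i+2, i:i+2] = u2
    return u

def random_mesh(n, depth, rng):
    # Cascade layers of MZIs on alternating adjacent pairs (rectangular layout).
    u = np.eye(n, dtype=complex)
    for layer in range(depth):
        for i in range(layer % 2, n - 1, 2):
            u = embed(mzi(rng.uniform(0, 2*np.pi), rng.uniform(0, 2*np.pi)), i, n) @ u
    return u

rng = np.random.default_rng(0)
U = random_mesh(n=8, depth=8, rng=rng)
print(np.allclose(U.conj().T @ U, np.eye(8)))   # True: the cascade stays unitary
x = rng.normal(size=8) + 1j * rng.normal(size=8)  # input optical amplitudes
y = U @ x                                         # N^2 MACs done "in propagation"
```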

Now, an N-by-N matrix multiplication is N squared MACs, but here I only have N inputs and N outputs, so the potential speedup from optical parallelism is a factor of order N. I should mention that this is not the only way to do it. This approach was pioneered by our group at MIT, actually before I joined, but there is a competing approach based on microring resonator weight banks. The idea is the same: you encode the weights of a neural network in phase shifts. In one case the phase shifts are in Mach-Zehnder interferometers, and in the other they are in the individual rings, where they lead to detunings of the rings. As a result, the number of phase shifters on a chip is closely tied to the number of weights.

So in one case we have half an MZI per weight, in the other one ring per weight, and there are of order N squared weights in a given layer of the neural network. This immediately leads to a challenge in scaling this weight-stationary ONN to large neural networks: if you have N squared weights in a given layer, then fitting them all onto a chip limits the size N of your circuit. You might say, this is nanophotonics, you could put millions of these on a chip. But that’s actually not true, because nanophotonics is usually not at the nanoscale; it’s at the micro-scale, because the components have to be larger than the wavelength of light. So it’s very challenging to scale this up to circuit sizes large enough to get a big advantage.

And a third issue, which I think is more overlooked, is that when you have deep circuits like this, you run into cascadability and error-correction problems. The point is that if there are errors in the programming of any of the individual MZIs, then as the light cascades through, the overall error in the matrix becomes much larger. This is very similar to what Logan mentioned with Physics-Aware Training: you can have a deep network, and if each layer is off, then the whole network is off by a large amount.

So I think these PNPs are limited in many senses, but they are still worth exploring. If you’re interested in how you would solve some of these error-cascading problems, I encourage you to check out the poster we have in the poster session, where I talk about an algorithmic way to do this. There is a close tie here to both the concept of a digital twin and Physics-Aware Training, in that we’re using a measurement-assisted process to compensate out those errors. With that algorithm, you can show that you can recover the original canonical performance of these neural networks, even in the presence of errors in the hardware.

Now, in addition to the weight-stationary ONNs, there is another class of ONNs called output-stationary. The first one of these that we came up with was the homodyne optical neural network based on photoelectric mixing. The idea here is that you don’t encode your weights in phase shifters; the weights are streamed into your system in the optical domain. And if the weights are streamed in optically, then you’re no longer limited by the number of photonic components on your chip, or rather, you’re still limited, but by something that scales as N rather than N squared.
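For context, the textbook balanced-homodyne identity behind this kind of photoelectric multiplication (a standard relation, restated here):

```latex
% Interfere a signal amplitude x with a weight amplitude w on a 50:50
% beamsplitter, detect both output ports, and subtract the photocurrents:
I_{\pm} \;\propto\; \tfrac{1}{2}\,\lvert x \pm w \rvert^{2}
\quad\Longrightarrow\quad
I_{+} - I_{-} \;\propto\; 2\,\mathrm{Re}\!\left(x^{*} w\right)
% Each detector pair thus performs an analog multiply; integrating the
% difference current over a stream of pairs (x_k, w_k) accumulates \sum_k x_k w_k.
```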

So these output-stationary ONNs can potentially scale to much larger neural network layers than you could get with the weight-stationary ONNs, and that leads to much larger parallelism factors, which can lead to much better theoretical performance than weight-stationary schemes. In fact, the performance seemed good enough that, in this earlier PRX paper, we compared it to the quantum limit, and it seemed like it might be possible to reach quantum-limited performance, akin to what Peter talked about in his talk. Now, I’d say we’re not there yet; we’re a long way from there. Right now we’re just trying to experimentally demonstrate that the concept works, and there are two directions we’ve pursued. One is a free-space demonstration, where the goal is to demonstrate a many-mode optical neural network, just to show that this can potentially scale to many modes; with free space, if you look at SLMs, you could potentially get 10^6 modes.

So a very large neural network would theoretically be possible. The downside is that with existing off-the-shelf free-space optics you’re not going to be particularly fast; if you’re using SLMs, this will be at most hertz to kilohertz. In addition, we’re also looking at integrated demonstrations, where we have fewer modes but can drive them much faster. This is work we actually started before the NTT collaboration, and we’re just starting to get results out of these chips that I think are very exciting.

Finally, I’d like to talk about an approach to optically accelerated edge computing that I think is very interesting, because it’s a very different application from the other two, which were really designed as stand-in replacements for data centers.

This is motivated by the fact that in the IoT era, we’re essentially swimming in sensors and drowning in data: we have too much data coming into our sensors and not enough compute power at those edge sensors. So the question is whether there is a way optics can help enable very effective deep learning inference on edge devices that are tightly constrained in size, weight, and power. You can think of it as going to the edge: you dramatically reduce your available resources, but at the same time we want to increase the complexity of the problems we’re solving. The concept we’ve come up with, which we recently presented at a conference this summer, is called NetCast. The idea is that you can split your computing problem into two tasks.

One task is generating the weights of your neural network and encoding them in an optical format. That optical signal is then sent to the client, which performs a small amount of post-processing: it just amounts to sending the signal through a single modulator and then passively demultiplexing it. With that small amount of additional processing, you’re able to use this data to perform a complicated task. To make this concrete, what we’re doing is a matrix-vector product, which takes N squared MACs. Most of the cost, the N-squared cost, is in encoding the weights in the optical domain; there is an order-N cost associated with modulating on the client and reading out. So if you have an optical link between a high-power server and a low-power client, you can use such a scheme to compute on the client without any really serious size, weight, and power constraints.
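A rough accounting of that division of labor (my shorthand, not notation from the talk):

```latex
% Per matrix-vector product of size N x N:
\underbrace{\mathcal{O}(N^{2})}_{\text{weight encoding, server side}}
\;+\;
\underbrace{\mathcal{O}(N)}_{\text{modulate + read out, client side}}
\quad\Longrightarrow\quad
\frac{\text{client operations}}{\text{MAC}} \;\sim\; \frac{\mathcal{O}(N)}{N^{2}} \;=\; \mathcal{O}(1/N)
```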

We showed some preliminary theory work suggesting that this can potentially have very high performance. One important question is how much data you can send over that link and how that limits the performance of the client processor. You find that it’s largely limited by crosstalk, but within the crosstalk limits the capacity is actually quite large, so the number of MACs you can perform is quite large. Another important question is the energy limit, and one important limit on the energy is set by the incoming optical power. So we can do analyses very similar to what Peter talked about, looking at shot-noise- and Johnson-noise-limited performance on matrix multiplication, and benchmark this against simple MNIST classification problems. In simulation we found that, with the right designs, you could again get performance below a single photon per MAC. That’s very encouraging, especially if your link is very lossy and you get very few photons at the client side.
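A back-of-the-envelope version of the sub-photon-per-MAC argument, assuming shot-noise-limited coherent readout (my summary, not a formula from the talk):

```latex
% For an N-element dot product read out coherently, signal amplitudes add
% coherently while shot noise adds in quadrature, so with n photons per MAC
\mathrm{SNR} \;\sim\; \sqrt{N\,n}
\quad\Longrightarrow\quad
n \;\sim\; \frac{\mathrm{SNR}^{2}}{N} \;\ll\; 1 \quad \text{for large } N
```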

All right. So we’re in the process of trying to realize this experimentally. I have two great students working on this right now, and we also have a collaboration with a professor at CSAIL at MIT who is looking more at the software-stack side of this. So thanks for listening to the talk. In conclusion: why optics? Optics is important because of these applications, deep neural networks for learning complex tasks and the exponential growth trends in those networks, and because, with the end of Dennard scaling, the energy challenges to continuing Moore’s Law are reaching a breaking point. So what can optics do? I think the effective use of optical parallelism and the reduction of interconnect bottlenecks are very important roles optics can play.

And we hope to be able to harness that in all the platforms we’ve talked about here: the weight-stationary PNPs, the output-stationary homodyne ONNs, and NetCast for edge computing. Of course, this isn’t just work I did by myself. I’d like to acknowledge all the people involved, principally the students who did most of the work, Alex, Saumil, and Liane, and all of our collaborators. If you’re interested in learning more about this topic, there is a whole bunch of papers I can recommend, and I promise this isn’t just self-promotion: most of the papers in the left column are classic papers, and if you’re interested in our work, that’s in the right column.

Ryan Hamerly

Senior Scientist, NTT Physics and Informatics Lab

Ryan Hamerly first discovered physics in high school, where he taught himself electromagnetism to build a Tesla coil. During college (B.S. 2010, Caltech) he studied theoretical particle physics and general relativity. Since graduate school (Ph.D. 2016, Stanford), Hamerly has pursued research in quantum control, quantum optics, and nonlinear optics. His current work focuses on the emerging nexus of photonics, deep learning, quantum computing, and optimization.