
September 21, 2021 // Upgrade 2021: PHI LAB Speakers

How Applying the Backpropagation Algorithm Enables Deep Physical Neural Networks

Logan Wright, Research Scientist | NTT Research Physics & Informatics Lab

Transcript of the presentation How Applying the Backpropagation Algorithm Enables Deep Physical Neural Networks, given at the NTT Upgrade 2021 Research Summit, September 21, 2021.

Logan Wright: Right, so yeah, I’m Logan Wright. I’m with NTT as well as Cornell. And I want to talk to you about, or at least advertise, work that we’ve done trying to figure out how we can turn literally any physical system into a deep neural network, which is less ambitious than it sounds. And what I want you to get from this talk, at a high level, is that we’re taking a very broad perspective on all natural systems as performing some kind of computation, and we’re harnessing that with two key ingredients of deep learning, namely deep, trained features and the backpropagation algorithm. And something that I’m going to introduce to you that might not be obvious is the simulation-reality gap that we need to overcome to actually make this happen in real physical systems.

So you’ve already met Peter, he’s been involved in this work. I want to introduce you to Hiro, who is sitting right over there and who collaborated on this work with me. And of course, as Peter mentioned, this has been a collaboration between NTT and Cornell. So, deep learning: there’s lots and lots of hype about deep learning, and I think, especially in academia, it’s easy to get disillusioned by that. But I have to say, the impact that deep learning has had, and in particular the performance improvements that we’ve seen on such a wide range of tasks, is very significant. Even though there’s a lot of hype, there’s really something very substantial going on with deep learning. And for the purposes of this talk, I want to say that what deep learning is, in a nutshell, is learning hierarchical computations from data using the backpropagation algorithm on multi-layer neural networks.

And what I mean by hierarchical computation is that a deep neural network consists of layers of trained nonlinear functions. At a very high level, what we’re doing is training the parameters we have so that when we put an image into the neural network, it spits out, for example, the correct classification. And hierarchy means that this nonlinear function actually consists of layers. So we’re taking the output of one nonlinear function and re-feeding that into the next one, and so on and so on. This allows the neural network to automatically break down a very complex task into a sequence of very simple tasks, and it’s really essential for what deep learning has been able to accomplish. And as a result, deep learning is growing exponentially in a lot of different ways.
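To make that hierarchy concrete, here is a minimal NumPy sketch of a generic deep network, with placeholder layer sizes and random weights standing in for trained parameters; it is an illustration only, not anything from the talk:

```python
import numpy as np

def layer(x, W, b):
    # One trained nonlinear function: a learned linear map
    # followed by an element-wise nonlinearity.
    return np.tanh(W @ x + b)

def deep_network(x, params):
    # Hierarchy: each layer's output is re-fed into the next layer,
    # so a complex task is broken into a sequence of simpler ones.
    for W, b in params:
        x = layer(x, W, b)
    return x

# Example: a 3-layer network acting on a 784-pixel image vector.
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
scores = deep_network(rng.standard_normal(784), params)
```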

And maybe the reason that this is happening is that the key thing deep learning can do is harness more and more computing power. The more computing power you throw at it, the better the results seem to get, which sort of seems like a panacea from some points of view. And as Peter mentioned, now we’re looking at all the things that have been happening with deep learning and starting to think in new ways about how we can build computers, because the calculations that happen in deep neural networks are actually much more like the types of natural computations that happen in noisy analog physical systems. I have a picture here of Richard Feynman, who motivated quantum computing by saying that quantum systems can simulate themselves very efficiently, so why not use quantum systems?

We’re now looking at natural physical systems and starting to think: can we use those to do computations like those in deep learning? And I do want to point out these papers at the bottom. They were very inspirational for me in terms of understanding and thinking about why physical systems are well-suited to deep neural network calculations in particular. And to attempt to quantify that, in the arXiv paper that I’m talking about, we go through some theoretical analysis to look at a variety of different physical systems that we might wish to make neural networks out of, and to try to quantify how much energy benefit we could get by using them. How much better is it compared to the state of the art? And the numbers that you see here are for systems like ultrafast nonlinear nanophotonics.

This is based on work by Marc Jankowski over there, or on multimode nonlinear fiber optics, for example, and the potential energy advantage that you get is crazy big. I was shocked by how big these numbers are. But before we leave this slide, I just want to say, this is strictly potential, right? This is if everything goes according to plan. We’re nowhere even close to this. In fact, we’re on the other side; we’re at like ten to the minus six right now. But in principle, if we do everything correctly, there’s a huge, huge benefit, right? These numbers are a million, ten million, a billion. So there’s a lot to do and a lot to gain by doing this. And one way that people have tried to do this is something called physical reservoir computing. In physical reservoir computing, you take input data and you feed it into a physical system, which we call a reservoir, and as that physical system evolves, you measure different parts of it.

And the things that you’re measuring are natural nonlinear functions of that input data. By training a linear output layer, this thing W here, we can learn how to combine those natural functions to approximate some function, y = f(x), using natural computations. And there are a lot of advantages to doing this. It really was inspirational for us, I think, because it shows that a very wide range of physical systems provide useful natural computations for doing machine learning. People have made reservoirs out of a bucket of water and octopus arms, as well as more practical systems, including obviously photonics. But reservoir computing is missing some very key details.
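To be concrete about what is and isn’t trained in this picture, here is a toy NumPy sketch of reservoir computing, with a fixed random map standing in for the physical reservoir; it is illustrative only, not a model of any of those systems:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out, n_samples = 10, 200, 3, 500

A = rng.standard_normal((n_res, n_in))        # fixed, untrained "physics"
X = rng.standard_normal((n_samples, n_in))    # input data (placeholder)
Y = rng.standard_normal((n_samples, n_out))   # targets for y = f(x) (placeholder)

# Measurements of the evolving reservoir: natural nonlinear functions of the input.
H = np.tanh(X @ A.T)

# Only the linear output layer W is trained, e.g. by ridge regression.
lam = 1e-3
W = np.linalg.solve(H.T @ H + lam * np.eye(n_res), H.T @ Y)
Y_pred = H @ W                                # learned combination of natural features
```

The inner "physics" is never touched; only W is learned, which is the shallowness discussed next.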

I mentioned that deep learning was about trained hierarchical computations, and reservoir computing is inherently shallow. We’re only training this output layer; we’re not training a sequence of trained operations. (indistinct) Is the audio okay?

Moderator: I think so.

All right. (indistinct) Okay, so one solution to this is called deep reservoir computing, very optimistically, I think. The idea is that you take one reservoir, one physical system, and you feed its output into another one and yet another one, and you collect all the measurements from all these systems, concatenate them into one big, long feature vector, and train an output layer to combine them. This gives you more features, and some of them may be more useful, but in practice they’re not trained hierarchical features. So you’re not getting all the way to deep learning, and as a result, deep reservoir computers cannot achieve the performance that has been achieved with deep learning. So in thinking about what to do, we took a look at the deep neural network layer. This is like the atom of the neural network.

And all a layer really is, is a trainable nonlinear function. A typical realization is some matrix-vector multiplication, where we train the matrix to implement the trainable part of the nonlinear function, followed by an element-wise nonlinearity; this is very loosely biologically inspired. But actually, every physical system, if you think about it, is a controllable function. Any physical system you have, there are probably some things you can control about it, some parameters you can tune. And if you put a signal into that system, as you tune those parameters, you change the way that system affects that input signal; that is, by controlling those physical parameters, you control the natural computation it performs on the signal. So the idea we have is that we just take the layer in the deep neural network and replace it with a controllable physical system, and, repeating the analogy, we just cascade these controllable physical systems to create deep physical neural networks.
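In sketch form, the swap looks like this: the mathematically defined layer is replaced by a call into a controllable physical system, and those calls are cascaded. The `run_physical_system` function below is a hypothetical stand-in with a toy spectral transform; it is not a model of the actual experiment:

```python
import numpy as np

def run_physical_system(x, theta):
    # Hypothetical stand-in for a real apparatus: imprint the data x and the
    # trainable control parameters theta onto a signal, let the physics evolve,
    # and measure an output. A toy transform is used purely as a placeholder.
    field = x * np.exp(1j * theta)            # encode data and controls on a "pulse"
    return np.abs(np.fft.fft(field)) ** 2     # "measure" a nonlinear output spectrum

def deep_physical_neural_network(x, thetas):
    # Cascade of controllable physical systems: each measured output is fed
    # into the next system, just like layers in a conventional deep network.
    for theta in thetas:
        x = run_physical_system(x, theta)
    return x

# Example: three cascaded "systems", each with its own trainable parameters.
rng = np.random.default_rng(0)
x0 = rng.random(64)
thetas = [2 * np.pi * rng.random(64) for _ in range(3)]
output = deep_physical_neural_network(x0, thetas)
```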

And to be concrete: we have a physical system that is implementing some trainable nonlinear function physically, and we cascade it by feeding the output into another one and another one and so on, until we get a final computation result after we’ve trained these parameters properly. And the thing that I really found exciting about this work is that we haven’t really made any assumptions here. Any physical system can now be used to make a physical neural network. And we did make physical neural networks out of a whole bunch of stuff. Now, I’m excited about this; I think it’s really cool. I will get to some caveats, so don’t stop paying attention, because this sounds really good right now, but there are some issues to pay attention to. Before I get into the issues, though, I want to give you one example, which will be familiar to some people in this room.

It’s based on ultrafast nonlinear pulse propagation in quadratic nonlinear optical media. To some of you, that might be complete gobbledygook. All you need to know is that this is some physical system. It has terahertz bandwidth, it does a rich variety of nonlinear computations, and it looks nothing like what appears in a conventional neural network. So what we’ve done in the lab: Hiro and I put together an experiment where we send an ultrafast pulse into a nonlinear optical crystal, and we have shaped the spectrum of that pulse, the different frequency components, to have data encoded on it, as well as trainable parameters. These control how the pulse propagates through this nonlinear optical crystal, and the output is just the spectrum that we measure of this broadband second harmonic generation.

And this is what it looks like in the lab. In practice, what we do is feed in a vowel formant frequency vector. This is a vector from a dataset; it’s a series of information about a spoken vowel. We pass that data into the physical system, which computes some trained function, and we take the output and feed it into the next physical system, which in this case is just the same physical system later in time, but with new parameters. And we do that five times. The final output spectrum tells us the predicted vowel, and to our delight, it does it correctly most of the time. You can see here, this is the spectrum; this is the energy in the different frequency components, and the place where there’s the most energy is the region we’ve chosen to correspond to the vowel “a”, so it actually does do the correct thing.

And just to remind you, we haven’t done anything digital in the sense of doing matrix multiplication or adding a nonlinear activation function. This is all just the nonlinear optics doing the job. And here’s a confusion matrix; for those of you skilled in the art, you’ll see that this is telling us how often we’re getting the different classes right. And here are some examples of what the output spectra look like when it’s classifying different vowels. So, as I mentioned, there are some caveats. This sounds great in principle, but there are some things we need to work on. The first one is that, yes, we can make every physical system into a neural network, but the vast, vast, vast majority of physical systems will make completely useless neural networks. It’s only really special systems that will make good PNNs.

And one of the interesting things is that we don’t even really know very well what those physical systems are. I think it’s now an interesting physics problem to understand which physical systems are good at performing machine learning. But we do have some clues, and those clues come from which systems have been good reservoir computers: things that are deterministic, that have some noise but not too much, that have nonlinearity but not too much, and that compute features in very high-dimensional parameter spaces. And for PNNs, they also need to have many controllable parameters. The final caveat, which I’m going to address in this talk, is that we actually need to be able to train those parameters, which is the really tricky part in this business. So, in the past, one approach that people have used to try to train physical systems to perform computation is gradient-free learning algorithms, like genetic algorithms or simulated annealing.

Well, there are lots of cool things about these algorithms, and you don’t even need a model; you can just take the physical system and train it. But these algorithms scale very, very poorly to high-dimensional parameter spaces. Today’s deep learning models have a billion parameters, even 100 billion parameters, and these algorithms don’t scale beyond, let’s say, 100 parameters. As a result, people have not been able to do more than fairly trivial logic operations with them. In deep learning, what they use is the backpropagation algorithm. The backpropagation algorithm uses a trick where, given an analytically differentiable description of the optimization problem, it can efficiently compute gradients that allow you to basically know the perfect direction to go during the optimization. And it gives you a massive improvement in how efficiently you can do this optimization as the number of parameters increases.
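To illustrate the scaling argument, here is a small NumPy sketch of a generic differentiable problem (made up for illustration, not anything from the talk). A finite-difference probe stands in for the many per-parameter queries a model-free method needs, while backpropagation gets every gradient component from a single reverse sweep of the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                    # number of trainable parameters
A = rng.standard_normal((N, N)) / np.sqrt(N)
x = rng.standard_normal(N)
target = rng.standard_normal(N)

def loss(theta):
    # A generic differentiable optimization problem.
    return 0.5 * np.sum((np.tanh(A @ (theta * x)) - target) ** 2)

def probe_grad(theta, eps=1e-6):
    # Without a differentiable description, you probe one parameter at a time:
    # roughly N + 1 evaluations of the system per optimization step.
    base = loss(theta)
    grad = np.zeros(N)
    for i in range(N):
        step = np.zeros(N)
        step[i] = eps
        grad[i] = (loss(theta + step) - base) / eps
    return grad

def backprop_grad(theta):
    # Backpropagation: one forward pass plus one reverse sweep of the chain rule
    # gives all N components of the gradient at once.
    u = theta * x
    y = np.tanh(A @ u)
    dy = y - target            # dL/dy
    dz = dy * (1.0 - y ** 2)   # back through tanh
    du = A.T @ dz              # back through the linear map
    return du * x              # dL/dtheta, since u = theta * x

theta = rng.standard_normal(N)
g = backprop_grad(theta)       # ~2 passes of work instead of ~N + 1 evaluations
```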

So if N is the number of parameters, backpropagation is N times faster at doing an optimization than gradient-free algorithms. And there are some other nice things about backpropagation and gradient descent as well. So we were initially inspired by something we’re calling backpropagation in silico. There’s work on that in the near-term quantum computing literature, as well as work coming out of Shanhui Fan’s group at Stanford, that really inspired us. Our initial plan was to train these physical systems by doing backpropagation in simulation. And to do that, we built digital twins of our nonlinear physical system. Over here, you see the inputs, the parameters and the input data, and over here, you see the output of the physical system in blue and the output of the digital twin in red.

And for those of you who are in nonlinear optics, this is ridiculously good agreement. We were super pleased to see this, and we thought, okay, obviously backpropagation in simulation is going to work amazingly. And we did it; we had amazing results on the computer over here. Then we put the parameters into the experiment, and we just bawled our eyes out, because it was horrible. The experiment didn’t work at all. And the reason it didn’t work is what I call the simulation-reality gap. That is, even if you have a really good model of a physical system, or any system, it’s only so good. And as you simulate that system for longer periods of time, or in our case, as you feed the output of that physical system into yet another simulation and yet another simulation, the gap between simulation and reality just grows.
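Here is a toy NumPy illustration of that error growth, with a small systematic gain mismatch standing in for everything a digital twin fails to capture; the model and the numbers are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_passes = 64, 10
W = 1.5 * rng.standard_normal((n, n)) / np.sqrt(n)

def reality(x):
    # The "physical" system: same structure as the model, but with a small
    # systematic error (a 1% gain mismatch) standing in for everything
    # the digital twin doesn't capture.
    return np.tanh(1.01 * (W @ x))

def digital_twin(x):
    # The simulation used for training in silico.
    return np.tanh(W @ x)

x_real = x_sim = rng.standard_normal(n)
for k in range(n_passes):
    # Feed each output into the next pass, in reality and in simulation.
    x_real, x_sim = reality(x_real), digital_twin(x_sim)
    gap = np.linalg.norm(x_real - x_sim) / np.linalg.norm(x_real)
    print(f"after pass {k + 1}: relative gap = {gap:.2e}")
```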

In fact, it generally will grow exponentially. And in training deep neural networks, you’re also going over many, many training steps, and over those you’re also experiencing this growth of the error. So basically, even if you have a great digital twin, you’re very quickly not going to be able to predict reality, and this is just basic error propagation. I still think this is an absolutely beautiful idea. I think it’s going to be useful for a wide variety of things, including for designing physical neural networks, but it has this absolutely crucial issue of the simulation-reality gap. So to fix this, we made one tiny tweak to the algorithm. The way this algorithm works is that you have the physical model for the system and use that to perform the computation, and then, to perform backpropagation, you autodifferentiate that simulation model.

And that gives you the gradient that tells you how to update your parameters at each step of training. To fix the simulation-reality gap, all we did is, instead of using the model on the forward pass, we just put the physical system in on the forward pass, and this tiny, seemingly trivial difference actually fixes the entire issue. The reason is that, during the optimization, the loss that you have and the locations where the gradients are evaluated are grounded in reality. And as a result, when we do this, what we call physics-aware training, it works; it works in the experiment. So in silico training, transferred to the experiment, works terribly, and when we did physics-aware training, it works, so we were no longer bawling our eyes out.
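Here is a minimal NumPy sketch of that idea for a single layer, with a placeholder standing in for the real apparatus (the actual implementation in the work is more involved): the forward pass runs through the physical system, and the backward pass reuses the differentiable digital twin, evaluated at what the physical system actually received, to compute the parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
theta = 0.1 * rng.standard_normal(n)          # trainable physical control parameters

def physical_system(x, theta):
    # Placeholder for the real apparatus: the "true" transformation contains
    # effects (here a 5% gain error plus measurement noise) the twin misses.
    return np.tanh(1.05 * theta * x) + 0.01 * rng.standard_normal(n)

def digital_twin(x, theta):
    # Differentiable model of the physical transformation (imperfect on purpose).
    return np.tanh(theta * x)

def twin_grad_theta(x, theta, grad_out):
    # Gradient of the twin with respect to theta, evaluated at the operating
    # point x that the physical system actually saw.
    y = digital_twin(x, theta)
    return grad_out * (1.0 - y ** 2) * x

x = rng.standard_normal(n)                    # example input
target = 0.5 * rng.standard_normal(n)         # desired output

for step in range(200):
    y_phys = physical_system(x, theta)        # forward pass: run the real physics
    grad_out = y_phys - target                # dL/dy for L = 0.5 * ||y - target||^2,
                                              # so the loss is grounded in reality
    grad_theta = twin_grad_theta(x, theta, grad_out)   # backward pass: use the twin
    theta -= 0.05 * grad_theta                # update the physical control parameters

print("final physical loss:", 0.5 * np.sum((physical_system(x, theta) - target) ** 2))
```

The gradients are only approximate, because the twin is imperfect, but the loss and the operating points come from the real system, so the model error doesn’t compound the way it does when the whole forward pass is simulated.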

And this is in some sense not surprising, I think. There’s old work, from the first wave of neuromorphic computing, or the second wave depending on how you want to count, called chip-in-the-loop training, where basically they built neuromorphic systems that executed neural network algorithms, and those things couldn’t be differentiated directly, so they just used a conventional backpropagation algorithm. I think the best way to think about what we’re doing is generalizing this to any kind of physical input-output map, to allow us to use not only hardware that has this exact analogy, as Peter alluded to, but really any nonlinear function that exists in nature. We were also actually inspired by an algorithm called quantization-aware training, which some of you may be aware of. So, okay, as promised, we can make any physical system into a neural network.

You know by now that most of them are not going to be useful, but some of them might be. And the cool thing is that, now that we can train them using backpropagation, we can make all kinds of crazy networks: we can combine digital and physical, we can combine different physical systems together, and using backpropagation, we can train them so they just learn how to work together automatically. We don’t even have to put any engineering into it; they can figure out how to trade off different parts of the problem and solve it together. And to show this, we made PNNs out of three different physical systems: a mechanical system, which over here you see is a speaker connected to a mechanical plate, literally just a piece of metal that is oscillating; a nonlinear analog electronic system; and of course our beloved optical system over here. And in each case, we’re doing handwritten digit image classification.

So this is a simple image classification task, and using PAT we can train them to do this task relatively well. And as I alluded to, there’s a whole bunch of design freedom in how you wire these systems together, and so we chose to combine these systems in a whole bunch of different ways, which I’ll leave to the paper for you to follow. Basically it was just, let’s try a bunch of different things. For those of you sufficiently skilled in the art, you’ll know that the handwritten MNIST digit task is considered to be relatively simple. It’s not as simple for nonlinear physical systems as it is for traditional convolutional neural networks, but we agree it’s a simple benchmark. So what we’ve done recently is we’ve taken an oscillator network very much inspired by the CIM, which is described by a set of equations that look like this.

So there’s nonlinear sinusoidal coupling, and there’s a sinusoidal nonlinearity for each oscillator. And to capture the simulation-reality gap, we have a model of the physical system that has some mismatch with the model that we use for backpropagation. We train this to do the Fashion-MNIST task, which is an image classification task that is much harder. In fact, I struggled to do it. It’s not as hard as other things out there, but it’s much harder than the regular MNIST task. And when we do this, PAT can train the physical system, remember, the simulated physical system, to get pretty good accuracy even when the model mismatch is very bad. We can also train it very well on the original MNIST task, whereas training in simulation fails very spectacularly as you increase the model mismatch.

So what we’ve done is we’ve taken this very broad view of natural computation, and we’ve tried to think about how we can exploit that by adding in these two key ingredients from deep learning: deep, trained features and backpropagation-based training. And we overcame this really troublesome issue of the simulation-reality gap in order to do it. So deep physical neural networks and physics-aware training, those are the two things that we view as our contributions here, and... Hiro, how much time do I have left?

Moderator: You have eight minutes.

All right, so I do have a little bit more that I want to say before we go to questions, maybe to seed some questions. One thing that I want to say is that we started out thinking, hey, we’re going to accelerate machine learning, but actually, when you think about it, physical neural networks can do something that regular electronic digital computers can’t do, which is process data in the physical domain in which it exists. For example, I’m looking at Hiro right now, and it’s not coming to me as ones and zeros on a computer; because we’re in reality, I actually get to see it coming to me as photons. A physical neural network can process that information as photons, or as sound, and so on. So I think, and we think broadly, that maybe where PNNs, especially in the near term, will have their greatest impact is in designing functional physical objects and physical systems that operate on or produce physical data rather than digital data.

And these are things like smart sensors, smart generators, as well as maybe even crazier things like robots. So if you’re so inspired to make your own physical neural network, the way I think you probably want to go about it is to first find your favorite physical system, or some physical system you have a hunch might be able to do complex computations. You could start with the CIM, you could start with a multimode laser, or any of these kinds of systems, if you have sufficient reason for it. Then I recommend using backpropagation in silico to understand the computations that that physical system does. I know there are some issues with this algorithm, but it really allows you to understand where and when that physical system can be useful.

I also think it’s probably good to use that time to design the architecture, like some of these things that I’m showing here, and to pre-train the system before you start to use PAT. And then, finally, you can actually make the device and use PAT to make it work in reality. So that’s all I have to say. Deep physical neural networks, these are deep neural networks where we’ve replaced the mathematics with physics, and physics-aware training, which uses the backpropagation algorithm to train controllable physics to do machine learning tasks, or all of the other functional things we want them to do. (clapping)

Moderator: Any questions?

Attendee: How do things such as neural network verification work on these physical systems? Can you provide any guarantees that what you’re going to get as an output is what you expect, for a given physical system?

Logan Wright: So it works, or it could work, basically the same way as it works for normal neural networks. The difference is that if you don’t control the fabrication of your devices very well, each one is going to work differently. So the overall answer is that it’s not any better than conventional neural networks in that regard, except for one particular thing, which is that normal noise and imperfections in physical systems are very closely analogous to the data augmentation and dropout and other things that people do to improve generalization and to protect against adversarial examples and so on. So, very speculatively, it’s possible that physical systems might be more resilient to the things that you might be worried about people attacking your neural network with. But we’re not even close to thinking about that.

Attendee: Very interesting idea. I’m curious whether a deterministic gradient-descent system is really the final answer. The efforts in combinatorial optimization over the past five to ten years have, I think, told us that simple gradient descent doesn’t actually perform very well for hard problems, when the given problem is rather complicated with hard instances. Maybe introducing directional coupling, which puts the system outside of simple gradient descent, actually improves the performance, or some kind of stochasticity, rather than completely deterministic evolution, also helps sometimes. Have you thought about those sorts of modifications, or about bringing some of the lessons we’ve learned from optimization into the backpropagation process?

Logan Wright: We have, we’ve just started to think about it. Right, our initial goal was: everybody is using stochastic gradient descent to train deep neural networks, so we wanted to get it to work like that first. But we’ve actually been thinking about whether, instead, we could train neural networks that learn how to exploit physics not only for performing the forward pass, but also to assist in the training. Right, because the training is a hard thing, and we have systems like the CIM that can do optimization very well. So, yeah, we’ve been thinking about whether we can train physical systems that learn how to learn. The goal there is that we would start with a neural network that we would train to predict the updates, but eventually we would be able to actually replace that with a physical system.

And yeah, the goal there would be that we’d basically be able to discover ways to use physics, and that would include things like noise, quantum noise, and natural imperfections, as I mentioned, that would help make it more robust. We could find ways of using physical systems to assist in that training. So PAT, we think, is really just the first step. It’s like generation one, but I think that probably before these things are useful, we’ll actually have algorithms like this that use physical systems for the training as well.

Logan Wright

Research Scientist | NTT Research Physics & Informatics Lab

Dr. Logan Wright joined NTT Research in 2018 after receiving his PhD in Applied Physics from Cornell University. At NTT Research, he studies the physics of computation and its application to new computing machines and paradigms.
