12-28-2015, 04:34 AM
I was trying to find a solution (or at least a heuristic) for the Stable Initialization Problem, which is actually relatively minor in Artificial Neural Networks these days. Because autoencoding strategies allow for training deep networks, Stable Initialization has turned into more of a "would be nice to have" than a "blocks all progress" sort of problem.
But one of the things I tried turns out to crack the Catastrophic Forgetting Problem and the Exploding and Vanishing Gradient Problems, and to automate large-scale Adaptive Learning Rates on the basis of purely local information. And those four things are MAJOR problems.
But I'm speaking geek at this point. I'm excited about it, but when I talk about it I rapidly become incomprehensible to nonspecialists. To break it down a bit:
A crack on Exploding Gradients means deep neural networks (which we can sort of train with autoencoders, but autoencoders aren't task-oriented) can be trained directly, using ordinary Backpropagation. Up to now, everything we've had for doing that has required nonlocal information, so compute requirements explode on any non-toy problem. It also means recurrent ANNs (i.e., the kind that do tasks requiring planning, strategy, sequence recognition, long- and short-term memory, etc.) can be trained directly using Backpropagation Through Time - and autoencoding has never been stable for nontrivial temporal patterns. The best we could do up to now with BPTT was output forcing.
A crack on Vanishing Gradients means areas of the fitness space far from any solution, where there's very little evidence to base learning on, if any, can now be navigated. This also applies to recurrent networks, where up to now every strategy we had for vanishing gradients (momentum, nonlocal methods) was either unstable or undefined.
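To make those two gradient problems concrete, here's a toy numerical sketch (plain Python/numpy, purely illustrative, and nothing to do with the new technique): backpropagating through a deep chain multiplies the gradient by one Jacobian per layer, and if those per-layer gains sit consistently above or below 1, the gradient explodes or vanishes exponentially with depth. The same arithmetic applies per time step when a recurrent network is unrolled for Backpropagation Through Time.

import numpy as np

rng = np.random.default_rng(0)

def backprop_norm(depth, scale, width=64):
    """Norm of a gradient pushed back through `depth` random linear layers."""
    grad = rng.standard_normal(width)
    for _ in range(depth):
        # Each layer contributes a Jacobian; here it's just a random matrix
        # whose typical gain is set by `scale` (scale=1.0 roughly preserves norm).
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        grad = W.T @ grad  # chain rule: multiply by the layer's Jacobian (transposed)
    return np.linalg.norm(grad)

for scale in (0.5, 1.0, 1.5):
    print(f"gain {scale}: gradient norm after 50 layers ~ {backprop_norm(50, scale):.3e}")

# gain < 1 -> the gradient vanishes; gain > 1 -> it explodes. In a recurrent
# net unrolled for BPTT, "depth" is the number of time steps, so even a modest
# per-step gain away from 1.0 wipes out or blows up the learning signal.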
Adaptive Learning Rates are closely related to both. The "canonical" way of dealing with exploding gradients is to reduce the learning rate, and the "canonical" way of dealing with vanishing gradients is to increase it - but damnit, you can't do both at once! Doing both at once is what you need in a type of pathological region of the solution space where you have exploding gradients on some weights AND vanishing gradients on others. Up to now, pretty much all the methods we had for dealing with that involved very carefully figuring out how to remap the solution space, and all of those were global second-order methods, meaning they were prohibitively expensive on anything bigger than a toy problem. Most people tried to fake it using momentum, which is not very stable in deep networks, completely unstable in recurrent networks, and has never solved the problem very well anyway.
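For contrast, the nearest standard thing to per-weight adaptive learning rates from purely local information is the running-average trick used by methods like RMSProp. A minimal sketch (standard background only, not the new technique): each weight keeps a decaying average of its own squared gradient and divides its step by the square root of that average, so weights seeing huge gradients take small steps and weights seeing tiny gradients take relatively large ones. It helps in practice, but it's only a rescaling of the same gradient signal, not a cure for the pathology described above.

import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp-style update. `cache` is each weight's running mean of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)  # big recent gradients -> small step, and vice versa
    return w, cache

# Usage: start with cache = np.zeros_like(w) and call once per minibatch,
# feeding the returned cache back in each time.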
And a crack on Catastrophic Forgetting means ANNs already trained on one task can now be trained for additional tasks without forgetting (much of) how to do the first one, and can switch back and forth between them pretty rapidly thereafter, whenever you want - at short or long intervals, even with continued training. And such a network can leverage things it learned for the first task in doing the second, things it learned for the first two tasks in doing the third, and so on. And that? That's more important than any of the others, because it's something nobody has figured out ANY way to do until now. That's been one of the things we need to figure out before we can make 'real' AIs.
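If you've never seen catastrophic forgetting in action, here's a tiny demonstration of the problem itself (the problem, not the fix; the data, network, and hyperparameters here are invented purely for illustration): train a small network on task A, keep training it only on task B with plain gradient descent, then measure task A again. Accuracy on A typically collapses toward chance, because nothing stops the new gradients from overwriting the weights the first task depended on.

import numpy as np

rng = np.random.default_rng(1)

def blobs(center, n=300):
    """Binary toy task: Gaussian blobs at -center (label 0) and +center (label 1)."""
    X = np.vstack([rng.normal(-center, 1.0, (n, 2)),
                   rng.normal(+center, 1.0, (n, 2))])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

class TinyMLP:
    def __init__(self, hidden=32):
        self.W1 = rng.normal(0, 0.1, (2, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 1)); self.b2 = np.zeros(1)

    def forward(self, X):
        self.h = np.maximum(0.0, X @ self.W1 + self.b1)              # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))   # sigmoid output

    def train(self, X, y, epochs=3000, lr=0.1):
        y = y.reshape(-1, 1)
        for _ in range(epochs):
            p = self.forward(X)
            d_logit = (p - y) / len(X)                   # cross-entropy gradient w.r.t. logits
            dW2 = self.h.T @ d_logit; db2 = d_logit.sum(0)
            d_h = (d_logit @ self.W2.T) * (self.h > 0)   # backprop through the ReLU
            dW1 = X.T @ d_h; db1 = d_h.sum(0)
            self.W1 -= lr * dW1; self.b1 -= lr * db1
            self.W2 -= lr * dW2; self.b2 -= lr * db2

    def accuracy(self, X, y):
        return float(np.mean((self.forward(X)[:, 0] > 0.5) == y))

# Task A separates along one diagonal, task B along the other.
XA, yA = blobs(np.array([2.0, 2.0]))
XB, yB = blobs(np.array([-2.0, 2.0]))

net = TinyMLP()
net.train(XA, yA)
print("after task A:  acc(A) =", net.accuracy(XA, yA))
net.train(XB, yB)   # plain continued training on task B only
print("after task B:  acc(A) =", net.accuracy(XA, yA), "  acc(B) =", net.accuracy(XB, yB))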
So - General Artificial Neural Networks that remember many skills and facts, and that are arbitrarily deep and/or recurrent, are now possible. That means much better ways to solve three problems than any we had, at least A Way to solve a problem we've never been able to solve before, and a whole new realm of possibilities for recurrent networks and subsymbolic temporal reasoning.
This doesn't crack consciousness as such yet, but these are ALL things needed for artificial consciousness. We probably need a lot more, but now that we've got these, we can maybe get enough results to figure out where the next problem lies.
And all that means I've got at least five major papers to write. So the bad news is that work is going to eat my brain for at least the next three months, and after that, I dunno - conference schedules, etc. I got the initial successful results, w00t! But now I need to design carefully controlled experiments using standard data sets; fire up the Excessive Machine and leave it running for weeks doing all the iterated network training needed to gather statistics so I can write up comparisons; pin down comparable controls so I can demonstrate how the behavior with the new technique differs; do the writeups and research statements; search for precedents; prepare bibliographies; read the prior research and relate the current findings to it; annotate the code; start writing the papers; submit them to journals; deal with conferences; etc., etc., etc. It's ten percent inspiration, but even in research, the rest is still a whole lot of work and a whole lot of explanation.
And, um, utility bills. If I leave the Excessive Machine running for a month, I'm definitely going to see it on my electric bill. PG&E is going to send me a little note saying I probably have a short somewhere, asking which of my breakers is tripping and why I haven't repaired it. On the plus side, my gas bill for heat is going to drop by nearly the same amount.
I can see why it works for these major problems now, after the fact and after spending some days on hardcore analysis. I just never would have realized I ought to try it had those been the problems I was actually trying to solve. I wouldn't even have stated the problems at the level of actually doing the network training, in a way that admitted this as a solution.
So General Artificial Intelligences just got about four steps closer to being a reality. And I'm going to be effectively disappearing for a while.