The Orion's Arm Universe Project Forums





w00t! I solved a bunch of major ANN problems!
#1
I was trying to find a solution (or at least a heuristic) for the Stable Initialization Problem, which is actually relatively minor in Artificial Neural Networks these days. Because autoencoding strategies allow for training deep networks, Stable Initialization has turned into more of a "would be nice to have" than a "blocks all progress" sort of problem.

But one of the things I tried turns out to crack the Catastrophic Forgetting Problem, the Exploding and Vanishing Gradient Problems, and automates large-scale Adaptive Learning Rates on the basis of purely local information. And those four things are MAJOR problems.

But I'm speaking geek at this point. I'm excited about it, but when I talk about it I rapidly get incomprehensible to nonspecialists. To break it down a bit:

A crack on Exploding Gradients means deep neural networks (which we can sort of train with autoencoders, but autoencoders aren't task-oriented) can be trained directly, using ordinary Backpropagation. Up to now, everything we've had for doing that has required nonlocal information, so compute requirements explode on any non-toy problem. It also means recurrent ANNs (i.e., the kind that do tasks requiring planning, strategy, sequence recognition, long- and short-term memory, etc.) can be trained directly using Backpropagation Through Time - and autoencoding has never been stable for nontrivial temporal patterns. The best we could do up to now with BPTT was output forcing.

A crack on Vanishing Gradients means areas of the fitness space far from any solution - where there's very little evidence to base learning on, if any - can now be navigated. And this is also applicable to recurrent networks, where up to now every strategy we had for vanishing gradients (momentum, nonlocal methods) was unstable or undefined.

Adaptive Learning Rates are closely related to both. The "canonical" way of dealing with exploding gradients is to reduce the learning rate, and the "canonical" way of dealing with vanishing gradients is to increase it - but dammit, you can't do both at once! Doing both at once is what you need to deal with a type of pathological region in the solution space where you have exploding gradients in some weights AND vanishing gradients in others. Up to now, pretty much all the methods we had for dealing with that involved very carefully figuring out how to remap the solution space, and all of those were global second-order methods, meaning they were prohibitively expensive on anything bigger than a toy problem. Most people attempted to fake it using Momentum, which is not very stable in deep networks, completely unstable for recurrent networks, and has never solved the problem very well anyway.
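To make the "can't do both at once" point concrete, here's a toy sketch - a made-up two-weight quadratic, nothing from the actual experiments - showing that no single global learning rate serves a very steep direction and a nearly flat one at the same time:

Code:
import numpy as np

# Toy loss with one very steep direction and one nearly flat one:
#   L(w) = 0.5 * (1000 * w1^2 + 0.0001 * w2^2)
curv = np.array([1000.0, 0.0001])    # curvatures along w1 and w2

def grad(w):
    return curv * w                  # gradient of the quadratic above

for eta in (0.01, 0.001):            # one "big" and one "small" global rate
    w = np.array([1.0, 1.0])
    for _ in range(100):
        w = w - eta * grad(w)
    print(f"eta={eta}: w1={w[0]:.3g}, w2={w[1]:.6g}")

# eta=0.01:  w1 explodes (0.01 * 1000 > 2, so every step overshoots further)
# eta=0.001: w1 is fine, but w2 has barely moved off 1.0 (vanishing progress)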

And a crack on Catastrophic Forgetting means ANNs already trained on one task can now be trained for additional tasks without forgetting (much) about how to do the first one, and can switch back and forth between them pretty rapidly thereafter, whenever you want - whether at short or long intervals, even with continued training. And they can leverage things learned for the first task in doing the second - and things learned for the first two tasks in doing the third - and so on. And that? That's more important than any of the others, because it's something that nobody has figured out ANY way to do until now. That's been one of the things we need to figure out before we can make 'real' AIs.

So - General Artificial Neural Networks that remember many skills and facts, and are arbitrarily deep and/or recurrent, are now possible. These are much better ways to solve three problems than any we had, at least A Way to solve a problem we've never been able to solve before, and a whole new realm of possibilities for recurrent networks and subsymbolic temporal reasoning.

This doesn't crack consciousness as such yet, but these are ALL things needed for artificial consciousness. We probably need a lot more, but now that we've got these, we can maybe get enough results to figure out where the next problem after this is.

And all that means I've got at least five major papers I've got to write. So the bad news is that work is going to eat my brain for at least the next 3 months and after that, I dunno - conference schedules, etc. I got the initial successful results, w00t! But now I need to design carefully controlled experiments using standard data sets, fire up the Excessive Machine and leave it running for weeks to do all the iterated network training to gather statistics so I can write up comparisons, pin down comparable controls so I can demonstrate how the behavior with the new technique differs, do the writeups and research statements, search for precedents, prepare bibliographies, read the prior research, relate the current finding to it, annotate the code, start writing the papers, submit them to journals, deal with conferences and etc, etc, etc. It's ten percent inspiration, but even in research, it's still a whole lot of work and a whole lot of explanation.

And, um, utility bills. If I leave the Excessive Machine running for a month I'm definitely going to see it on my electric bill. PG&E is going to send me a little note that says I probably have a short somewhere and which of my breakers is tripping and why haven't I repaired it? On the plus side, my gas bill for heat is going to drop by nearly the same amount. Dodgy

I can see why it works for these major problems now, after the fact and after spending some days doing hardcore analysis. I just never would have realized I ought to try it had those been the problems I was actually trying to solve. I wouldn't even have stated the problems, on the level of actually doing the network training, in a way that admitted this as a solution.

So General Artificial Intelligences just got about four steps closer to being reality. And I'm going to be effectively disappearing for a while.
Reply
#2
(12-28-2015, 04:34 AM)Bear Wrote: I was trying to find a solution (or at least a heuristic) for the Stable Initialization Problem, which is actually relatively minor in Artificial Neural Networks these days. Because autoencoding strategies allow for training deep networks, Stable Initialization has turned into more of a "would be nice to have" than a "blocks all progress" sort of problem.

[snip]

So General Artificial Intelligences just got about four steps closer to being reality. And I'm going to be effectively disappearing for a while.

Congrats! Looking forward to whatever stories you can share while you are on the journey, and even more to what you might share on your return.
Stephen
Reply
#3
Does this mean we'll be getting our robot girlfriends soon?
Evidence separates truth from fiction.
Reply
#4
even a robot built to love you could not love you, my friend

joking xD seriously though, well done Bear! Big Grin
James Rogers, Professional Idiot
Reply
#5
(12-28-2015, 04:34 AM)Bear Wrote: I was trying to find a solution (or at least a heuristic) for the Stable Initialization Problem, which is actually relatively minor in Artificial Neural Networks these days. Because autoencoding strategies allow for training deep networks, Stable Initialization has turned into more of a "would be nice to have" than a "blocks all progress" sort of problem.

But one of the things I tried turns out to crack the Catastrophic Forgetting Problem, the Exploding and Vanishing Gradient Problems, and automates large-scale Adaptive Learning Rates on the basis of purely local information. And those four things are MAJOR problems.

I know some people who are interested in general artificial intelligences; is there anything you can share about the nature of your insight, publicly or privately?
Thank you for your time,
--
DataPacRat
"Does aₘᵢₙ=2c²/Θ ? I don't know, but wouldn't it be fascinating if it were?"

Reply
#6
Can't say *too* much about the secret sauce before I have papers at journals, but I can give a quick outline.

The problem was that in a conventional neural network architecture, all the connections have the *SAME* learning rate. If the learning rates are *DIFFERENT* for different connections, then where some rates are unstable, ordinary backprop will teach the network as a whole to use nodes fed by connections with lower learning rates, which are still stable - and in so doing it downregulates the error attributable to the unstable connections, meaning they get less training and become stable as a side effect. Conversely, if all connections are stable, backprop rapidly drives the later nodes to use the highest-learning-rate connections.
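To give a rough picture of the shape of the idea (this is an illustrative NumPy sketch, not the actual research code - the architecture, sizes, and rate values here are made up): each connection gets its own fixed learning rate, and the ordinary backprop update is just scaled elementwise by that rate matrix.

Code:
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer net, tanh activation, squared-error loss.
n_in, n_hid, n_out = 4, 16, 1
W1 = rng.normal(0.0, 0.5, (n_in, n_hid))
W2 = rng.normal(0.0, 0.5, (n_hid, n_out))

# Per-connection learning rates: every individual weight gets its own fixed
# rate, drawn here from a short geometric ladder (a cut-down version of the
# eight-rate spread described below).
rates = 0.1 * 0.25 ** np.arange(4)
LR1 = rng.choice(rates, size=W1.shape)
LR2 = rng.choice(rates, size=W2.shape)

def train_step(x, y):
    """One ordinary backprop step; the only twist is that the weight update
    is scaled elementwise by each connection's own learning rate."""
    global W1, W2
    h = np.tanh(x @ W1)                  # forward pass
    out = h @ W2
    err = out - y                        # dLoss/dout for squared error
    dW2 = np.outer(h, err)               # standard backprop gradients
    dh = (err @ W2.T) * (1.0 - h ** 2)
    dW1 = np.outer(x, dh)
    W2 -= LR2 * dW2                      # elementwise: per-connection rates
    W1 -= LR1 * dW1
    return 0.5 * float(err @ err)

# e.g. fit y = sum(x) on random inputs
for _ in range(2000):
    x = rng.normal(size=n_in)
    train_step(x, np.array([x.sum()]))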

The *effective* learning rate of both weights combined - given weights with learning rates a "reasonable" factor apart from each other - winds up balanced at just about the maximum it can go to while remaining stable, provided that maximum is higher than the lowest available rate and lower than the highest available rate. And you can cover a huge dynamic range with an exponential distribution by having about eight different learning rates, each of which is about a quarter of the one before it.
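For scale (these exact numbers are illustrative, not the ones from my runs), eight rates a factor of four apart cover a dynamic range of 4^7 = 16384 to 1:

Code:
import numpy as np

eta_max = 0.5                            # top of the ladder: an arbitrary choice
rates = eta_max * 0.25 ** np.arange(8)   # eight rates, each 1/4 of the one before
print(rates)                             # 0.5, 0.125, 0.031, ..., ~3.05e-05
# ratio between the fastest and slowest rate: 4**7 = 16384

# assign one rate, at random, to every connection in a weight matrix
LR = np.random.default_rng(0).choice(rates, size=(64, 64))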

A solution to Catastrophic Forgetting falls out because the first task is encoded in low-rate connections. It doesn't get forgotten, because whatever it *doesn't* have in common with the second task rapidly gets the later connections that depend on it downregulated, until training on it just about isn't happening at all. Conversely, whatever it *does* have in common with the second task continues to get trained, and is also useful - in combination with the training preserved in the very low-rate connections - when switching back to the first task. If you switch between tasks occasionally (once a day or once a week), then as far as the lowest-rate connections are concerned they're getting trained on whatever is common to both; where nothing is common, they effectively don't get trained during the times when they're "inapplicable" to the current task.
Reply
#7
Congratulations Smile When you publish be sure to let us know
OA Wish list:
  1. DNI
  2. Internal medical system
  3. A dormbot, because domestic chores suck!
Reply
#8
Academic publishing is kind of a new thing for me. Up to now I've only gotten patents, and they were Work For Hire so I don't get any of those royalties.
Reply
#9
So, after actually doing the experiment rigorously, I found two really important, completely unexpected results which I hadn't even noticed in the first, seat-of-the-pants empirical testing.

Most importantly: the networks with a mix of learning rates are far less likely to converge on local peaks in the fitness landscape, and far more likely to find a global solution.

Artificial Neural Networks are kind of like a blind mountain climber; they can tell which way is "uphill" where they are, and they can proceed until they can't find "uphill" in any direction, but they don't know whether they've climbed the tallest mountain in the area; only that they're at the top of something.

When the very slow-learning weights are considered as constants, a fitness landscape is revealed in which there are certain directions you can't move, but you can find a highest point in the directions you can move - at least a local peak and possibly a global peak. A global peak is defined as the behavior of the network matching the ideal - which may be achievable with many different combinations of weights, in which case you have many global peaks all at the same height.

If it's a global peak, then movement of the slow-learning weights won't reveal anything you haven't already achieved. But if it turns out that it's only a local peak, then movement of the slow-moving weights gradually reveals a whole new fitness landscape - in which the altitude you've already reached is never reduced, and the fast-moving weights are likely to discover a new path for rapid upward climbing.

Rinse, repeat. In order to truly be a local fitness peak, a given achievable behavior must be a local peak in all of a long succession of different fitness landscapes. That's different from simply being the first peak found in a fitness landscape of higher dimensionality, because in the process of successively converging on newly revealed "local peaks" the fast-moving weights do a heck of a lot of exploring of the solution space that otherwise wouldn't happen - and in a highly nonlinear space, that truly matters.
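Here's a toy picture of that alternation - not the real algorithm or the real networks, just a made-up two-variable surface with one fast-moving and one slow-moving weight. It doesn't prove anything about escaping local peaks; it only shows the timescale separation, with the fast weight re-equilibrating each time the slow weight drifts and changes the landscape it sees:

Code:
import numpy as np

# Made-up loss surface: x is a "fast" weight, y is a "slow" one (1/64 the rate).
# For any frozen y, x has a clear optimum at sin(3y); as y creeps, that optimum
# moves, so x keeps re-converging on a slowly shifting landscape.
def loss(x, y):
    return (x - np.sin(3.0 * y)) ** 2 + 0.1 * y ** 2

def grad(x, y):
    dx = 2.0 * (x - np.sin(3.0 * y))
    dy = -6.0 * np.cos(3.0 * y) * (x - np.sin(3.0 * y)) + 0.2 * y
    return dx, dy

x, y = 2.0, 2.0
eta_fast, eta_slow = 0.1, 0.1 / 64.0
for _ in range(20000):
    dx, dy = grad(x, y)
    x -= eta_fast * dx
    y -= eta_slow * dy

print(x, y, loss(x, y))   # x stays glued to sin(3y) while y creeps toward 0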

Second in importance: they do it without needing conjugate-gradient methods or Levenberg-Marquardt, which are really awesome methods but intractable on very large networks. Conjugate-gradient needs second-order information that scales with the square of the number of nodes in the network, and Levenberg-Marquardt is even worse. They give you networks of awesome efficiency for their size, but... they limit the size, and therefore the complexity of problems you can attack. Badly.

So... I'm a bit boggled by this whole thing. I'm more than a bit boggled by my experimental results. We can use smaller networks now and still find good optima, which speeds the whole thing up and, once the configurations are discovered, drastically reduces the amount of compute power required by the resulting networks. Smile Of course, we're also that much more likely to overfit our training data that much more precisely. Dodgy And we can use bigger networks and get benefits like those of the second-order methods, without paying impossible prices in compute time for them. Angel

And I'm going heads-down again, finding cites and preparing graphics and looking at the conference schedules. And also going into mad-scientist mode and having a crack at consciousness.... Not with an expectation of success, but with the knowledge that I just kicked a few of the long-standing obstacles out of the way. Now we can get past them and see what other obstacles there are....
Reply

