By the way... Here are some "freebies" for everyone who wants to work with deep neural network architectures. I'm not getting into my mad-science stuff here; I'm just letting people know that there's been a complete revolution in what we can do with neural networks in the last ten years, and that you can get most of the benefits of it with three simple changes to your code.

First, DON'T USE THE SIGMOID CURVE 1/(1+exp(-x)) as an activation function! It can't be used successfully to train anything more than 2 hidden layers deep. In fact, don't use ANY activation function with exponentially decaying tails, which includes all the "popular" sigmoids from a dozen years ago, arctan and tanh among them: with G-B initialization they can train a network 3 hidden layers deep, and sometimes even 4, but they aren't reliable at 4 and aren't usable for anything deeper.

Instead, use an activation function with subexponential tail convergence, such as x/(1 + abs(x)), and with G-B initialization (and, well, a lot of computer time) you can successfully train a network 6 or 8 hidden layers deep, just using standard backprop.
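A minimal sketch (plain Python; the function names are mine, not from the post) of what subexponential tails buy you: far from zero, the softsign's derivative is still large enough to push a gradient through, while the logistic sigmoid's has all but vanished.

```python
import math

def softsign(x):
    """Sigmoid-shaped activation with polynomially decaying tails."""
    return x / (1.0 + abs(x))

def logistic(x):
    """The classic sigmoid -- its tails decay exponentially."""
    return 1.0 / (1.0 + math.exp(-x))

# Surviving gradient at x = 10, where a saturating unit would sit:
d_logistic = logistic(10) * (1 - logistic(10))   # exponentially tiny
d_softsign = 1.0 / (1.0 + abs(10)) ** 2          # still usable
```

At x = 10 the softsign derivative is over a hundred times larger than the logistic one, which is the whole point: lower layers stop going numb before training finishes.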

Initialization, and a sigmoid with subexponential tail convergence, matter more than we realized until just about six years ago: if you don't do the initialization exactly right, or you use the wrong sigmoid, the nodes in the lower layers saturate before the network ever converges, so it usually settles into one of the crappiest local minima in your problem space.

Careful initialization on most networks means using the Glorot-Bengio scheme: connections between two layers of n nodes and m nodes should be initialized to random values between plus and minus sqrt(6)/sqrt(n+m).
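That scheme is a one-liner in practice. Here's a sketch (plain Python, names mine) of drawing a weight matrix this way:

```python
import random, math

def glorot_uniform(n, m, seed=None):
    """Glorot-Bengio uniform initialization for the weight matrix
    between a layer of n nodes and a layer of m nodes: every weight
    is drawn uniformly from [-limit, +limit]."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0) / math.sqrt(n + m)
    return [[rng.uniform(-limit, limit) for _ in range(m)]
            for _ in range(n)]
```

The limit shrinks as the layers get wider, which keeps the variance of each node's summed input roughly constant from layer to layer.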

But if the networks are wide as well as deep, then even Glorot-Bengio initialization is not your friend: by the law of large numbers, the inputs get so mixed and averaged out through all the layers that there's essentially no gradient left to work with at the higher layers. There's a modification that means calculating the initialization as if one of the layers had fewer nodes, and then hooking up every node in the other layer to just that many randomly selected nodes of that layer. This can leave most of your connections with zero weights, and that's okay.
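A sketch of that sparse variant, under my reading of the description (the `fan` parameter and its default are my own illustrative choices, not from the post):

```python
import random, math

def sparse_glorot(n, m, fan=15, seed=None):
    """Initialize the n-by-m weight matrix as though the n-node layer
    had only `fan` nodes, then wire each of the m nodes on the other
    side to just `fan` randomly chosen nodes; all other weights stay
    exactly zero."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0) / math.sqrt(fan + m)
    W = [[0.0] * m for _ in range(n)]
    for j in range(m):                      # for each node in the m-layer
        for i in rng.sample(range(n), fan): # pick `fan` partners at random
            W[i][j] = rng.uniform(-limit, limit)
    return W
```

Because each receiving node sums over only `fan` nonzero inputs instead of all n, its input doesn't average out toward zero no matter how wide the layer gets.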

Hmm, what else? Dropout training allows you to build complex models with good generalization, and also avoids overfitting the training data. Simply put: every time you present a training example, randomly select half the nodes in the network and pretend they don't exist. Double the outputs of the ones you are using, run the example, and then do backpropagation on just the nodes you actually used. The results are dramatic.
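The forward half of that recipe can be sketched in a few lines (plain Python, names mine; `keep=0.5` is the "drop half the nodes" case from the post):

```python
import random

def dropout_mask(n_nodes, keep=0.5, seed=None):
    """Per training example: keep each node with probability `keep`
    and scale survivors by 1/keep (2x when keep=0.5), so the expected
    total input to the next layer is unchanged."""
    rng = random.Random(seed)
    return [(1.0 / keep) if rng.random() < keep else 0.0
            for _ in range(n_nodes)]

def apply_dropout(activations, mask):
    # Dropped nodes output zero, so they contribute nothing forward
    # and receive no gradient on the backward pass.
    return [a * m for a, m in zip(activations, mask)]
```

Each example effectively trains a different random sub-network, and at test time the full network behaves like an average over that huge ensemble, which is where the generalization win comes from.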

There are a lot of tougher and more complicated tricks, but just this much - G-B initialization, a sigmoid having subexponential tails, and dropout training - allows ordinary backprop training to reach a hell of a lot deeper, and generalization to work a hell of a lot better, than we ever thought it could a dozen years ago.
