Can we de-parameterize Neural Networks?
by The Bayesian Observer
During the last 3 months, I have been constantly surprised and impressed by the recent achievements of neural networks. In particular, convolutional neural networks (see the work of Yann LeCun) are currently the only technique that allows real-time object recognition from video input with reasonable performance. Another piece of neural network magic that has delighted me recently is autoencoders. In particular, a 2006 paper in the journal Science by Ruslan Salakhutdinov and Geoffrey Hinton shows how multi-layer autoencoders (also called ‘deep autoencoders’) can be used to learn non-linear relationships. Yet another cool use of neural networks that I recently came across is paraphrase detection on long sentences using recursive autoencoders. The problem here is to decide, given two sentences, whether or not they are the same (or close enough) in meaning. The paper is by Richard Socher, whose talk at NYU is where I first heard about this work.
However, I have been disappointed by the fact that there is no systematic way of determining the right number of layers, or the size of each hidden layer, that one should use in a neural network. Intuitively, there seems to be a relationship between the number of layers needed and the ‘amount of non-linearity’ in the problem. For example, suppose one is required to find the classification boundary in N-dimensional space between points belonging to two classes. If this boundary is ‘highly non-linear’, meaning that it is very contorted, then a deeper network is probably a good idea, because each additional layer gives the network greater capacity for learning a non-linear function: the non-linear units in the hidden layers act as building blocks for the non-linear surface. The issue is the standard one of model complexity versus data complexity. A more complex model (e.g. a deeper neural network) is better able to capture complex data, provided the data really is complex to begin with. If, however, the data is not very complex, then an overly complex model can perform poorly when there isn’t sufficient data.
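To make the ‘building blocks’ intuition concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the weights are picked by hand rather than learned, and the names `forward`, `xor_net` and `xor_outputs` are made up for this example. It uses XOR, a boundary that no purely linear model can represent, and shows that two ReLU units in a single hidden layer are enough to build it.

```python
def relu(z):
    """Rectified linear unit: the non-linear building block."""
    return max(0.0, z)


def forward(x, layers):
    """Forward pass through a small fully connected network.

    layers is a list of (weights, biases) pairs, where weights[j] holds
    the incoming weights of unit j. ReLU is applied after every layer
    except the last, which is left linear.
    """
    a = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, a)) + b
             for row, b in zip(weights, biases)]
        is_last = (i == len(layers) - 1)
        a = z if is_last else [relu(v) for v in z]
    return a


# Hand-picked weights (an assumption for illustration, not a trained model):
#   h1 = relu(x1 + x2)
#   h2 = relu(x1 + x2 - 1)
#   y  = h1 - 2 * h2
xor_net = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer, 2 ReLU units
    ([[1.0, -2.0]], [0.0]),                   # linear output unit
]


def xor_outputs():
    """Evaluate the network on the four binary input pairs."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return [forward(x, xor_net)[0] for x in inputs]


if __name__ == "__main__":
    print(xor_outputs())
```

Remove the hidden layer and the network collapses to a single linear function of the inputs, which cannot separate the XOR classes. How many such units a given dataset actually needs is exactly the question this post is asking.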
In a way, the fact that one needs to have some idea of what neural network architecture to use is a bit like the problem with parametric Bayesian methods. The recent development of Bayesian non-parametrics (see, for example, the work of Michael Jordan, Yee Whye Teh, David Blei and Zoubin Ghahramani) is an effort to fix that problem. Just as parametric Bayesian methods require the specification of a parametric family of distributions, thereby limiting the complexity that the model can capture, neural networks require the specification of an architecture, which, if not well-tuned to the data at hand, can underfit or overfit the data. The parameters in the case of neural networks are the number, size and type of layers in the network.
It appears to me that a similar effort is needed in the domain of neural networks: a systematic approach to estimating how much ‘non-linearity’ or complexity is present in the data, so as to arrive at a good neural network architecture for the problem at hand. At present, practitioners seem to mostly employ trial-and-error to guide the selection of an appropriate architecture. I, for one, would start using neural networks much more frequently in my research, were there a method to estimate a good architecture.
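For reference, the trial-and-error procedure mentioned above usually amounts to something like the following sketch: enumerate candidate hidden-layer configurations, score each on held-out data, and keep the best. Everything named here is hypothetical. In particular, `toy_validation_error` is a stand-in that does no training at all; in practice it would be replaced by a function that trains a network with the given hidden-layer sizes and returns its error on a validation set.

```python
from itertools import product


def candidate_architectures(depths, widths):
    """Yield tuples of hidden-layer sizes, e.g. (8, 8) for two layers of 8."""
    for depth in depths:
        for sizes in product(widths, repeat=depth):
            yield sizes


def select_architecture(candidates, validation_error):
    """Return (architecture, error) for the lowest validation error seen."""
    best, best_err = None, float("inf")
    for arch in candidates:
        err = validation_error(arch)
        if err < best_err:
            best, best_err = arch, err
    return best, best_err


def toy_validation_error(arch):
    """Stand-in score, NOT a real training run.

    Pretends the data 'wants' about 8 hidden units in total and adds a
    small penalty per layer.
    """
    return abs(sum(arch) - 8) + 0.1 * len(arch)


if __name__ == "__main__":
    candidates = candidate_architectures(depths=[1, 2], widths=[2, 4, 8])
    print(select_architecture(candidates, toy_validation_error))
```

The number of candidates grows exponentially with depth (here, the number of widths raised to the number of layers), which is part of why this kind of search is unsatisfying and why a principled estimate of the required complexity would be so welcome.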