Why neural networks can learn anything
In the last few years, the world has been turned upside down by the advent of large language models, or LLMs, because you can now talk to a computer in English and get reasonable, often informative responses. However, this is not much of a surprise to anyone who has worked with neural networks over the past decades; it was only a matter of time before it happened. In the following article, we will see why neural networks are self-programming computers that can learn any computation just by looking at input-output pairs, lots of them.
How a neural network works
Before we can ask how much of the thinking may be done in a neural network, we need to look at how it works from a computational-model point of view. The idea is to build a computer that can program itself by simply looking at the input-output pairs of the program it is supposed to run. Why would we want to do that? Suppose we have the problem of reading handwritten text. We know there has to be some program that can read handwritten text, because we can do it ourselves, and the Church-Turing hypothesis is that any computation can be done on a Turing machine. But we do not know how to write this program. We do, however, have samples of text in both handwritten and ASCII form; so we would like our machine to program itself by simply consuming the examples we have.
It is difficult to write such a program in Turing's model of computation, as a machine that changes its state; it is easier with the circuit model. Any computation can be written as a boolean circuit using only NAND gates, or using AND and NOT gates, and we don't mind having even more gate types at our disposal. So, if there is a program that can read handwritten text, there is a boolean circuit that can do it too. Let us say the handwritten text is an image in two-color black-and-white form, with one bit per pixel. Each of these pixels is an input wire into the circuit. The output can be a series of bits representing the text in binary. Now, there has to be some circuit that performs this computation; we have to learn that circuit. So we need to build a machine with programmable gates and programmable connections between gates, very similar to an FPGA, a field-programmable gate array. However, we only consider layered circuits, i.e. the gates are arranged in layers and the inputs of a layer can only come from the layer below it. Any circuit can be rewritten in this form at the cost of more gates than the original unrestricted circuit, and the layered form is easier to make programmable. We create a programmable layered circuit so that there is a possible link from every gate in a layer to the input of every gate in the layer above it. These links can be turned on by programming. We must also allow the gates themselves to be programmed to take on any truth table. With these two types of programmability, we can produce any layered circuit.
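To make this concrete, here is a minimal sketch (my own illustration, with made-up names, not the only way to do it) of such a programmable layered circuit: each gate reads two wires from the layer below, chosen by programmable indices, and applies a programmable four-entry truth table.

```python
from typing import Dict, List, Tuple

# A gate is (index of first input wire, index of second input wire, truth table).
# The truth table maps a pair of input bits to an output bit and is fully programmable.
Gate = Tuple[int, int, Dict[Tuple[int, int], int]]

def run_layered_circuit(layers: List[List[Gate]], inputs: List[int]) -> List[int]:
    wires = inputs
    for layer in layers:
        # Every gate may read any two wires of the layer below it.
        wires = [table[(wires[i], wires[j])] for i, j, table in layer]
    return wires

# Programming example: one layer that computes AND and OR of the two input bits.
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
print(run_layered_circuit([[(0, 1, AND), (0, 1, OR)]], [1, 0]))  # -> [0, 1]
```

Learning here would mean choosing the input indices and truth-table entries for every gate, which is exactly the discrete programming problem we now relax.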
But automatically programming this discrete programmable circuit is hard. On the other hand, a continuous function can be optimized using gradient descent. So we relax our gates to be continuous, where the wires can take any value between 0 and 1. Given two inputs, we can build all the gates using the following formula —
\[ g(x,y) = f(ax+by+c) \]
where the function \( f(z) \) is defined to be 0 for \( z < 0 \) and 1 for \( z \ge 0 \). The following table shows how to choose the constants \( a,b,c \) to get different boolean gates.
Gate | a | b | c |
---|---|---|---|
AND | 1 | 1 | -2 |
OR | 1 | 1 | -1 |
NAND | -1 | -1 | 1.5 |
NOT | -1 | 0 | 0.5 |
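As a quick sanity check, here is a short Python sketch (my own) that evaluates \( g(x,y) = f(ax+by+c) \) with the step function \( f \) and the constants from the table, and prints each gate's truth table.

```python
# f is the step function from the text: 0 for z < 0 and 1 for z >= 0.
def f(z: float) -> int:
    return 1 if z >= 0 else 0

def g(a: float, b: float, c: float, x: int, y: int) -> int:
    return f(a * x + b * y + c)

# Constants taken from the table above.
gates = {"AND": (1, 1, -2), "OR": (1, 1, -1), "NAND": (-1, -1, 1.5), "NOT": (-1, 0, 0.5)}
for name, (a, b, c) in gates.items():
    print(name, {(x, y): g(a, b, c, x, y) for x in (0, 1) for y in (0, 1)})
```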
Note that in the case of the NOT gate, we simply ignore the second input, because a NOT gate has only one input. Do check that the truth tables match for each gate (the snippet above does exactly this). However, even this construct cannot be used for gradient descent, since \( f \) is not continuous. So we use a continuous relaxation of \( f(z) \), namely \( \frac{e^z}{1+e^z} \), called the sigmoid function. In the case of a layered circuit with programmable connections, we define a gate by considering all the possible inputs. Let \( \mathbf{x}_i \) be the vector of all the wires in layer \( i \). Then we can write –
\[ g(\mathbf{x}) = f(\mathbf{a} \cdot \mathbf{x} + c) \]
where \( \mathbf{a} \) is the vector of parameters and \( c \) is the parameter called the bias. If we allow \( \mathbf{x} \) to always have a 1 as an extra dimension, we can make \( c \) a part of \( \mathbf{a} \). Then we can write —
\[ g(\mathbf{x}) = f(\mathbf{a} \cdot \mathbf{x}) \]
If we consider all the gates, we can have a vector \( \mathbf{a}_j \) for the \( j^{\text{th}} \) gate. Altogether, we can write —
\[ \mathbf{x}_{i+1} = f(\mathbf{A}_i\mathbf{x}_i) \]
where \( \mathbf{A}_i \) is the matrix whose \( j^{\text{th}} \) row is \( \mathbf{a}_j \), and \( f \) is applied elementwise. We can now determine the matrices \( \mathbf{A}_i \) through gradient descent. The idea described above is exactly what a feed-forward multi-layer perceptron (MLP) is. Now we know why a neural network can, in principle, work for every problem — the only things we need to take care of are having a sufficiently large network and being able to reach the global minimum during gradient descent.
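Putting the pieces together, here is a minimal sketch (layer sizes and random weights are placeholders of my own choosing) of the forward pass \( \mathbf{x}_{i+1} = f(\mathbf{A}_i\mathbf{x}_i) \), with the sigmoid as \( f \) and a constant 1 appended to each layer's input so that the bias lives inside the matrices.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def forward(matrices: list, x: np.ndarray) -> np.ndarray:
    for A in matrices:
        x = np.append(x, 1.0)   # the extra 1 that absorbs the bias c
        x = sigmoid(A @ x)      # one layer of "soft" gates
    return x

rng = np.random.default_rng(0)
# 2 inputs -> 4 hidden gates -> 1 output gate (arbitrary example sizes).
layers = [rng.normal(size=(4, 3)), rng.normal(size=(1, 5))]
print(forward(layers, np.array([0.0, 1.0])))
```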
Note that from this point of view, it does not matter much what your base model is. As long as it can (1) model all computation under some computational bound, and (2) be trained, i.e. its deviation from the data can be minimized with an efficient algorithm, it can be made to do exactly the same thing. The two important problems that stop us from making an essentially perfect machine both come down to the difficulty of training. Firstly, the optimization problem is NP-hard, so we take heuristic routes that only nearly optimize it. Secondly, we must have enough data to specify our target function in sufficient detail; too little data simply does not contain the information needed to learn the function effectively. Therefore, different activation functions, connection topologies, and other modifications such as attention networks have been employed effectively to solve many learning problems. The important thing is whether the model can be trained efficiently, and all these modifications are geared towards making training more efficient in terms of both computation and data.
Are there other models?
Yes, lots of them. In fact, the transformer models we use in large language models are a prime example. Attention heads are a different model from what we described above, although LLMs also contain MLPs like the ones described above. The point of using other architectures is to map specific types of computation (like next-word prediction, artificial vision, fingerprint recognition, etc.) with a smaller network than a multi-layer perceptron of equivalent capability would need. A smaller network is much easier to train — it requires much less memory, it avoids a lot of problems during training, and it generalizes better instead of learning the minor fluctuations or errors in the training data, because it has far fewer parameters. I will list a few more notable examples –
- Higher order neural network: Here the layers apply a linear transformation to the tensor product of the input with itself. So we have a matrix \( W \) of size \( n \times (n+1)^2 \) such that \( \mathbf{x}_l = W (\{\mathbf{x}_{l-1}|1\} \otimes \{\mathbf{x}_{l-1}|1\}) \), where \( | \) is concatenation and \( l \) is the layer index. When stacked in many layers, this builds a polynomial of the input. We can even use higher-order tensor products. The problem is how big the matrix becomes (a small sketch of one such layer follows after this list).
- Convolutional Neural Network: This is used for image processing, where it makes sense to first compute local features and then compose them into global features. So a small network is applied to overlapping windows of the image, and the concatenation of the output vectors is then passed on to the global network.
- Recurrent Neural Network: Long the standard choice for text processing, it applies the same network over and over again to produce both an output and a hidden state. The hidden state is used as part of the input in the next step; the idea is to use it as working memory between successive applications of the same network. It is still used in conjunction with LLMs to extend their context window.
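Here is the promised sketch of a single higher-order layer, \( \mathbf{x}_l = W(\{\mathbf{x}_{l-1}|1\} \otimes \{\mathbf{x}_{l-1}|1\}) \); the sizes and names are my own illustrative choices.

```python
import numpy as np

def higher_order_layer(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    x1 = np.append(x, 1.0)           # {x | 1}: the input with a constant 1 appended
    quad = np.outer(x1, x1).ravel()  # the tensor (outer) product with itself, flattened
    return W @ quad                  # linear map from (n+1)^2 features back to n

n = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, (n + 1) ** 2))   # note how fast this matrix grows with n
print(higher_order_layer(W, rng.normal(size=n)))
```

Even for modest \( n \), \( W \) has \( n(n+1)^2 \) parameters, which is the size problem mentioned above.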
There are plenty of other such networks. The thing to understand is that all these modifications generally exist to facilitate optimization and to represent specific types of functions with fewer parameters. As such, any of them can learn any function given enough size, training data, and compute power.
What stops it from making AGI?
- Size of the model: Any neural network is a bounded circuit, i.e. depending on the size of the computation, you might need bigger and bigger models. Bigger models can hold more intricate patterns, thus improving prediction. However, a bigger model requires more computational resources to hold and to run. We are indeed hitting the limits of computational power and overcoming them all the time. No one is sure at what size a model becomes AGI, but using the scaling laws and comparing with human performance, current estimates suggest we would need trillions of times the current computational power if we stick with the current kind of LLMs.
- Size of the data: To specify an intricate function, you need a lot of data. In this context, Kolmogorov complexity matters. The Kolmogorov complexity of a string is the minimum size of a program, written in a specific language (or for a specific universal Turing machine), that generates the string. A string with simple structure, like \( 1, 10, 11, \ldots, \underbrace{1\ldots1}_{m\ \text{times}} \), takes a very small program to generate. On the other hand, for a random string the program would be slightly larger than the string itself (it would literally have to print the string). The Kolmogorov complexity of a function on a finite domain is the Kolmogorov complexity of the string that lists the outputs of the function in order, i.e. \( f(0), f(1), f(2), \ldots \) over the whole input domain. Notice that for a simple logical function this is also small. Now, the model and the data are two different representations of the same function, so the data cannot express a function of higher Kolmogorov complexity than the data itself has. In practice, the data required to train a model is several times larger than the model, because the data contains errors and therefore needs redundancy to properly represent the function. This is why we are already using the whole internet's data to train the frontier models.
- Optimization: Optimizing a highly non-convex function is NP-hard, i.e. there is no efficient algorithm that is guaranteed to reach the optimal parameters. We can, however, reach near-optimal or at least good-enough parameter values with variations of gradient descent. But this is more of a heuristic exercise than a provably correct algorithm. We therefore try to specialize models so that they can be optimized with the available computational power (a bare-bones example of such a training loop follows below).
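To make the heuristic nature of that last point concrete, here is a bare-bones gradient descent loop (a sketch of my own, not a recommended recipe) that trains the small sigmoid network from earlier to fit XOR. Depending on the random initialization, such a simple loop can also get stuck in a poor local optimum, which is exactly the point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # XOR of the two input bits

W1 = rng.normal(size=(3, 4))                  # (2 inputs + bias) -> 4 hidden gates
W2 = rng.normal(size=(5, 1))                  # (4 hidden + bias) -> 1 output gate
lr = 2.0                                      # learning rate, chosen by trial

for step in range(20000):
    Xb = np.hstack([X, np.ones((4, 1))])      # append the constant 1
    h = sigmoid(Xb @ W1)                      # hidden layer
    hb = np.hstack([h, np.ones((4, 1))])
    out = sigmoid(hb @ W2).ravel()            # output layer

    # Gradients of the mean squared error, computed by hand (backpropagation).
    d_out = (out - y) * out * (1 - out)
    grad_W2 = hb.T @ d_out[:, None] / len(X)
    d_h = (d_out[:, None] @ W2[:-1].T) * h * (1 - h)
    grad_W1 = Xb.T @ d_h / len(X)

    W1 -= lr * grad_W1                        # plain gradient descent step
    W2 -= lr * grad_W2

print(np.round(out, 2))                       # ideally close to [0, 1, 1, 0]
```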
Summary
Neural networks are Turing equivalent, or more precisely bounded-Turing equivalent, just like the computers we use in day-to-day life: they can do all possible computation up to a certain size. A laptop's computational power, in terms of what it can possibly compute, is bounded by its memory (including its SSD). In the case of a neural network, its computational power is limited by the number of learnable parameters it has. This bound is a major issue in modeling all sorts of learning tasks. We are not sure how big a model we would need to build AGI, but it should be very large; if we use the current methods, the scaling laws suggest an AGI would need LLMs trillions of times bigger than the ones we currently use. This is clearly infeasible. That is why researchers keep finding new ways to make models that learn more efficiently – both in terms of computational power and in terms of the data they need.