In this post I show some visual results from training a neural network, built to mimic the controlled-hallucination view of the brain, to predict the next image in a very simple (but noisy) video.
Experience is a Controlled Hallucination
Anil Seth says that your experience of the world is like a controlled hallucination (see his TED talk for more on this). It's like a hallucination because it is generated by the brain (as an attempt to predict its inputs). It's controlled because it is strongly constrained by the actual inputs you are predicting (which is less the case in a normal, uncontrolled hallucination).
A Controlled Hallucination Network
The network in this post takes a single frame as input and attempts to predict the following frame. Its prediction is compared against the actual next frame, and it is trained to reduce the error between its prediction and reality. This alone is already a good strategy. However, we can do better. Anything that it gets wrong (its errors) is passed up to a second network. The second network is structured exactly like the first: given an input (the errors from the first network), it tries to predict what its input will be for the next frame (again, the errors from the first network). This whole process is repeated many times, with each higher layer predicting the errors that the layer below it is going to make.
The first network by itself makes quite a good prediction. The first network combined with the second network (that attempts to correct for its errors) makes an even better prediction. Combining the first prediction with the second, and so on, all the way to the top, gets the full prediction of the entire network. There is no single place where the full prediction exists (just like there's no single place in the brain where our experience exists).
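To make the stacking concrete, here is a minimal sketch of the idea in Python with NumPy. It is not the network used for the results in this post: each "network" is replaced by a single linear layer trained by online gradient descent, and the names `LinearPredictor` and `run_stack`, the layer count and the learning rate are all assumptions made for illustration.

```python
import numpy as np


class LinearPredictor:
    """Hypothetical stand-in for one network: a single linear layer."""

    def __init__(self, size, lr=0.01):
        self.W = np.zeros((size, size))
        self.b = np.zeros(size)
        self.lr = lr

    def predict(self, x):
        return self.W @ x + self.b

    def train(self, x, target):
        # Error of the prediction made before this update.
        err = target - self.predict(x)
        # One gradient step on the squared error.
        self.W += self.lr * np.outer(err, x)
        self.b += self.lr * err
        return err


def run_stack(frames, n_layers=3, lr=0.01):
    """Layer 0 predicts the next frame; every higher layer predicts
    the next error of the layer directly below it."""
    size = frames.shape[1]
    layers = [LinearPredictor(size, lr) for _ in range(n_layers)]
    # Layer 0 sees the frame; higher layers have seen no errors yet.
    inputs = [frames[0]] + [np.zeros(size)] * (n_layers - 1)
    full_errors = []
    for t in range(len(frames) - 1):
        # The full prediction is the sum of every layer's prediction.
        preds = [layer.predict(x) for layer, x in zip(layers, inputs)]
        full_pred = np.sum(preds, axis=0)
        full_errors.append(np.mean((frames[t + 1] - full_pred) ** 2))

        # Train bottom-up: each layer's error becomes the target of
        # the layer above, and that layer's input on the next step.
        target = frames[t + 1]
        next_inputs = [target]
        for layer, x in zip(layers, inputs):
            err = layer.train(x, target)
            next_inputs.append(err)
            target = err
        inputs = next_inputs[:n_layers]
    return np.array(full_errors)
```

Calling `run_stack(frames)` on an array of shape `(n_frames, n_pixels)` returns the mean squared error of the full prediction at each step. In this sketch, whatever the summed prediction still gets wrong is exactly the error of the top layer, which is why no single layer holds the full prediction.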
A Noisy Signal
The following video shows a clip from the video that is used as input to the network. It's just some simple static images with some added noise.
The results are from training the networks on just 500 input frames. On the left below is the result from training the network to output its inputs. There is no prediction going on here, just an attempt to copy the input to the output. On the right is the predicting version, which attempts to predict the next input. The only difference between the two networks is that one is trained on its current input whilst the other is trained on its next input.
Learning to Copy the Input
Learning to Predict the Input
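The reason for this result is fairly simple. Training based on copying the input learns to recreate the noise along with the image. Training to predict compares an output built from one instance of noise against a target containing a different instance of noise. Since the noise in the next frame cannot be predicted from the noise in the current one, the errors it causes pull the output in different directions and on average cancel out (when the noise has 0 mean), leaving the underlying image. The really neat thing is that if a system is based on prediction, then you get this result for free. Noise is filtered out automatically.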
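The cancelling-out argument can be checked numerically with a toy model. The sketch below is not the network from this post: it assumes a single static image, independent Gaussian noise on every frame, and a hypothetical model that is just one least-squares line `a * x + b` shared by all pixels. The function names and default values are made up for the example.

```python
import numpy as np


def fit_linear(x, target):
    """Least-squares fit of target ~ a * x + b over all pixels and frames."""
    a, b = np.polyfit(x.ravel(), target.ravel(), deg=1)
    return a, b


def compare_copy_and_predict(n_frames=500, n_pixels=64, noise_std=0.3, seed=0):
    rng = np.random.default_rng(seed)
    clean = rng.uniform(0.0, 1.0, size=n_pixels)  # one static image
    # Every frame is the clean image plus fresh zero-mean noise.
    frames = clean + rng.normal(0.0, noise_std, size=(n_frames, n_pixels))

    x = frames[:-1]
    targets = {
        "copy": frames[:-1],    # trained on its current input
        "predict": frames[1:],  # trained on its next input
    }
    results = {}
    for name, target in targets.items():
        a, b = fit_linear(x, target)
        output = a * x + b
        # Distance from the clean image, which neither fit ever sees.
        results[name] = float(np.mean((output - clean) ** 2))
    return results
```

`compare_copy_and_predict()` returns the mean squared distance to the clean image for each training target. The copying fit reproduces its input, so it keeps all of the noise, while the predicting fit cannot match noise it has not seen yet and ends up closer to the clean image. A model this simple only reduces the noise rather than removing it, but the only thing that changed between the two fits is the training target.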
This is a neat visual example of just one of the reasons why predicting inputs is a good strategy for decoding information about the world. Using a prediction strategy automatically filters out (unbiased) noise.
For completeness, here are the input and prediction side by side. There is no way to know when the background image will change, so the network takes a frame to react.