Neural Networks and Reservoir Computing

Illustration of a neuron

While we already have a page explaining how the reservoir computing paradigm works, it is worth discussing the relationship between this paradigm and the machine learning paradigm of neural networks. Most implementations of reservoir computers, including ours, can be viewed as types of artificial neural networks. Artificial neural networks are an artificial intelligence paradigm loosely inspired by how biological neural networks operate. The key idea is a network of nodes that process and share information and have trainable parameters; in other words, the network can be adjusted to become better at performing a desired task. Conceptually, this is similar to how a brain works: cells exchange information, and these cells change their connections over time to learn how to perform tasks. This is, of course, a vast oversimplification of how brains and learning work, but it is conceptually accurate at a very abstract level.

Left: an illustration of how a traditional neural network works. Right: an illustration of how a reservoir neural network works.

Linear operations and tensors

In artificial neural networks, the information exchanged is usually numbers or sets of numbers (i.e. vectors). The basic operations that move this information around are linear operations: adding and subtracting numbers, possibly in a weighted way, i.e. multiplying them by coefficients and then summing. For example, for variables a and b the equation f = 3a - 5b would be linear, but f = ab or f = a^2 would not be. Since neural networks are often very large, it would be cumbersome to deal with them the way introductory algebra is usually done: by giving each variable a name and writing out a set of equations. This works well for a handful of equations, say two or three, but representing millions of equations and variables in such a way would be unworkable.
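To make this concrete, here is a minimal sketch (using NumPy, with made-up equations and numbers chosen purely for illustration) of how a small set of linear equations can be packed into a single matrix acting on a vector of variables:

```python
import numpy as np

# Two hypothetical linear equations in two variables,
#   f1 = 3a - 5b
#   f2 =  a + 2b
# written as a single matrix-vector product f = W @ x.
W = np.array([[3.0, -5.0],
              [1.0,  2.0]])   # one row per equation, one column per variable
x = np.array([2.0, 1.0])      # a = 2, b = 1

f = W @ x
print(f)                      # [1. 4.]
```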

Fortunately, there is a more systematic way to handle a large number of linear equations. Any set of linear equations can be represented as a matrix, a two-dimensional array of numbers where one side has a size equal to the number of variables and the other has a size equal to the number of equations. For neural networks, a slight generalization is useful. Sometimes, to keep track of where data came from, arrays with more than two dimensions are used. For example, in image processing it is convenient to distinguish color channels (i.e. the amount of red, green, or blue) from position in two dimensions; each pixel carries three color values, so the natural object is an overall three-dimensional array (vertical coordinate, horizontal coordinate, and each of three colors). These more general arrays are called “tensors” by machine learning practitioners. An important warning here is that the tensors used in neural networks are a specific special case of the tensors used in mathematics or physics. In mathematics and physics, a tensor with dimension three or higher can be used to represent non-linear interactions (e.g. multilinear maps). An example can be seen below with a three-index tensor representing non-linear interactions between elements of b (with the left-hand side using implied summation and the right-hand side showing it explicitly).

A_k^{ij} b_i b_j \rightarrow \sum_{i,j} A_k^{ij} b_i b_j

In fact, the nonlinear equations that describe gravity are usually represented using tensors. In contrast, machine learning tensors are more like matrices or vectors that have been given extra dimensions for bookkeeping purposes, but their mathematics is linear. The tensor mathematics used in machine learning can be considered a special case of math/physics tensors where nonlinearity is avoided. For example, an operation with a three-index tensor that is linear in a two-index tensor b can be written as

A_k^{ij} b_{ij} \rightarrow \sum_{i,j} A_k^{ij} b_{ij}

which is very useful for bookkeeping, for example when b_{ij} represents the pixels of a two-dimensional image, where using only a single index would be unnatural and cumbersome. These tensors (a linear special case of the more general math/physics concept) are a key structure in neural networks, so much so that one of the leading software platforms for implementing neural networks is known as TensorFlow.
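As an illustrative sketch (the shapes and random data here are invented for the example), NumPy's einsum makes the difference between the two contractions above explicit: the machine-learning-style operation is linear in b, while the math/physics-style contraction is quadratic in it.

```python
import numpy as np

# A hypothetical three-index tensor A (shape k x i x j).
k, i, j = 4, 3, 3
A = np.random.randn(k, i, j)

# Linear case: contract A with a two-index tensor b (e.g. a tiny 3x3 "image").
# Every entry of b appears to the first power, so the operation is linear in b.
b = np.random.randn(i, j)
linear = np.einsum('kij,ij->k', A, b)

# Math/physics-style case from earlier: sum_{i,j} A_k^{ij} b_i b_j,
# which is quadratic (nonlinear) in the vector b.
b_vec = np.random.randn(i)
nonlinear = np.einsum('kij,i,j->k', A, b_vec, b_vec)

print(linear.shape, nonlinear.shape)   # (4,) (4,)
```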

Nonlinear nodes

While linear operations are useful for manipulating and controlling the flow of data in a neural network, they cannot be used to build powerful networks by themselves. They simply do not have the mathematical capability to perform sophisticated learning. There are numerous ways to see this; one is to realize that if we build a purely linear neural network, no matter how big it is, it can always be collapsed into a single set of linear equations describing the outputs in terms of the inputs. It should be fairly clear that, for example, there is no way to weight and sum the pixel brightnesses of an image such that the result reliably tells you whether the image shows a cat or something else, yet this is all a purely linear network could do.
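A quick sketch (with arbitrary random weights, purely for illustration) shows this collapse directly: stacking several purely linear layers is numerically identical to a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stacked "layers" that are purely linear (no nonlinearity between them).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(2, 8))

x = rng.normal(size=4)

# Passing the input through layer after layer...
deep = W3 @ (W2 @ (W1 @ x))

# ...is exactly the same as one collapsed linear map.
W_collapsed = W3 @ W2 @ W1
shallow = W_collapsed @ x

print(np.allclose(deep, shallow))   # True: depth adds nothing without nonlinearity
```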

For an artificial neural network to reliably perform sophisticated tasks like recognizing cats, it needs some nonlinearity. Once nonlinearity is added, the outputs can no longer be expressed as a simple weighted sum of the inputs; the transformation gains a new level of complexity. In principle, almost any form of nonlinearity can introduce the complexity we need, so the nonlinear functions used are often relatively simple. For example, a rectified linear unit (ReLU) simply outputs zero if its input is negative and passes the input through unchanged if it is positive. This generality has been made rigorous: a shallow neural network with any of a very general class of nonlinear functions is capable of approximating arbitrary mathematical functions. Biological neural networks also contain strong nonlinearity, which comes through the action potential: a neuron sends a strong signal if a threshold is reached, but essentially no signal otherwise.
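Continuing the sketch above (again with arbitrary random weights chosen only for illustration), inserting a ReLU between two linear layers is enough to break the collapse into a single matrix:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity for positive ones.
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
x = rng.normal(size=4)

# With a nonlinearity between the layers, the network can no longer be
# rewritten as a single matrix acting on x.
y = W2 @ relu(W1 @ x)
print(y)
```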

Training

From the description so far, it may sound as though a neural network could be made by haphazardly throwing together some linear operations, sprinkling in a few nonlinear operations (of any type), and suddenly we would have a system that can recognize cats, drive cars, and do all of the other things neural networks might be used for. This isn’t reality, and one of the major reasons is that an artificial neural network needs to be trained. There are a variety of ways to train a neural network, and we won’t discuss them in great detail, but the point is that the network somehow needs to be modified to perform whatever tasks you need it to do. This training has to be done in some kind of automatic way; there is no way to simply derive mathematically what the values of the different terms (known as weights) should be. In general, training is done by defining a mathematical function that measures how well the network performs (a loss function), and then determining how to change the weights to reduce the value of that loss function. The learning process is usually computationally expensive and often difficult to implement.
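As a rough illustration of that loop (the data, the tiny linear model, and the learning rate below are all invented for the example), here is gradient descent on a mean-squared-error loss:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: inputs X and targets y generated by an "unknown" rule we want to learn.
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)          # trainable weights, started from an arbitrary guess
lr = 0.1                 # learning rate (a hyperparameter)

for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)          # loss function: mean squared error
    grad = 2 * X.T @ (pred - y) / len(y)     # how the loss changes with each weight
    w -= lr * grad                           # nudge the weights downhill

print(w, true_w)   # the learned weights approach the true ones
```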

One observation that neural network researchers have made is that deeper networks, with more layers of nodes between the input (what is being learned) and the outputs (the conclusions the network returns to the user), tend to do better on more complex tasks. However, as a network gets deeper it also tends to become harder to train. This is somewhat intuitive: the farther the information about how the loss function changes has to travel, the more chances there are for something to go wrong. Larger training sets are also needed for deep networks; otherwise they can overfit, effectively memorizing random statistical fluctuations in the data rather than learning trends that will generalize. Thus, to make use of deeper neural networks, tricks are needed to make sure training remains effective. For a deep network, all kinds of factors, such as how the nodes of the network are connected and the exact form of the nonlinear function, suddenly matter. While in principle almost any type of nonlinearity is very powerful, the network needs to actually be trained to be useful, and training depends on how the nonlinear nodes behave. For example, the very simple ReLU node we mentioned before stops being useful if it becomes so strongly biased toward negative inputs that it always returns zero; it is then effectively removed from the network. Many other things can go wrong in training as well. While in principle the rules of calculus give a simple strategy (backpropagation) for updating the weights of the network, these updates or “gradients” can either vanish or explode to impractically large values if not treated carefully. An important point for later discussion is that, as the name implies, backpropagation sends updates from the output toward the input. A history of the development of convolutional neural networks (CNNs) shows the numerous innovations that were needed to scale these networks up, and therefore increase their power, while keeping them trainable. Recurrent neural networks (RNNs), the class of neural networks that many reservoir systems formally belong to, have a similar history.
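The vanishing and exploding behaviour can be sketched numerically (the layer count and weight scales below are arbitrary choices for illustration): backpropagation multiplies the error signal by one Jacobian-like factor per layer, so factors typically smaller than one shrink the signal exponentially, while factors larger than one blow it up.

```python
import numpy as np

rng = np.random.default_rng(3)
depth = 50
grad = np.ones(10)   # error signal arriving at the output layer

# One random weight matrix per layer: small-scale weights tend to shrink the
# signal each step, large-scale weights tend to amplify it.
small = [rng.normal(scale=0.2, size=(10, 10)) for _ in range(depth)]
large = [rng.normal(scale=0.8, size=(10, 10)) for _ in range(depth)]

g_small, g_large = grad.copy(), grad.copy()
for Ws, Wl in zip(small, large):
    g_small = Ws.T @ g_small   # signal propagated back one layer
    g_large = Wl.T @ g_large

print(np.linalg.norm(g_small))   # tiny: a vanishing gradient
print(np.linalg.norm(g_large))   # enormous: an exploding gradient
```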

Designing and training traditional networks requires substantial expertise, and even for experts it requires a significant amount of experimentation. Training neural networks is also computationally expensive and uses a substantial amount of energy; training a single model can release as much carbon as five cars do over their lifetimes. One subtle factor that increases both the financial cost and the cost to the environment is that training complicated models requires experimentation to determine the best hyperparameters (settings that control how the network operates and is trained). So the whole cost to train the network isn’t just the cost of running training once, but of running it many times to set these hyperparameters. A study of networks commonly used in natural language processing found that this can substantially increase the cost and energy use in practice, effectively multiplying the amount of effort by the number of experimental tests. In one case study, choosing these parameters took almost 5,000 training cycles of the network and multiplied the electricity consumption by a factor of nearly 2,000 compared to a single training cycle (the multiplier is smaller than the number of cycles because some efficiency can be gained through parallelism).

Minimizing training through the reservoir approach 

We now get to the key strength of reservoir computing. A full discussion of our approach can be found here, but for the purpose of this discussion, we only need to understand the approach conceptually. In the reservoir approach, most of the network is initialized randomly and never trained. Only a small, relatively simple fragment of the network right near the output is ever trained. This approach has several advantages. The part that is not trained does not need to be tunable, which greatly simplifies implementation on physical hardware. Furthermore, since the part that is trained sits right by the output, there is no need to backpropagate any information through most of the network, so there are no concerns about the structure of the nonlinearity interfering with the training process. The final trained part of the network can be shallow, so there is no need to worry about exploding or vanishing gradients. In fact, since the untrained part of the network expands the space and performs nonlinear transformations, the part that is trained can even be linear. The nonlinear interactions that occur earlier in a reservoir machine mean that the machine as a whole still cannot be described as a simple matrix operation.
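To make the contrast concrete, here is a minimal echo-state-network-style sketch (purely illustrative; the toy task, reservoir size, and ridge-regression readout are assumptions for this example, not a description of our hardware or software). The recurrent reservoir is random and fixed, and only the final linear readout is fit, with a single linear solve rather than backpropagation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy task: predict the next value of a noisy sine wave.
t = np.linspace(0, 20 * np.pi, 2000)
u = np.sin(t) + 0.05 * rng.normal(size=t.size)

n_res = 200                                               # reservoir size
W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))            # fixed random input weights
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # scale for stable dynamics

# Run the fixed, untrained reservoir and record its nonlinear states.
states = np.zeros((u.size, n_res))
x = np.zeros(n_res)
for k in range(u.size - 1):
    x = np.tanh(W_in[:, 0] * u[k] + W_res @ x)   # the nonlinearity lives here, untrained
    states[k] = x

# Train ONLY the linear readout, via ridge regression: one linear solve, no backprop.
X, Y = states[:-1], u[1:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)

pred = X @ W_out
print(np.mean((pred - Y) ** 2))   # small error on this toy task
```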

Effectively, reservoir computers are just much easier to train than other architectures and therefore avoid a key bottleneck: there are fewer trainable parameters, so there is less work involved in training the network. It also means that much less expertise is needed from the user. Furthermore, given the simplicity of the part being trained, little or no experimentation is needed to train such a system, so the multiplier on computational resources coming from experimentation is greatly reduced or even eliminated.