Extreme Learning Machines


Recently I have been studying an idea called the “Extreme Learning Machine”. I first ran into this method about two years ago, when I used it to quickly classify some features, and I never really dove into it at the time.

Extreme Learning Machine (ELM) is a fairly simple method. It is a generalization of Single-hidden Layer Feedforward Neural Networks (SLFNs). You simply project your data to a hidden layer with random weights and then compute the output weights with a least-squares solution. All of this suggests that it is essentially a simple regression model.

ELM is known for its simplicity, short training time and surprisingly good performance. These three are often the most problematic aspects of conventional models: MLPs or Convolutional Neural Networks (CNNs) are painful to train, and RBM-based methods are even more painful without proper optimization. However, before I write anything about ELM, I need to state my major concerns:

  • Is there a reason the mainstream Machine Learning community has distanced itself from ELM? That is probably a strong way to put it. As far as I know, the method was invented at NTU in Singapore, and it even has its own ELM conference in China. Still, there is not much discussion of the method elsewhere, and people do not seem to take it very seriously.

  • The big doubt about this method is that it relies heavily on inverting a square matrix. That is simply what the closed-form solution of the cost function requires, and it is normal in numerical computing. However, we usually try to avoid it in learning algorithms because matrix-inversion routines have limited numerical precision, which in turn limits performance. That is pretty much why we like Gradient Descent.

  • ELM is designed for SLFNs, so by default it has only one hidden layer. There are ways to make it deep, but they are not natural: you have to train the model like Stacked Auto-encoders, using greedy layer-wise training, to construct a deeper network.

While writing this note, I received many comments on ELM from other researchers. I felt I should list all the perspectives so that you can have a complete view of the method.

There are a thousand ways of criticizing ELM, and another thousand ways of supporting it. Bottom line: at the current scale of data, it works. It may not work at large scale (I have never seen it produce comparable results on large datasets such as ImageNet or MIT Places), but it may not be a waste of time either. The potential of random weights is not fully explored yet, and there is no way to claim that current ANN models are biologically plausible (they definitely are not; if our brains fired like this, we would be in trouble).

Extreme Learning Machine: basic story

The terminology used in ELM papers may differ from what I use in this section, but the idea is the same.

ELM is an SLFN. Let \(W_{i}\) be the input weights, \(X=\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}\) the input samples, and \(b\) the bias. The hidden activation \(H\) is then computed by

\[H=f(X\cdot W_{i} + b)\]

where \(f(\cdot)\) is the activation function.

Given the output weights \(W_{o}\) and the target \(T\), ELM minimizes the following cost function:

\[J(H, T, W_{o})=C||H\cdot W_{o} - T||_{p}^{\sigma_{1}}+||W_{o}||_{q}^{\sigma_{2}}\]

where \(C\) is the regularization coefficient, \(\sigma_{1}>0\), \(\sigma_{2}>0\), and \(p\) and \(q\) indicate the norms. The function above is clearly a linear regression formulation. Note that, unlike conventional neural networks, there is no activation function on the output layer (or, if you prefer, the activation is linear). We could of course use Gradient Descent to find \(W_{o}\). However, ELM offers an analytical solution when \(p=q=\sigma_{1}=\sigma_{2}=2\):

\[W_{o}=\left(\frac{I}{C}+H^{\top}H\right)^{-1}H^{\top}T\]

And this is the entire story of basic ELM.
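
To make the recipe concrete, here is a minimal NumPy sketch of basic ELM training and prediction. The sigmoid activation, the hidden size and the value of \(C\) are my own arbitrary choices, and `elm_fit`/`elm_predict` are hypothetical names, not anything from the ELM papers.

```python
import numpy as np

def elm_fit(X, T, n_hidden=512, C=1.0, seed=None):
    """Basic ELM: random input weights and bias, closed-form output weights."""
    rng = np.random.default_rng(seed)
    W_i = rng.standard_normal((X.shape[1], n_hidden))   # random input weights W_i
    b = rng.standard_normal(n_hidden)                   # random bias b
    H = 1.0 / (1.0 + np.exp(-(X @ W_i + b)))            # H = f(X W_i + b), sigmoid f
    # W_o = (I/C + H^T H)^{-1} H^T T, solved as a linear system
    W_o = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W_i, b, W_o

def elm_predict(X, W_i, b, W_o):
    H = 1.0 / (1.0 + np.exp(-(X @ W_i + b)))
    return H @ W_o                                      # linear output layer
```

Solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is the usual, slightly more stable choice, though it does not remove the precision concern raised above.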

ELM Auto-encoder

ELM learns a mapping from randomly projected features to the target, which makes the output layer act as a decoder. Therefore, we can modify the cost function a little so that it learns a mapping from the projected features back to the signal itself.

\[J(H, X, W_{o})=C||H\cdot W_{o} - X||_{p}^{\sigma_{1}}+||W_{o}||_{q}^{\sigma_{2}}\]

This makes ELM act as an auto-encoder, and the transpose of the output weight matrix, \(W_{o}^{\top}\), serves as an encoder.
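
Under the same assumptions, the auto-encoder variant is only a change of target. The snippet below reuses the hypothetical `elm_fit` from the earlier sketch, given some data matrix `X`; whether to apply a nonlinearity after encoding is left open here.

```python
# ELM auto-encoder: same solver, but the target is the input itself.
W_i, b, W_o = elm_fit(X, X, n_hidden=256, C=1.0)   # decoder: H @ W_o approximates X

def encode(X_new):
    return X_new @ W_o.T                           # encoder: the transpose of W_o
```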

Stacked ELM Auto-encoders

As with Stacked Auto-encoders (SAEs), you can use the same greedy layer-wise training principle to train a multi-layered ELM network, as in the sketch below. At the end you will have an unsupervised feature extractor (or a supervised network, if you plug a normal ELM onto the end of the network), and it needs no further fine-tuning.
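
Here is a sketch of the greedy layer-wise stacking, again reusing the hypothetical `elm_fit`; using a purely linear encoding between layers is my simplification.

```python
def stacked_elm_ae(X, layer_sizes=(512, 256, 128), C=1.0):
    """Greedy layer-wise training of stacked ELM auto-encoders."""
    encoders, data = [], X
    for n_hidden in layer_sizes:
        _, _, W_o = elm_fit(data, data, n_hidden=n_hidden, C=C)  # AE for this layer
        data = data @ W_o.T              # encode with W_o^T and feed the next layer
        encoders.append(W_o.T)
    return encoders, data                # layer-wise encoders + final features
```

For a supervised network, fit one more ordinary ELM on the final features against the labels.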

Local Receptive Field-ELM (LRF-ELM)

As we know, ELM is essentially a generalization of SLFNs, so any SLFN variant can apply the same learning principle as ELM, and we can extend the theory to Convolutional Neural Networks (ConvNets). The basic idea of a ConvNet is that, instead of learning a complex function of the entire receptive field, it learns representations from small regions of the receptive field. As a consequence, you need many feature maps to obtain good results.

Assuming you understand the idea of ConvNets: LRF-ELM first initializes \(K\) random filters (you can also orthogonalize these weights using SVD), then computes the feature map \(f^{(k)}\) by:

\[f^{(k)}=F\left(X* w^{(k)}\right)\]

where \(w^{(k)}\) is the \(k\)-th random filter and \(F\) is the activation function. After this, you can apply a pooling operation to the feature maps.

The rest of the story is simple: you flatten the pooled feature maps and learn the target with the equation given earlier.
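
To make this concrete, here is a rough single-image sketch of the LRF-ELM feature extraction; the ReLU activation, mean pooling and SciPy convolution are my own placeholder choices for \(F\) and the pooling step.

```python
import numpy as np
from scipy.signal import convolve2d

def lrf_elm_features(img, filters, pool=2):
    """K random (optionally orthogonalized) filters -> activation -> pooling -> flatten."""
    maps = []
    for w in filters:                                           # w^{(k)}
        f = np.maximum(convolve2d(img, w, mode='valid'), 0.0)   # F(X * w^{(k)})
        h, wd = f.shape
        f = f[:h - h % pool, :wd - wd % pool]                   # crop to a multiple of pool
        f = f.reshape(h // pool, pool, wd // pool, pool).mean(axis=(1, 3))  # pooling
        maps.append(f.ravel())
    return np.concatenate(maps)   # flattened features; output weights are solved as before
```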

LRF-ELM Auto-encoder

The solution for an LRF-ELM Auto-encoder is not so obvious once you try to work it out. Because LRF-ELM is similar to a ConvNet, you cannot reuse the previous solution directly. After some investigation, I figured out a way of learning the filters in an unsupervised manner.

Suppose we have \(K\) filters (their size does not matter); then the feature map \(f^{(k)}\) is calculated as

\(f^{(k)}=F\left(X* w^{(k)}\right)\) where \(w^{(k)}\) is the \(k\)-th random filter and \(F(\cdot)\) is the activation function.

In the decoding stage, the filters can be learned by using the feature maps themselves as filters. The learned filter \(\hat{w}^{(k)}\) is computed by

\[\hat{w}^{(k)}=X* f^{(k)}\cdot \left(\frac{1}{\frac{I}{C}+\sum_{i,j}\left(f_{i,j}^{(k)}\right)^{2}}\right)\]
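Here is one literal reading of the formula above in code; treating \(X* f^{(k)}\) as a 'valid' cross-correlation between the image and the feature map (so the result has the shape of a filter) is my own interpretation.

```python
import numpy as np
from scipy.signal import correlate2d

def learn_filter(img, f_k, C=1.0):
    """hat{w}^{(k)} = (X * f^{(k)}) / (1/C + sum_{i,j} (f^{(k)}_{i,j})^2)."""
    numerator = correlate2d(img, f_k, mode='valid')   # filter-sized result
    return numerator / (1.0 / C + np.sum(f_k ** 2))
```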

Incremental Learning of ELM

One problem with the original ELM is that it says nothing about the size of the data. In most recent settings it is practically impossible to compute with the entire dataset at once, so we need a way of learning the output weights incrementally.

The paper A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks offers a nice solution to this problem: you can update the output weights by

\[P_{k+1}=P_{k}-P_{k}H_{k+1}^{\top}\left(I+H_{k+1}P_{k}H_{k+1}^{\top}\right)^{-1}H_{k+1}P_{k}\] \[W_{o}^{k+1}=W_{o}^{k}+P_{k+1}H_{k+1}^{\top}(T_{k+1}-H_{k+1}W_{o}^{k})\]

where \(P_{k}=K_{k}^{-1}\), \(K_{0}=H_{0}^{\top}H_{0}\), \(W_{o}^{0}=K_{0}^{-1}H_{0}^{\top}T_{0}\), and

\[K_{k+1}=K_{k}+H_{k+1}^{\top}H_{k+1}\]

You might think that if this is how the output weights are computed, then LRF-ELM becomes a real mess. It turns out to be even simpler, since \(K\) in LRF-ELM is a scalar and its inverse is simply \(\frac{1}{K}\).
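
For reference, here is a direct NumPy transcription of the recursion above; the function name and the chunked-update framing are mine.

```python
import numpy as np

def os_elm_update(P, W_o, H_new, T_new):
    """One online update: P_{k+1} and W_o^{k+1} from a new chunk (H_{k+1}, T_{k+1})."""
    S = np.eye(H_new.shape[0]) + H_new @ P @ H_new.T
    P = P - P @ H_new.T @ np.linalg.solve(S, H_new @ P)
    W_o = W_o + P @ H_new.T @ (T_new - H_new @ W_o)
    return P, W_o
```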

Another way of updating the output weights is suggested by another paper: Fast, simple and accurate handwritten digit classification using extreme learning machines with shaped input-weights. Instead of updating the output weights directly, it simply updates the two components of the solution, \(H^{\top}H\) and \(H^{\top}T\). This way you only maintain two fixed-size matrices, and the output weights can be computed at any time from these two components. The update rule is easy to derive:

\[K_{k+1}=K_{k}+H_{k+1}^{\top}H_{k+1}\] \[A_{k+1}=A_{k}+H_{k+1}^{\top}T_{k+1}\]

where \(K_{0}=H_{0}^{\top}H_{0}\) and \(A_{0}=H_{0}^{\top}T_{0}\).
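
Here is a sketch of this second scheme under the same assumptions: keep only \(K=H^{\top}H\) and \(A=H^{\top}T\), and solve for \(W_{o}\) whenever you need it. The function names are mine.

```python
import numpy as np

def accumulate(K, A, H_new, T_new):
    """K_{k+1} = K_k + H_{k+1}^T H_{k+1},  A_{k+1} = A_k + H_{k+1}^T T_{k+1}."""
    return K + H_new.T @ H_new, A + H_new.T @ T_new

def solve_output_weights(K, A, C=1.0):
    """W_o = (I/C + K)^{-1} A, computed on demand from the two components."""
    return np.linalg.solve(np.eye(K.shape[0]) / C + K, A)
```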

Maybe another way of training ELM

So far we have reviewed and explored some variants of ELM. However, it does not really fit into conventional Machine Learning, where we use back-propagation and the rest of the usual toolkit. The incremental learning in the last section is somewhat “Machine Learning”-ish, but I still feel it is not natural enough.

This brought me to a recently proposed method, described in Random feedback weights support learning in deep neural networks. I admire the approach it describes: without propagating errors through the transpose of the forward weights, one can still train a neural network with a back-propagation-style algorithm that uses fixed random feedback weights. With this approach, one can train MLPs, ConvNets and other feedforward network variants without any trouble.

Generalizing ELM

Let me say this again: the original proposal of ELM is a method that generalizes SLFNs, and it clearly tries to characterize either the output target or the signal itself. But it does not clean the data. In fact, most neural networks out there do not try to clean the signal before characterizing it; the hope is that a hierarchical structure will produce abstract features that are clean. Has an ELM ever tried to characterize the hidden activation itself? Instead of a feedforward hidden layer, can we use a recurrent hidden layer? And under this assumption, does ELM have to be a single-layered structure?

Okay. So far, like other neural networks, ELM models either a target (classification labels or a regression target) or the input itself. This is conventional. But what if we replace the target with the hidden activation itself? Then the cost function looks like this:

\[J(H, W_{o})=C||H\cdot W_{o} - H||_{p}^{\sigma_{1}}+||W_{o}||_{q}^{\sigma_{2}}\]

Does this even make any sense? Everyone will say that in this case \(W_{o}\) is basically a regularized identity map. It is clear that the first (data) term of the cost drops to 0 if \(W_{o}\) is the identity matrix. So, if it is essentially an identity matrix, why would I even try to learn it?

Well, wrong. There are plenty of things to discover here. If you replace \(T\) with \(H\) in the ELM solution, it is not hard to see that \(W_{o}\) is closely related to the correlation matrix of the hidden activations. And why does ELM’s hidden layer have to be feedforward? Can it be a recurrent layer? The answer is YES. By the way, this whole construction is called a conceptor network. It was developed by Herbert Jaeger at Jacobs University Bremen; you can find the complete technical report here. I suspect Professor Jaeger himself did not realize that a close relative of this idea had been invented almost 10 years earlier.
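
Here is the one-line check (my own rewriting): substitute \(T=H\) into the closed-form solution and write \(R=H^{\top}H\) for the (unnormalized) correlation matrix of the hidden activations; since \(R\) commutes with \(R+\frac{I}{C}\), the result has exactly the form of Jaeger’s conceptor \(R(R+\alpha^{-2}I)^{-1}\) with \(\alpha^{-2}=\frac{1}{C}\), up to normalizing \(R\) by the number of samples.

\[W_{o}=\left(\frac{I}{C}+H^{\top}H\right)^{-1}H^{\top}H=\left(\alpha^{-2}I+R\right)^{-1}R=R\left(R+\alpha^{-2}I\right)^{-1}\]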

Alright, so you can extend ELM to conceptors; what is the big deal? The BIG DEAL is that ELM and its theory do not have to be only a generalization of SLFNs: the idea can become a multi-layer hierarchy that automatically cleans and learns patterns in the data, and it offers a way of controlling neural dynamics that is natural and biologically plausible. I am not going to review the full naive conceptor formulation here, but instead jump straight to the final offering of the theory: the Random Feature Conceptor. (The following requires some understanding of naive conceptors and auto-conceptors.) In what follows, I use the conceptor terminology.

The basic idea is to project the reservoir state into a large feature space, so the network is formulated as:

\[r(n+1)=\tanh(Gz(n)+W^{\text{in}}p(n)+b)\] \[z(n+1)=\text{diag}(c(n))F'r(n)\]

where \(r(n)\in\mathbb{R}^{N\times 1}\), \(G\in\mathbb{R}^{N\times M}\), \(z(n)\in\mathbb{R}^{M\times 1}\), \(p(n)\in\mathbb{R}^{L\times 1}\), \(c(n)\in\mathbb{R}^{M\times 1}\), \(F'\in\mathbb{R}^{M\times N}\).

Now it is easy to formulate the cost function as described earlier:

\[J(z, c)=\frac{1}{M}\sum_{i}||z_{i}-c_{i}z_{i}||^{2}+\frac{\alpha^{-2}}{M}\sum_{i}c_{i}^{2}\]

You can also calculate the fixed-point solution in this way:

\[c_{i}=E\left[z_{i}^{2}\right]\left(E\left[z_{i}^{2}\right]+\alpha^{-2}\right)^{-1}\]

And of course you can adapt the entire \(c\) using stochastic gradient descent.
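
Below is a small NumPy sketch of the fixed-point computation, estimating \(E[z_{i}^{2}]\) from a batch of collected feature states; the function name and the choice of \(\alpha\) are mine.

```python
import numpy as np

def conception_weights(Z, alpha=8.0):
    """c_i = E[z_i^2] / (E[z_i^2] + alpha^{-2}); Z holds one feature state z(n) per row."""
    ez2 = np.mean(Z ** 2, axis=0)      # estimate E[z_i^2] over the collected states
    return ez2 / (ez2 + alpha ** -2)
```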

There is a drawback, though, and it is a serious one: the approach is very hard to scale up, especially for high-dimensional data such as images. With millions of images, the system would have to compute with very large matrices.

At this point we do not really understand the full power of conceptor networks or of ELM yet. But one thing is for sure: there is no evidence that either ELM or conceptor networks have succeeded on large datasets, and that is what worries me most.