Autoencoder

Schematic structure of an autoencoder with 3 fully connected hidden layers.

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner.[1][2] The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. Recently, the autoencoder concept has become more widely used for learning generative models of data.[3][4] Some of the most powerful AIs in the 2010s involved sparse autoencoders stacked inside of deep neural networks.[5]

Purpose

An autoencoder learns to compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction, for example by learning how to ignore noise. Some architectures use stacked sparse autoencoder layers for image recognition.[citation needed] The first encoding layer might learn to encode basic features such as corners, the second to analyze the first layer's output and then encode less local features like the tip of a nose, the third might encode a whole nose, etc., until the final encoding layer encodes the whole image into a code that matches (for example) the concept of "cat".[5] The decoding layers learn to decode the representation back into its original form as closely as possible. An alternative use is as a generative model: for example, if a system is manually fed the codes it has learned for "cat" and "flying", it may attempt to generate an image of a flying cat, even if it has never seen a flying cat before.[3][6]

Structure

The simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to the single-layer perceptrons that make up multilayer perceptrons (MLPs) – having an input layer, an output layer and one or more hidden layers connecting them – where the output layer has the same number of nodes (neurons) as the input layer, and whose purpose is to reconstruct its inputs (minimizing the difference between the input and the output) instead of predicting a target value given the inputs. Therefore, autoencoders are unsupervised learning models: they do not require labeled inputs to enable learning.

An autoencoder consists of two parts, the encoder and the decoder, which can be defined as transitions $\phi$ and $\psi$ such that:

$\phi : \mathcal{X} \rightarrow \mathcal{F}$
$\psi : \mathcal{F} \rightarrow \mathcal{X}$
$\phi, \psi = \underset{\phi,\psi}{\operatorname{arg\,min}}\, \|X - (\psi \circ \phi)X\|^2$

In the simplest case, given one hidden layer, the encoder stage of an autoencoder takes the input $\mathbf{x} \in \mathbb{R}^d = \mathcal{X}$ and maps it to $\mathbf{h} \in \mathbb{R}^p = \mathcal{F}$:

$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$

This image $\mathbf{h}$ is usually referred to as the code, the latent variables, or the latent representation. Here, $\sigma$ is an element-wise activation function such as a sigmoid function or a rectified linear unit, $W$ is a weight matrix, and $\mathbf{b}$ is a bias vector. After that, the decoder stage of the autoencoder maps $\mathbf{h}$ to the reconstruction $\mathbf{x}'$ of the same shape as $\mathbf{x}$:

$\mathbf{x}' = \sigma'(W'\mathbf{h} + \mathbf{b}')$

where $\sigma'$, $W'$, and $\mathbf{b}'$ for the decoder may be unrelated to the corresponding $\sigma$, $W$, and $\mathbf{b}$ for the encoder.

Autoencoders are trained to minimise reconstruction errors (such as squared errors), often referred to as the "loss":

$\mathcal{L}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2 = \|\mathbf{x} - \sigma'(W'\sigma(W\mathbf{x} + \mathbf{b}) + \mathbf{b}')\|^2$

where the loss $\mathcal{L}$ is usually averaged over some input training set.

Should the feature space $\mathcal{F}$ have lower dimensionality than the input space $\mathcal{X}$, the feature vector $\phi(\mathbf{x})$ can be regarded as a compressed representation of the input $\mathbf{x}$. If the hidden layers are larger than the input layer, an autoencoder can potentially learn the identity function and become useless. However, experimental results have shown that autoencoders might still learn useful features in these cases.[7]:19
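
As a minimal illustration of these definitions (not taken from the cited sources), the following NumPy sketch implements the forward pass and the squared-error loss, assuming sigmoid activations for both encoder and decoder and illustrative dimensions $d = 784$ and $p = 32$; the names encode, decode and reconstruction_loss are introduced here for convenience, and later sketches in this article reuse them:

    import numpy as np

    rng = np.random.default_rng(0)

    d, p = 784, 32                               # illustrative input and code dimensions
    W = rng.normal(scale=0.01, size=(p, d))      # encoder weight matrix W
    b = np.zeros(p)                              # encoder bias vector b
    W2 = rng.normal(scale=0.01, size=(d, p))     # decoder weight matrix W'
    b2 = np.zeros(d)                             # decoder bias vector b'

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def encode(x):
        # h = sigma(W x + b)
        return sigmoid(W @ x + b)

    def decode(h):
        # x' = sigma'(W' h + b')
        return sigmoid(W2 @ h + b2)

    def reconstruction_loss(x):
        # L(x, x') = ||x - x'||^2
        x_rec = decode(encode(x))
        return np.sum((x - x_rec) ** 2)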

Variations

Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations:

Denoising autoencoder

Denoising autoencoders take a partially corrupted input during training and learn to recover the original undistorted input. This technique was introduced with the aim of learning a good representation.[8] A good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input. This definition contains the following assumptions:

  • Higher level representations are relatively stable and robust to the corruption of the input;
  • It is necessary to extract features that are useful for representation of the input distribution.

To train an autoencoder to denoise data, it is necessary to perform a preliminary stochastic mapping $\mathbf{x} \rightarrow \tilde{\mathbf{x}}$ in order to corrupt the data, and to use $\tilde{\mathbf{x}}$ as input for an otherwise standard autoencoder, with the only exception being that the loss is computed for the initial input $\mathbf{x}$ instead of $\tilde{\mathbf{x}}$.
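
Continuing the sketch above, and assuming masking noise (randomly zeroing input components) as the corruption process, the denoising loss could be computed as follows; the corruption type and drop probability are illustrative choices, not prescribed by [8]:

    def corrupt(x, drop_prob=0.3):
        # masking noise: randomly zero a fraction of the input components
        mask = rng.random(x.shape) > drop_prob
        return x * mask

    def denoising_loss(x):
        # feed the corrupted input through the autoencoder, but compare
        # the reconstruction with the clean input x
        x_tilde = corrupt(x)
        x_rec = decode(encode(x_tilde))
        return np.sum((x - x_rec) ** 2)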

Sparse autoencoder

Autoencoders were invented in the 1980s; however, the initial versions were difficult to train, as the encodings had to compete to set the same small set of bits. This was solved by "sparse autoencoding". Sparse autoencoders include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at once.[5]

Sparsity may be achieved by additional terms in the loss function during training (by comparing the probability distribution of the hidden unit activations with some low desired value),[9] or by manually zeroing all but the strongest hidden unit activations (k-sparse autoencoder).[10]
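
As illustrative sketches of these two mechanisms in the NumPy setting used above (the value of $k$ and the sparsity target $\rho$ are arbitrary assumptions):

    def k_sparse(h, k=5):
        # k-sparse autoencoder: keep only the k strongest hidden activations, zero the rest
        h_sparse = h.copy()
        h_sparse[np.argsort(h)[:-k]] = 0.0
        return h_sparse

    def kl_sparsity_penalty(H, rho=0.05):
        # H holds the hidden activations for a batch, one row per example;
        # penalize deviation of the average activation rho_hat from the low target rho
        rho_hat = H.mean(axis=0)
        return np.sum(rho * np.log(rho / rho_hat)
                      + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))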

Variational autoencoder (VAE)

Variational autoencoder models make strong assumptions concerning the distribution of latent variables. They use a variational approach for latent representation learning, which results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes (SGVB) estimator.[3] It assumes that the data is generated by a directed graphical model $p_\theta(\mathbf{x}|\mathbf{h})$ and that the encoder is learning an approximation $q_\phi(\mathbf{h}|\mathbf{x})$ to the posterior distribution $p_\theta(\mathbf{h}|\mathbf{x})$, where $\phi$ and $\theta$ denote the parameters of the encoder (recognition model) and decoder (generative model) respectively. The probability distribution of the latent vector of a VAE typically matches that of the training data much more closely than that of a standard autoencoder. The objective of the VAE has the following form:

$\mathcal{L}(\phi, \theta, \mathbf{x}) = D_{KL}(q_\phi(\mathbf{h}|\mathbf{x}) \,\|\, p_\theta(\mathbf{h})) - \mathbb{E}_{q_\phi(\mathbf{h}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{h})\big]$

Here, $D_{KL}$ stands for the Kullback–Leibler divergence. The prior over the latent variables is usually set to be the centred isotropic multivariate Gaussian $p_\theta(\mathbf{h}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$; however, alternative configurations have been considered.[11]

Commonly, the shapes of the variational and the likelihood distributions are chosen such that they are factorized Gaussians:

$q_\phi(\mathbf{h}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\rho}(\mathbf{x}), \boldsymbol{\omega}^2(\mathbf{x})\mathbf{I})$
$p_\theta(\mathbf{x}|\mathbf{h}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\sigma}^2(\mathbf{h})\mathbf{I})$

where $\boldsymbol{\rho}(\mathbf{x})$ and $\boldsymbol{\omega}^2(\mathbf{x})$ are the encoder outputs, while $\boldsymbol{\mu}(\mathbf{h})$ and $\boldsymbol{\sigma}^2(\mathbf{h})$ are the decoder outputs. This choice is justified by the simplifications[3] that it produces when evaluating both the KL divergence and the likelihood term in the variational objective defined above.
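
As a sketch of how the objective above might be evaluated for a single input under the factorized-Gaussian choice (this is not code from [3]), using the reparameterization trick; here encoder and decoder are hypothetical functions returning the distribution parameters, and the likelihood variance is fixed to one for simplicity:

    def vae_objective(x, encoder, decoder):
        # encoder(x) returns the mean rho(x) and log-variance log omega^2(x) of q(h|x);
        # decoder(h) returns the mean mu(h) of p(x|h) with the variance fixed to one
        rho, log_omega2 = encoder(x)

        # reparameterization trick: h = rho + omega * eps with eps ~ N(0, I)
        eps = rng.standard_normal(rho.shape)
        h = rho + np.exp(0.5 * log_omega2) * eps

        # closed-form KL divergence between N(rho, omega^2 I) and the prior N(0, I)
        kl = 0.5 * np.sum(np.exp(log_omega2) + rho ** 2 - 1.0 - log_omega2)

        # negative log-likelihood of a unit-variance Gaussian, additive constants dropped
        mu = decoder(h)
        nll = 0.5 * np.sum((x - mu) ** 2)

        return kl + nll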

VAEs have been criticized because they generate blurry images.[12] However, researchers employing this model were showing only the mean of the distributions, $\boldsymbol{\mu}(\mathbf{h})$, rather than a sample of the learned Gaussian distribution $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\sigma}^2(\mathbf{h})\mathbf{I})$.

These samples were shown to be overly noisy due to the choice of a factorized Gaussian distribution.[13][14] Employing a Gaussian distribution with a full covariance matrix, $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{h}), \boldsymbol{\Sigma}(\mathbf{h}))$, could solve this issue, but is computationally intractable and numerically unstable, as it requires estimating a covariance matrix from a single data sample. However, later research[13][14] showed that a restricted approach, where the inverse (precision) matrix is sparse, can be tractably employed to generate images with high-frequency details.

Contractive autoencoder (CAE)

The contractive autoencoder adds an explicit regularizer to its objective function that forces the model to learn a function that is robust to slight variations of the input values. This regularizer corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. The final objective function has the following form:

$\mathcal{L}(\mathbf{x}, \mathbf{x}') + \lambda \sum_i \|\nabla_{\mathbf{x}} h_i\|^2$
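
For the sigmoid encoder of the earlier sketch, the squared Frobenius norm of the Jacobian has a simple closed form, so the penalty can be written as below; the weighting $\lambda$ is an illustrative choice:

    def contractive_penalty(x, lam=0.001):
        # for the sigmoid encoder h = sigma(W x + b), the squared Frobenius norm of the
        # Jacobian dh/dx equals sum_i (h_i * (1 - h_i))^2 * sum_j W_ij^2
        h = encode(x)
        return lam * np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

    def contractive_loss(x, lam=0.001):
        return reconstruction_loss(x) + contractive_penalty(x, lam)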

Relationship with principal component analysis (PCA)

If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA).[15][16] The weights of an autoencoder with a single hidden layer of size $p$ (where $p$ is less than the size of the input) span the same vector subspace as the one spanned by the first $p$ principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition.[17]
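
A small sketch of this relationship, assuming the principal directions are obtained from the SVD of the centred data matrix: a linear autoencoder built from the first $p$ principal directions reconstructs each input by orthogonal projection onto the principal subspace, which is the optimal linear reconstruction that a trained linear autoencoder also attains:

    def pca_linear_autoencoder(X, p):
        # X: data matrix with one example per row; p: size of the hidden layer (p < d)
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        V_p = Vt[:p].T                               # first p principal directions, shape (d, p)
        encode_lin = lambda x: V_p.T @ (x - mean)    # linear encoder (code of size p)
        decode_lin = lambda h: V_p @ h + mean        # linear decoder
        # decode_lin(encode_lin(x)) is the orthogonal projection of x onto the principal
        # subspace, the optimum that a trained linear autoencoder also attains
        return encode_lin, decode_lin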

Training

The training algorithm for an autoencoder can be summarized as

For each input $\mathbf{x}$:
  • Do a feed-forward pass to compute activations at all hidden layers, then at the output layer, to obtain an output $\tilde{\mathbf{x}}$;
  • Measure the deviation of $\tilde{\mathbf{x}}$ from the input $\mathbf{x}$ (typically using squared error);
  • Backpropagate the error through the net and perform weight updates (a sketch of one such step follows this list).
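
Continuing the NumPy sketch from the Structure section (sigmoid encoder and decoder, squared-error loss), one such training step might look as follows; plain steepest descent and the learning rate are illustrative choices:

    def train_step(x, lr=0.01):
        # one stochastic gradient descent update for the sigmoid autoencoder defined earlier
        global W, b, W2, b2

        # feed-forward pass
        h = sigmoid(W @ x + b)           # hidden activations (the code)
        x_rec = sigmoid(W2 @ h + b2)     # reconstruction

        # backpropagate the squared-error loss L = ||x - x_rec||^2
        delta_out = 2.0 * (x_rec - x) * x_rec * (1.0 - x_rec)
        grad_W2 = np.outer(delta_out, h)
        grad_b2 = delta_out
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
        grad_W = np.outer(delta_hid, x)
        grad_b = delta_hid

        # weight updates
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
        W -= lr * grad_W
        b -= lr * grad_b

        return np.sum((x - x_rec) ** 2)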

An autoencoder is often trained using one of the many variants of backpropagation (such as conjugate gradient method, steepest descent, etc.). Though these are often reasonably effective, there are fundamental problems with the use of backpropagation to train networks with many hidden layers. Once errors are backpropagated to the first few layers, they become minuscule and insignificant. This means that the network will almost always learn to reconstruct the average of all the training data.[citation needed] Though more advanced backpropagation methods (such as the conjugate gradient method) can solve this problem to a certain extent, they still result in a very slow learning process and poor solutions. This problem can be remedied by using initial weights that approximate the final solution. The process of finding these initial weights is often referred to as pretraining.

Stephen Luttrell, while based at RSRE, developed a technique for unsupervised training of hierarchical self-organizing neural nets with "many hidden layers",[18] which are equivalent to deep autoencoders. Geoffrey Hinton developed an alternative pretraining technique for training many-layered deep autoencoders. This method involves treating each neighbouring set of two layers as a restricted Boltzmann machine so that the pretraining approximates a good solution, then using a backpropagation technique to fine-tune the results.[19] This model takes the name of deep belief network.

References

  1. ^ Liou, Cheng-Yuan; Huang, Jau-Chi; Yang, Wen-Chie (2008). "Modeling word perception using the Elman network". Neurocomputing. 71 (16–18): 3150. doi:10.1016/j.neucom.2008.04.030.
  2. ^ Liou, Cheng-Yuan; Cheng, Wei-Chen; Liou, Jiun-Wei; Liou, Daw-Ran (2014). "Autoencoder for words". Neurocomputing. 139: 84–96. doi:10.1016/j.neucom.2013.09.055.
  3. ^ a b c d Diederik P Kingma; Welling, Max (2013). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  4. ^ Boesen, A.; Larsen, L.; Sonderby, S.K. (2015). "Generating Faces with Torch". torch.ch/blog/2015/11/13/gan.html
  5. ^ a b c Domingos, Pedro (2015). "4". The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. "Deeper into the Brain" subsection. ISBN 978-046506192-1.
  6. ^ Hutson, Matthew (23 February 2018). "New algorithm can create movies from just a few snippets of text". Science. doi:10.1126/science.aat4126.
  7. ^ Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2: 1–127. CiteSeerX 10.1.1.701.9550. doi:10.1561/2200000006.
  8. ^ Vincent, Pascal; Larochelle, Hugo; Lajoie, Isabelle; Bengio, Yoshua; Manzagol, Pierre-Antoine (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion". The Journal of Machine Learning Research. 11: 3371–3408.
  9. ^ sparse autoencoders (PDF)
  10. ^ Makhzani, Alireza; Frey, Brendan (2013). "k-sparse autoencoder". arXiv:1312.5663 [cs.LG].
  11. ^ Partaourides, Harris; Chatzis, Sotirios P (2017). "Asymmetric deep generative models". Neurocomputing. 241: 90–96. doi:10.1016/j.neucom.2017.02.028.
  12. ^ Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; Winther, O. (2016). "Autoencoding beyond pixels using a learned similarity metric". Proceedings of the 33rd International Conference on Machine Learning. 28: 1558–1566. arXiv:1512.09300. Bibcode:2015arXiv151209300B.
  13. ^ a b Dorta, G.; Vicente, S.; Agapito, L.; Campbell, N. D. F.; Simpson, I. (2018). "Structured Uncertainty Prediction Networks". The IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 5477–5485. arXiv:1802.07079. Bibcode:2018arXiv180207079D.
  14. ^ a b Dorta, G.; Vicente, S.; Agapito, L.; Campbell, N. D. F.; Simpson, I. (2018). "Training VAEs Under Structured Residuals". arXiv:1804.01050 [stat.ML].
  15. ^ Bourlard, H.; Kamp, Y. (1988). "Auto-association by multilayer perceptrons and singular value decomposition". Biological Cybernetics. 59 (4–5): 291–294. doi:10.1007/BF00332918. PMID 3196773.
  16. ^ Chicco, Davide; Sadowski, Peter; Baldi, Pierre (2014). "Deep autoencoder neural networks for gene ontology annotation predictions". Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '14. p. 533. doi:10.1145/2649387.2649442. ISBN 9781450328944.
  17. ^ Plaut, E (2018). "From Principal Subspaces to Principal Components with Linear Autoencoders". arXiv:1804.10253 [stat.ML].
  18. ^ Luttrell, S. P. (October 1989). "Hierarchical self-organising networks". 1989 First IEE International Conference on Artificial Neural Networks. London: Institution of Engineering and Technology. pp. 2–6.
  19. ^ Hinton, G. E.; Salakhutdinov, R.R. (28 July 2006). "Reducing the Dimensionality of Data with Neural Networks". Science. 313 (5786): 504–507. Bibcode:2006Sci...313..504H. doi:10.1126/science.1127647. PMID 16873662.