ABSTRACTION

The pattern recognition system of the human mind basically operates by the process of abstraction. In deep learning, we would call this feature extraction or generalization.

Abstraction is basically of two types. I do not speak exhaustively; this is simply what comes to mind. The two types of abstraction are: 1. lossy abstraction 2. lossless abstraction. This mirrors compression, which can also be lossy or lossless.
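To make the lossy/lossless distinction concrete, here is a minimal sketch using compression itself. The sine-wave "signal" and the choice of keeping every 10th sample are assumptions made purely for illustration; zlib stands in for any lossless scheme.

```python
import zlib
import numpy as np

# Hypothetical "signal": 1000 samples of one period of a sine wave.
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))

# Lossless: zlib round-trips the raw bytes exactly.
packed = zlib.compress(signal.tobytes())
restored = np.frombuffer(zlib.decompress(packed), dtype=signal.dtype)
lossless_exact = bool(np.array_equal(signal, restored))

# Lossy: keep every 10th sample, then rebuild the rest by linear interpolation.
kept_idx = np.arange(0, signal.size, 10)
kept = signal[kept_idx]
rebuilt = np.interp(np.arange(signal.size), kept_idx, kept)
lossy_error = float(np.max(np.abs(signal - rebuilt)))

print("lossless round trip exact:", lossless_exact)
print("lossy kept", kept.size, "of", signal.size, "values, max error:", lossy_error)
```

The lossless path gives back every value exactly, while the lossy path keeps a tenth of the values and returns only an approximation of the original.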

Deep learning is actually a kind of lossy compression system. When we train a neural network to do a simple task like recognizing digit characters, we take the raw representation of images as lists of numbers (vectors) and abstract away the unnecessary details using feature extraction, to obtain the most relevant pieces of information that best describe images of a particular class.

So we throw away the information that the system deems unnecessary in order to get at what is relevant to our task of image identification. In essence, we have compressed the original data in a lossy fashion in order to represent it in a form that best generalizes it.

Even in the scenario where we are dealing with full colour images of cats, dogs or whatever, and the input comes in 3 channels (RGB) as a tensor, after passing the images through layers of convolution and subsampling we have abstracted away most of the details of the image to get at the set of information that best describes the class. This information usually gets packaged in a vector of lower dimensionality than the one we started with.
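As a rough sketch of how convolution and subsampling shrink an image, the snippet below applies a single random 3x3 filter and then 2x2 max pooling. The 32x32 image size, the random filter, and the helper names `conv2d_valid` and `max_pool_2x2` are assumptions for illustration; a real network learns many filters per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # hypothetical RGB image, height x width x channels
kernel = rng.standard_normal((3, 3, 3))  # one random 3x3 filter spanning the 3 channels

def conv2d_valid(img, k):
    """Slide the filter over the image without padding; returns one channel."""
    kh, kw, _ = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw, :] * k)
    return out

def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = conv2d_valid(image, kernel)   # 30 x 30
pooled = max_pool_2x2(feature_map)          # 15 x 15
print(image.size, "->", feature_map.size, "->", pooled.size)
```

After just one convolution and one pooling step, 3072 numbers have become 225.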

There is a group of networks called autoencoders whose goal is to learn a compressed, lower-dimensional representation of an input vector. This is like taking the linearized vector of an MNIST digit and passing it through several layers of the network to obtain a low dimensional compressed representation of the original input, called a code. A decoder phase then tries to expand the code back out, with the hope that the output resembles the input. The most important thing for us to take from this is that the code is an abstract representation of the raw concrete input data.
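The encoder/code/decoder idea can be sketched with a tiny linear autoencoder trained by plain gradient descent. Everything here is an assumption chosen to keep the example small: the toy data, the sizes (20 inputs squeezed to a 3-number code), the learning rate, and the absence of nonlinearities. Real autoencoders are usually deeper and built with a framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 20, 3                        # samples, input size, code size (all assumed)
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
X = X / X.std()                             # toy data that truly lives in 3 dimensions

W_enc = rng.standard_normal((d, k)) * 0.1   # encoder weights
W_dec = rng.standard_normal((k, d)) * 0.1   # decoder weights

def reconstruction_loss(W_enc, W_dec):
    """Mean squared difference between the decoded output and the input."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = reconstruction_loss(W_enc, W_dec)
learning_rate = 0.05
for _ in range(2000):
    code = X @ W_enc                        # the low-dimensional "code"
    error = code @ W_dec - X                # decoder output minus the input
    grad_out = 2.0 * error / error.size     # gradient of the mean squared error
    grad_dec = code.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(W_enc, W_dec)
print("code shape:", (X @ W_enc).shape)
print("loss before:", initial_loss, "after:", final_loss)
```

Each 20-number input is represented by only 3 numbers in the middle, and training pushes the decoder's output closer to the original input.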

By Chervinskii - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45555552

This code that is generated from the input data is usually called a low dimensional embedding of the input data. For example, using the TensorFlow embedding projector we can visualize each 784-dimensional MNIST input vector as a point in 3-dimensional space, after the original data has had its dimensionality reduced using something like PCA (Principal Component Analysis).
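A minimal sketch of the PCA step, assuming random numbers as a stand-in for real MNIST vectors (so the resulting points carry no digit structure). It computes the principal directions with a singular value decomposition and keeps the first three.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 784))        # stand-in for 500 flattened 28 x 28 digits

# PCA: center the data, then project onto the top principal directions.
X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
embedding_3d = X_centered @ Vt[:3].T

print(X.shape, "->", embedding_3d.shape)
```

Each row of `embedding_3d` is one input reduced from 784 numbers to 3, which is what makes plotting it in space possible.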

Source: projector.tensorflow.org

From a very high-level perspective, this is what we are doing in our deep learning systems. Even when we try to recognize real-world images, which can be on the order of millions of dimensions, we pass this large vector into a network and at the output end we are dealing with something much smaller, for example a vector of 1000 probabilities, one per class.

As an actual example, take the MNIST digit recognition task, which is the hello world program of deep learning. The process involves designing a neural network with several layers. What these layers do is not really important for this discussion, because there are a lot of complicated activities going on in every neural network, but what we need to know is that the layers squish information! Squishing information can be seen as compressing information. The image is usually input as a list of 784 pixel values. Each image contains 28 x 28 pixels, so through linearization it is brought down from two dimensions, with 28 pixels per row and 28 pixels per column, to 784 pixels in a single row. This length-784 list is fed into the network, and at the output side we get another list of 10 items, where the position of the item with the highest probability value indicates the digit that the input is most likely to be.
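The 784-in, 10-out shape of the computation can be sketched as follows. The weights are random and untrained, and the hidden layer size of 64 is an assumption, so the predicted digit is meaningless; the point is only how the lists shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for one grayscale MNIST digit
x = image.reshape(-1)                 # linearize: 28 x 28 -> 784

# Untrained, randomly initialized weights; layer sizes 784 -> 64 -> 10 are assumed.
W1, b1 = rng.standard_normal((784, 64)) * 0.05, np.zeros(64)
W2, b2 = rng.standard_normal((64, 10)) * 0.05, np.zeros(10)

hidden = np.maximum(0.0, x @ W1 + b1)     # ReLU layer squishes 784 values to 64
logits = hidden @ W2 + b2                 # 64 values squished to 10 scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax turns scores into probabilities

predicted_digit = int(np.argmax(probs))   # position of the highest probability
print(x.shape, "->", hidden.shape, "->", probs.shape, "predicted:", predicted_digit)
```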

Source: http://mxnet.incubator.apache.org/tutorials/python/mnist.html

Looking at all this from a very abstract point of view, ignoring all the details of the network training (weights, layer details etc.), we can see that the action being performed is actually compression, and this compression involves transforming the input data from one representation to another until it is finally represented in the form of probabilities.

Even in something like a convolutional neural network for image recognition, where there are steps that perform convolution and other steps that perform downsampling, if we take a bird's-eye view of what is happening we see that one representation of something, that is the pixel data, is being re-represented as other structures, down to the level where the final representation is a bunch of probabilities. This is a deep fact of the nature of intelligence and needs some deep meditation to fully appreciate because it will lead to paths of understanding that you may not be able to see immediately.

If the network is taken as a black box, we see that information comes in from one end as a list of 784 items and is transformed by some process into another list of 10 items. This abstraction/compression, or transferring of one form of representation to another through an intermediary structure, is one aspect of what the pattern recognition component of human intelligence actually does. I am not implying that the brain is doing backpropagation, which is the core algorithm that makes deep learning processes work. What I am saying is that the brain/mind pattern recognition system does this kind of abstraction/compression/feature extraction, taking lots of information (details) and transforming it into some higher representation of patterns, which is what we call knowledge. When we run inference on an already trained network with a digit written in a form it has not seen before, the network abstracts this digit to its core pattern and, in effect, matches it against the patterns it learned during training, which are stored in its weights rather than in a literal database. It then brings out a list of probabilities for how likely it is that the digit it is currently observing is each of the digits it knows.

Let’s say that you input a handwritten 9 that was written in a way that none of the digits in the training data was written. The network will “generalize” this digit, i.e. abstract/compress it into a compact format consisting of “features”, compare these features with what it learned during training, and come up with a probability of how likely it is that what you have input into the network is a 9.

The raw backpropagation algorithm that is at the core of deep learning is just a way to fine tune the weights of the network. We start with a bunch of randomly chosen weights, multiply the input through them layer by layer (matrix multiplications), measure how wrong the output is, and then propagate that error backwards to work out how to nudge each weight. This is repeated until there is a suitable set of weights that best generalizes the input data. All along, the input data is being squished from a higher dimensional representation (that is, a long list) to a low dimensional representation (a shorter list) of probabilities.
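Here is a small sketch of that training loop: random starting weights, a forward pass of matrix multiplications, and a backward pass that adjusts the weights by gradient descent. The toy data (three noisy clusters), the layer sizes, the tanh activation, and the learning rate of 0.1 are all assumptions for illustration, not a recipe from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, hidden, classes = 300, 20, 16, 3           # toy sizes, all assumed
y = rng.integers(0, classes, size=n)
centers = rng.standard_normal((classes, d))
X = centers[y] + rng.standard_normal((n, d))     # one noisy cluster per class
onehot = np.eye(classes)[y]

# Weights start out random; training adjusts them.
W1, b1 = rng.standard_normal((d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, classes)) * 0.1, np.zeros(classes)

learning_rate = 0.1
losses = []
for _ in range(300):
    # Forward pass: input -> hidden features -> class probabilities.
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    losses.append(float(-np.mean(np.log(probs[np.arange(n), y]))))

    # Backward pass: chain rule from the loss back to every weight.
    d_logits = (probs - onehot) / n
    grad_W2, grad_b2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_hidden = (d_logits @ W2.T) * (1.0 - h ** 2)
    grad_W1, grad_b1 = X.T @ d_hidden, d_hidden.sum(axis=0)

    # Gradient descent step: nudge each weight against its gradient.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print("loss at start:", losses[0], "loss at end:", losses[-1])
```

The 20-number inputs end up as 3 probabilities, and the falling loss shows the weights moving away from their random starting point toward something that fits the data.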

What we have been talking about so far, where we have training data that consists of a mapping between a data item and the class to which it belongs, and we train a neural network to generalize this data so that we can perform inference and return a class, is called supervised learning, and more specifically classification.

There is another type of supervised learning where our goal is to predict some real-valued quantity given some numeric input. Remember the example of predicting housing prices: in this scenario what we are doing is still supervised learning, but a particular type called regression.
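A minimal regression sketch, assuming made-up housing data where price depends linearly on floor area plus noise. The numbers (a slope of 3 and an intercept of 50) are invented for the example, and a least-squares line fit stands in for whatever model one would actually train.

```python
import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(50.0, 200.0, size=100)                 # hypothetical floor areas
price = 3.0 * area + 50.0 + rng.normal(0.0, 5.0, 100)     # invented price rule plus noise

# Fit price ~ slope * area + intercept by least squares.
design = np.column_stack([area, np.ones_like(area)])
(slope, intercept), *_ = np.linalg.lstsq(design, price, rcond=None)

predicted = slope * 120.0 + intercept                     # real-valued prediction
print("slope:", slope, "intercept:", intercept, "price for area 120:", predicted)
```

The output is a real number rather than a class, which is what separates regression from classification.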

Supervised learning is a scenario where we have data consisting of a mapping between questions, represented as data items on the left of the mapping, and answers on the right of the mapping: Questions -> Answers

The image of a cat on the left side is like a question, while the text “cat” on the right is an answer. We supply this question and answer to the network, shown in the middle of the image, and the network learns a representation of the cat image on the left, so that at inference/test time, when we supply a different image of a cat, the network is able to abstract away the details of the image to get a representation and see how close that representation is to the one it obtained from training. If it is close enough, the network responds with the right output, which is the text “cat”.

In unsupervised learning we do not supply this kind of mapping like we do in supervised learning. Rather, we supply the data to the network and hope that it learns some kind of low dimensional representation of the data. This process is also called feature extraction and, of course, the “features” are just another representation of the data, preferably in a lower dimension.

The features of the data are a solid representation of the data that removes most of what is noise. They get to the core invariant things about the data, so that some very concrete, unchanging things can be obtained. We also have to note that things in lower dimensions are much easier to deal with computationally.

A typical unsupervised learning task is clustering. In clustering, we have some input data, for example, images. Let’s say we have a bunch of handwritten digits and we want to cluster them into groups that we can visualize.

Above is such a visualization using the Google TensorFlow projector. This is an obvious case because we know beforehand where each digit will fall: most 6s will fall close to each other, as will the 1s, and so on. But we should wonder what we would do if we had data coming in from a source, we did not know a classification for this data, and we still wanted to group it into clusters. This is where unsupervised learning comes to save the day.
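One common way to group unlabeled data is k-means clustering, sketched below. The author does not name a specific algorithm, so k-means is a swapped-in choice; the three synthetic 2-D blobs and the farthest-point initialization are likewise assumptions made to keep the example deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.standard_normal((50, 2)) for c in true_centers])

def kmeans(data, k, steps=20):
    """Group rows of data into k clusters; no labels are used."""
    # Farthest-point initialization: a simple deterministic choice for this sketch.
    centroids = [data[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(steps):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Move each centroid to the mean of its points (leave it if the cluster is empty).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(points, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

The algorithm never sees which blob a point came from, yet points from the same blob end up sharing a cluster label.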

In the case of digit clustering above, we learn a low dimensional embedding for each digit. This is just a fancy way of saying that we extract core features of the image of the digit. Then we represent this embedding in 3 dimensions using the TensorFlow embedding projector.

When we compute an embedding of some data or extract its features, we can compute what is called a nearest function for the data. The nearest function tells us how close one set of features is to another set of features. In something like the Wolfram Language, there is a function for dimensionality reduction which returns a low dimensional vector for the data it is called on. The nearest function can then be computed for this data, and from its output we can estimate the distance between each pair of low dimensional vectors. To aid understanding, this data can then be projected onto 3 dimensions so that we can see visually how each data item relates to the others in space.
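A nearest function can be sketched in a few lines. This is a plain NumPy illustration using Euclidean distance, not the Wolfram Language function itself; the helper name `nearest`, the 8-dimensional random "embeddings", and the query point are all hypothetical.

```python
import numpy as np

def nearest(features, query, k=3):
    """Return indices and distances of the k stored vectors closest to query."""
    dists = np.linalg.norm(features - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 8))   # hypothetical 8-dimensional embeddings
query = features[42] + 0.01                # a point very close to item 42

idx, dist = nearest(features, query)
print("closest items:", idx, "distances:", dist)
```

Items whose features are similar come back with small distances, which is the information a clustering or a visualization is built on.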

Highly related data items usually cluster together, like we saw in the TensorFlow projector above, while unrelated items are further apart in other clusters. Apart from separating items into clusters, such a visualization can even tell us which clusters are closer to each other. In the embedding projector image above we see that 1s are much closer to 7s, probably because of the long stroke they share. A 1 is like a 7, and what differentiates a 7 from a 1 is the possession of a second stroke. You can imagine how this kind of information would be relevant when you have data that is hard to classify explicitly. Unsupervised learning provides a way to understand Gestalt in humans: how humans form a holistic understanding of things without tearing the details apart.

Source (supervised learning Questions -> Answers figure): https://www.slideshare.net/vanhuyz/kaonet-face-recognition-and-generation-app-using-deep-learning

