Human Brain vs. Artificial Neural Network Representation

In regular programming, we usually write code in text form, but this code is eventually transformed through several layers into a representation that the computer hardware can deal with: numbers, basically just 1s and 0s. In reality, though, the computer doesn’t know what a number is; it is humans who interpret the discrete states of the hardware as 1 and 0. There is no symbol 1 or 0 imprinted anywhere on the circuit. The circuit operates on components that can each be in one of two states, high or low voltage, and it is the human who interprets these states as 1 and 0, thus ascribing a symbol to the lumped matter abstraction of computer circuitry.

Neural networks are somewhat like computer circuits because their units can be in varying states, represented by floating point numbers. In the early perceptrons, the states 0 and 1 were used directly for computation, but this was found to be limiting, so modern neural networks compute with floating point numbers.

Just as a programmer solves compute-centric problems by writing a program, which is then translated to numbers and eventually to circuit states that run on a computer, a neural network engineer solves human-cognition-centric problems by designing neural network architectures, which are a more restricted form of the general kinds of programs we write with computers.

In a sense, we can say that neural networks are a specialization of generic computer programs. Indeed in a practical sense, a neural network is just a composition of functions. Each so-called “layer” can be viewed as a function that receives input, performs some computation and returns an output, mostly numerical. Even the so-called layers are actually composed of simpler units like the activation functions. Designing a neural network architecture is like writing a program to solve a problem similar to what is done in traditional software, and training the neural network and eventually performing inference is similar to actually running some computer program and issuing queries to it by either typing commands or clicking on some GUI.
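The "composition of functions" view can be sketched in a few lines of plain Python. This is only an illustration: the helper names (dense, relu, compose) and the hand-picked weights are assumptions made for the example, not anything a real library or a trained network would give you.

```python
from functools import reduce

def dense(weights, biases):
    """Return a layer: a function mapping an input vector to an output vector."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, biases)]
    return layer

def relu(x):
    """Activation function: keep positive values, zero out the rest."""
    return [max(0.0, v) for v in x]

def compose(*functions):
    """Chain functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)

# Hypothetical, hand-picked weights; a real network would learn these.
network = compose(
    dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 outputs
    relu,                                          # activation
    dense([[1.0, 1.0]], [0.1]),                    # layer 2: 2 inputs -> 1 output
)

output = network([2.0, 3.0])
print(output)
```

Calling network(...) plays the role of inference here: the "program" has already been written (the architecture) and configured (the weights), and we are just issuing a query to it.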

In a grey scale image, each pixel can take only one of 256 values (0 to 255), and we can pass that image directly into a deep neural network for feature extraction without needing a convolution step. All that is required is some linearization, which converts the matrix that represents the image into a single vector; a 28 by 28 image, for example, becomes a 784-dimensional vector. In colour images, we usually have a 3-tensor and not just a 2D matrix, so some convolution is typically applied.
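The linearization step can be sketched as follows, assuming the 784 comes from a 28 by 28 image (the MNIST size). The pixel values are made up, and the scaling to the range 0.0 to 1.0 is a common convention rather than something the text requires.

```python
ROWS, COLS = 28, 28  # assumed image size; 28 * 28 = 784

# A made-up grey-scale image: each pixel is an integer from 0 to 255.
image = [[(r * COLS + c) % 256 for c in range(COLS)] for r in range(ROWS)]

# Flatten row by row into one long vector.
flat = [pixel for row in image for pixel in row]

# Scale each pixel to a floating point number between 0.0 and 1.0.
scaled = [pixel / 255 for pixel in flat]

print(len(flat))  # 784
```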

This is because colour images are usually made up of 3 colour components in the RGB colour system: R for red, G for green and B for blue. The image is in effect represented 3 times, once per colour component. So rather than having a single 2D matrix to deal with, we have three 2D matrices stacked together, which is known as a 3-tensor.

Convolving an image already represented as a 3-tensor yields a simpler representation of the image with certain dimensions reduced, which reduces the number of parameters the network has to deal with. After the convolution step and a pooling step, the resulting structures are passed on to the fully connected layers as a one-dimensional vector so that the classic deep learning algorithms can do their work.
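Here is a rough sketch of the convolve, pool and flatten sequence. To keep it short it assumes a single-channel 6 by 6 image rather than a full 3-tensor, a stride of 1 with no padding, and a hand-picked edge kernel; in a trained network the kernel values would be learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]) - 1, 2)]
            for r in range(0, len(feature_map) - 1, 2)]

# Hypothetical 6x6 image with a vertical edge down the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge kernel (an assumption for illustration).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = convolve2d(image, kernel)      # 6x6 -> 4x4
pooled = max_pool2x2(feature_map)            # 4x4 -> 2x2
flat = [v for row in pooled for v in row]    # 2x2 -> 4 values

print(len(flat))  # 4
```

The 36 input values end up as 4 values for the fully connected layers, which is the reduction in parameters the paragraph above describes.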

As varied and complex as all these deep learning systems are, in the final analysis they extract features, sometimes using dimensionality reduction or expansion depending on the network design, and return a vector of probabilities as the result.

Sometimes in a neural network we have to go wide before we go deep, which is another scenario that can increase the dimensionality of some input vector. This is necessary in some scenarios, for practical reasons, to help a network do more memorization than generalization. The insight here is that when we reduce the dimensionality in deep networks we are removing “noisy” features so that we can extract the features of high importance, the ones that are general to the class we are dealing with. So if we are training a deep network to recognize cats, and we have submitted all kinds of cat photos in all kinds of postures and environments and of different breeds, we want to find a representation of cats that is independent of these varying special conditions. The question that our deep network is trying to answer is: what really composes the image of a cat?

In wide networks, we might want to memorize more rather than generalize. Generalization and memorization are usually seen as opposites when designing neural networks. There is usually a scale we keep in mind when thinking about our networks, on the one end is generalization which is the goal of deep networks. On the other end is memorization which is called over capacity.

Although we might never reach the highest possible level of generalization, we are usually okay with a level of generalization that produces the best loss during training and also good accuracy during inference. If the network capacity is too large, it will tend to memorize more than generalize. If our goal is strictly generalization then this is of course not desirable. But there are scenarios where memorization is also a good thing, and people have found practical ways to exploit it.

So, in essence, dimensionality reduction enables us to handle data in a compact way. This reduction can be seen as compression, where some longer vector (list) is reduced to a shorter vector. In a deep learning system in order to get at the features of the data, we usually do some kind of lossy compression of the data because we are working in a limited symbol system.

What is a symbol system? It is the format we choose to represent the data in. Computers only know how to manipulate numbers, and whatever we store or manipulate on a computer is merely numbers running in some electronic circuits. Of course, if we go lower in our understanding, the “numbers” are themselves just another representation of the voltage levels on the computer circuitry. So, in essence, there are no numbers at all, only voltage levels that we represent as numbers for ease of understanding. Further up the abstraction hierarchy are the images we deal with in typical deep learning systems, but we are not limited to images; we can learn on all kinds of data, even sounds.

When we represent images to a neural network (NN), we use floating point numbers, which are numbers with a decimal point like 2.3445. The precision of a floating point number, as I use the term here, is the number of digits after the decimal point. In the number written above the precision is 4 because there are 4 digits (.3445) after the decimal point. The precision matters because it determines the granularity of detail that our network can handle. A higher precision number can capture a greater degree of detail than a lower precision number. This is somewhat analogous to the binary number system used in a computer, where the length of a bit string determines how many distinct things you can encode. A 3-bit sequence can encode 8 things, from 000 to 111, and 4 bits can encode 16 things. The formula for how many things a particular bit length can encode is 2^N (^ means to the power, and N is the length of the bit string). We have something similar with floating point numbers: the greater the precision, that is the number of digits after the decimal point, the greater the number of things you can encode. Pixel values are represented as these floating point numbers, although on the computer circuitry they are still binary, that is, 1s and 0s.
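The counting argument can be checked directly. The sketch below enumerates every 3-bit pattern, applies the 2^N formula, and does the analogous count for decimal digits using the simplified notion of precision from the paragraph above (digits after the decimal point), which is not how hardware floating point formats actually define precision. The function names are made up for the example.

```python
from itertools import product

def count_bit_patterns(n_bits):
    """Number of distinct values an n-bit string can encode: 2 ** n."""
    return 2 ** n_bits

def count_decimal_levels(n_digits):
    """Distinct values from 0 to 1 inclusive with n digits after the point."""
    return 10 ** n_digits + 1

# Enumerate every 3-bit pattern to confirm the count.
patterns = ["".join(bits) for bits in product("01", repeat=3)]

print(patterns[0], patterns[-1], len(patterns))  # 000 111 8
print(count_bit_patterns(4))                     # 16
print(count_decimal_levels(4))                   # 10001 (0.0000 up to 1.0000)
```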

The NN manipulates these vectors or tensors of numbers and crunches high-dimensional stuff into some class representation in the process that we call learning! There are also weights associated with the “neurons”, which are the connection strengths between “neurons” in the neural network. This is all very abstract stuff, and the neurons we are talking about bear little relation to neurons in the human brain. They are just a simplification derived from neuroscience: the brain contains neurons and synaptic connections between them, and according to a paradigm known as Hebbian learning, proposed by Donald Hebb in 1949, learning happens when a group of neurons adjust their connection strengths in response to some input data in order to represent that data in a generalized format as weights.

Neural networks take the input data, perform some operations with randomly chosen weights, and perturb the weights until the weights that best generalize the input data are found. Along the way, the dimensions are squished until the result is a vector of values representing the probabilities of each possible inference about the input data.
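To make "perturb the weights until the best ones are found" concrete, here is a deliberately naive sketch: start from random weights, nudge them at random, and keep a nudge only when the loss improves. This is an assumption made for illustration; real networks are normally trained with gradient descent via backpropagation, which chooses the direction of each adjustment instead of guessing it. The data, step size and seed are all made up.

```python
import random

def loss(w, b, data):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, steps=2000, step_size=0.1, seed=0):
    """Random-perturbation search over two weights (not backpropagation)."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random starting weights
    initial = best = loss(w, b, data)
    for _ in range(steps):
        # Perturb both weights slightly; keep the change only if it helps.
        new_w = w + rng.gauss(0, step_size)
        new_b = b + rng.gauss(0, step_size)
        candidate = loss(new_w, new_b, data)
        if candidate < best:
            w, b, best = new_w, new_b, candidate
    return w, b, initial, best

# Made-up data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-3, 4)]

w, b, initial_loss, final_loss = train(data)
print(initial_loss, final_loss)
```

Because a perturbation is only accepted when it lowers the loss, the final loss can never be worse than the initial one; how close w and b get to 2 and 1 depends on the number of steps and the step size.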

Because we are dealing in a symbol system that can change in only one “dimension”, the kinds of features we can extract can only vary in one dimension at a time. By dimension here I mean that we can either increase or reduce the magnitude of the floating point number that represents some weight in the network. This I call a single dimension of change, quite different from the use of the word dimension to refer to the length of a list (vector) of values. The precision of the floating point numbers stays the same during the training process; if the precision could increase and decrease during training, I would say that the number changes in two dimensions, that is, the magnitude within a fixed precision and the precision itself.

Don’t get confused; let me clear things up. The floating point values used to represent the weights in a typical neural network change in magnitude (that is one dimension) only within a strict precision boundary (that is the length of the digit string constituting the number)! To accommodate a particular number of colours we could have values like 0.0000 representing black and 1.0000 representing white. Any number in between, like 0.4567, could represent some other shade. The detail of the colour space that can be represented depends on how many individual values occur between 0.0000 and 1.0000.

Our neural networks perform compression of information as they transform image data into a class. At the input level we have a vector or tensor of floating point numbers that represents the image itself, but as we apply operations through all the layers we transform the pixel values to mean other things, like “features” and, in the end, the label of the image. What was just raw pixel values is transformed into class probabilities, which mean something quite different from pixel values.

The limitations of a symbol system like fixed precision floating point numbers require that some “information” be dropped as we transform the data from one set of things that can be interpreted as “images” to another set of things that can be interpreted as “probabilities”.

But from practical neural network training, it has been found that 16 bits of precision are enough to represent the features that matter in many tasks that NN researchers are interested in, and during inference 8 bits of precision have often been found to be sufficient.

Although we might be losing information due to our reliance on fixed precision, this is not really the point I am trying to emphasize. I said earlier that we capture only one dimension of information by using the magnitude of our floating point numbers to represent the features. If we are strictly interested in capturing only the visual information required to identify some object, then this is all we need. But humans usually capture multiple things at once from a visual scene, whereas with neural nets we must build separate networks to capture separate things.

The visual information we receive contains various properties. If the goal of a neural network is to extract features for object identification, it will do just that. The hierarchical feature extractors, that is the layers of the neural network, filter the data so that we go from low-level features like edges to high-level features like faces, but these details can only be captured in the single dimension of the magnitude of floating point numbers; other dimensions of the image must be discarded. This is very inefficient because we have to train again to adapt the network to other tasks, either using transfer learning or by building a new network.

If we were trying to classify textures, we would build a new network that represents various textures with the magnitude of floating point numbers. Every neural network has to be built for one thing, and if we want to reuse features we usually resort to transfer learning: holding certain parts of some pretrained network fixed, grafting on another part that is specialized to our task, and training it. This is sometimes successful, but at other times we have to train an entire network architecture.

The details are thrown away in typical neural network configurations, as only the most general representation is kept for the purpose of classification, yet those details might have other uses in the way a human mind performs pattern recognition. It is my opinion that the human brain does not operate on a fixed symbol system like floating point numbers, which can be altered in at most 2 dimensions: the magnitude within a fixed precision and the precision itself. Rather, the brain preserves all input information and invents a new arbitrary symbol with sufficient properties to encode the input. This arbitrary symbol is a full-fledged network data structure, not just a single number like a floating point number. And although the relationships in the network can be modelled numerically, numbers of any kind are not used directly to represent features in the brain.

All new input of a similar class will be mapped to a slightly altered representation of one encoding or another. If something new and unfamiliar is encountered, a new symbol is created internally to represent it, and if something similar to that once unfamiliar thing is later observed, the symbol that was encoded will be altered in some aspect (dimension).

In all their complexity, our neural networks represent all their weight information in 2-dimensional weight matrices (i.e. they can grow in width or height, spatially), that is, rectangular grids of numbers.

These weight matrices hold all the information needed to generalize all the classes of input that the network was trained upon. The weights are floating point numbers that have a certain magnitude, often with a fixed precision, and when a column of these weights is multiplied element-wise with a vector of pixel values and the results are summed, it produces a logit. These logits are passed into a function called softmax, which emphasizes the winning class that the weights represent, provided that the network has been trained until the weights converge at values that represent the input of a particular class.
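The logit and softmax steps can be sketched like this. The 4-pixel input and the 4 by 3 weight matrix are made-up values standing in for a trained network; each column of the matrix holds the weights for one class.

```python
import math

def logits(weights, pixels):
    """One logit per class: multiply that class's weight column by the pixels and sum."""
    n_classes = len(weights[0])
    return [sum(weights[i][k] * pixels[i] for i in range(len(pixels)))
            for k in range(n_classes)]

def softmax(values):
    """Turn logits into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input: 4 pixel values, already scaled to 0.0 .. 1.0.
pixels = [0.0, 1.0, 1.0, 0.0]

# Hypothetical weights: 4 rows (one per pixel), 3 columns (one per class).
weights = [
    [0.2, -0.1, 0.0],
    [0.5, 1.5, -0.3],
    [0.4, 1.0, 0.1],
    [-0.2, 0.3, 0.0],
]

scores = logits(weights, pixels)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))

print(predicted_class)  # 1
```

Softmax does not change which class wins; it exaggerates the gap between the largest logit and the rest and rescales everything so the values can be read as probabilities.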

In the human brain, the internal representation is not limited to a 2-dimensional weight matrix of floating point numbers. Rather, the symbol generated could be in any number of dimensions, represented by the physical properties of the neurons beyond just the weight of the connections, which is the core idea that gave rise to our artificial neural networks with weighted connections.

Each property of the neuron and its connections could represent a dimension in which an internal representation symbol is encoded in and a clique of neurons could represent many similar input structures by merely altering variables like neurotransmitter levels acting on a relatively fixed network.

SOME NOTES ON DIMENSIONS

For want of a better word, I use the word dimension to denote different things, and it is right that I clarify this usage before we move further. The explanation here will clarify a lot about the usage of the word dimension in other parts of this work.

The key to understanding the word dimension as I use it in this work is to pay attention to the context in which I am using it. It's complicated, I know, but I will try my best to do some explaining here. Hope it helps.

In mathematics, we build up dimensions starting from the point. Several points align to form a line, and this line is called the first dimension. Several lines lined up side by side form a plane, which is the second dimension, and several planes stacked upon each other form a cube, the third dimension.

Honestly, these descriptions of dimensions look arbitrary to me, but they work for most purposes. We can generalize the cube to something called the Cartesian coordinate system, that is, the typical X, Y, Z coordinate system. The Cartesian coordinate system can be used to locate any object in space by enclosing it in an imaginary box, choosing one vertex of the box as the origin, and taking the 3 edges that go out of this vertex as the X, Y and Z directions.

To find the object we only need to move in 3 directions: a certain distance in X, then in Y and then in Z. Theoretically, the order in which we move is irrelevant, as we still arrive at the object even if we start by moving in the Z direction first. We store the information of how to arrive at the object in a vector (list) of 3 values (X, Y, Z), with each value representing how much we must move in each direction. This location information is all we need to find some object using the coordinate system.

We say that the world we see is 3 dimensional because anything in it can be pinpointed by moving in only 3 directions. We say an object is 3 dimensional because if we place it in some cartesian coordinate system like we have described above, we can locate all the surface points on the objects using the coordinate system.

When we are faced with the problem of how to represent objects in higher dimensions, we can simply generalize points in a vector system. A point in 3-dimensional space is represented as a vector with 3 items, the movements we must make in the 3 directions to reach the location of an object. In something like 4 dimensions, which is hard to visualize from within 3-dimensional space, we can simply say that a point is represented as a vector with 4 items, like (W, X, Y, Z). If we have a 4-dimensional object, we can ideally shrink it to a point in 4D space and identify its location using movements along 4 axes. A 4-dimensional cube is called a tesseract.

In 4 dimensions a simple object like the cube we are used to becomes very complex. Rather than straining our minds to see things in 4 dimensions and above, we can simply use the abstraction of a vector of numbers representing a point in whatever space we are dealing with.

With this system, a point in 20-dimensional space can simply be represented by a vector of 20 numbers; you don't need to try to visualize it.

If you have read some book on machine learning or AI, the first thing that might cause confusion when you get to an example of training a neural network to recognize MNIST digits is seeing an image, which looks like a square (and you know squares have 2 dimensions), transformed into something with 784 dimensions! I was confused myself when I first encountered this.

The main purpose of this subsection is to clarify these kinds of confusions and enable you to know exactly what someone is talking about when they mention dimensions. It is usually helpful to decode their exact meaning from the context of what they are talking about.

Some authors, for clarity's sake, will not say that an image is in 2 dimensions; rather they will say that the pixel values for the image are represented on a square grid. But changing the name doesn't help much, because you know that a square grid is a matrix and a matrix is 2-dimensional.

What is usually referred to when talking about something like an image with input pixels flattened to 784 dimensions is this. If every pixel is movement along some coordinate in some space just like the X element in the cartesian coordinate is a movement along the X direction in the cartesian coordinate system, then the image itself is a point in 784-dimensional space.

Take the MNIST image recognition task, where the goal of training is to make sure that a handwritten image of a 9, when passed to the network, returns a list of probabilities in which the highest is the prediction that the input was a 9. Since we are dealing with a list of 10 probabilities here, we can say that we are dealing with a 10-dimensional space, where each probability is a movement along some axis and the entire list is a point in that space. The network maps the 784-dimensional vector of a handwritten image to the correct prediction in 10-dimensional space.

Earlier on I talked about a concept of dimension with respect to a floating point number, with the magnitude in fixed precision being one dimension and the precision itself being another. I went on to say that the brain, unlike our artificial neural networks, doesn't use this kind of 2-dimensional symbol system to encode individual properties of the information that it encounters.

A matrix of pixel values can also be seen as a 2-dimensional representation of an image which the human visual system can recognize as one when the pixels are turned into colours and presented on some medium, like paper or on a screen. In practice, the image is separated into 3 aspects represented by the Red, Green and Blue channels that come together to make a full-colour image. Together the 3 matrices are called a tensor, a 3 tensor which is a volume. A collection of these volumes is a 4 tensor and a collection of these 4 tensors is a 5 tensor.

No matter the degree of a tensor, it can always be flattened to a 1-dimensional vector in which each element is one coordinate, so that the whole tensor becomes a single point in a space with as many dimensions as the vector has elements.
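A small sketch of this idea, using nested Python lists as a stand-in for tensors (an assumption for illustration; numerical libraries store tensors differently). The same recursive function flattens a 3-tensor and a 4-tensor, and the length of the result is the number of dimensions of the space in which the tensor is a single point.

```python
def flatten(tensor):
    """Recursively flatten nested lists of any depth into one flat list."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat

# Hypothetical 3-tensor: 3 colour channels, each a 2x2 matrix.
rgb = [
    [[1, 2], [3, 4]],      # red
    [[5, 6], [7, 8]],      # green
    [[9, 10], [11, 12]],   # blue
]

# A 4-tensor: a collection of 3-tensors.
batch = [rgb, rgb]

print(len(flatten(rgb)))    # 12: the image is a point in 12-dimensional space
print(len(flatten(batch)))  # 24
```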

In general, a dimension is a collection of points representing singular properties of a particular system, call it movements in certain directions of the space or whatever. This collection of points can itself be a point in a larger collection, forming nested dimensions, just like the dimension of magnitude with a fixed precision and the precision itself are two dimensions of a floating point number which when in a collection like a vector represents a point in some dimensional space.

Actually, a floating point number can be treated as one-dimensional, as just the magnitude of the number. But when we fix the precision, we can either vary the magnitude within the boundary defined by that precision or alter the precision itself.

Let us see a practical example. Each of the 5 tastes that the tongue can identify is a direction in the space of possible tastes. Anything we can taste with our tongues can be described by 5 numbers, each representing the magnitude of one of the five tastes: sweetness, sourness, saltiness, bitterness and umami. The magnitude of each taste can be represented by a single floating point number or any number system of our choice, i.e. how sweet something is can be on a scale of 0 to 10, or from 0.0 to 1.0 with a single digit of precision. No sweetness at all could be 0.0, sweet could be 0.5 and very sweet could be 1.0. The precision dimension gives us the degree to which we want to distinguish between different states of sweetness, or the granularity of our measurement. If we can distinguish 100 degrees of sweetness then we could use 2 digits of precision, like 0.64.

The vector containing the individual properties of taste describes a point in a space of possible tastes, so we can describe the taste of something like an apple using 5 numbers like {0.2, 0.6, 0.3, 0.1, 0.2}. This is just an arbitrary list and has nothing to do with the actual measurement of the taste properties of any actual apple, but you get the gist. The size of that space depends on the precision of the number used to describe the granularity of a particular property in that space. If we cared about only 10 degrees of variation per property, such as saltiness, the size of that space (the maximum number of points it can contain) is 10^5 or 100000 (^ means to the power). So with 10 degrees of variation per property, we can have 100000 possible tastes.
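The size of the taste space can be checked both by formula and by brute-force enumeration. This sketch assumes 5 tastes with 10 distinguishable levels each; the apple vector is the arbitrary one from the text, and the variable names are made up.

```python
from itertools import product

TASTES = ("sweet", "sour", "salty", "bitter", "umami")
LEVELS = 10  # degrees of variation per taste

# Closed-form size of the space: LEVELS raised to the number of tastes.
space_size = LEVELS ** len(TASTES)

# Confirm by enumerating every possible 5-number taste vector.
enumerated = sum(1 for _ in product(range(LEVELS), repeat=len(TASTES)))

# The arbitrary "apple" vector from the text, one number per taste.
apple = dict(zip(TASTES, (0.2, 0.6, 0.3, 0.1, 0.2)))

print(space_size, enumerated)  # 100000 100000
```

With 100 levels per taste (2 digits of precision) the same formula gives 100 ** 5, which is ten billion points, so the space grows very quickly with precision.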

The brain doesn't break up an image into pixels like machines do. Rather, it takes in the whole scene it is observing and transforms that scene into some internal representation, which abstractly can be represented with some data structure but concretely is intimately related to the physical properties of the brain beyond just the weighted neural connections. Even if the scene is limited to a single object, by blanking out everything in the background with something like white, the human brain takes in the whole structure of the object at once and represents it in the network structures responsible for visual recognition and in all the other centres that process other properties. And if we isolate the neural network that records a single item, it won't be just weighted connections that contribute to the representation but a lot of other things, like the internal state of each neuron and all its complex metabolism, dendritic surfaces, neurotransmitter levels, etc.

Generalizing, we can say that the symbol used to represent a new observation is some expression of the state of the brain that is induced by the perception of that object. The brain recognizes the object when it encounters it again by recreating the state it was in when it encountered it the first time, and with appropriate comparison it can perform further processing leading to identification.

So the brain could be fully economical with information: rather than performing dimensionality reduction by throwing away information that it considers irrelevant, it could simply invent a new symbol to encode the new level of meaning it has extracted from the data. This is much more similar to the kind of abstraction we witness in computer science. At the lowest level of the abstraction hierarchy, the circuit level, we have voltage levels determining what is on and off. Then we have machine language for manipulating these voltage levels to represent our computations. Machine language usually consists of strings of 0s and 1s, a form that we humans can interact with, but in order to become the actual control of logic gates on the circuitry, an interface converts the strings that we input into voltage levels. The circuitry then uses some of our input as instructions and some as the data on which those instructions operate, and performs the operations that transform our input into our output.

The next level of the computer abstraction is assembly language, where we don’t throw away machine code but rather wrap up some sequence of instructions and data, which is called a program, in another symbol system: mnemonics made of characters that stand for raw binary instructions. There is no loss here. Rather than transforming information in a lossy fashion from one layer to the next, as is done in artificial neural networks, we map the binary information to another layer of symbols made of word characters, where a single statement in these characters represents a bunch of underlying machine instructions, which end up as voltage levels on the circuit.

This is the kind of mechanism that the brain uses as it abstracts/compresses away the “details” out of the raw information streams. It's a transformation from one symbol set into another. As we continue borrowing understanding from the abstraction hierarchy of the computer system, it is worth including the level of abstraction known as the high-level language level. At this level, more compression takes place. We should view compression here not as saving space in the physical universe. When a bunch of machine instructions gets converted into assembly language, we have actually added more information to the physical universe, because we must maintain a mapping between machine instructions and assembly instructions somewhere. This uses more atoms of the universe, so we have not succeeded in “compressing” the information itself. What we compress here is action: the amount of typing and cognition that goes into forming higher concepts is reduced the higher we go up the abstraction ladder.

We can use larger chunks of knowledge to compose other larger chunks of knowledge without thinking too much about the lower mechanisms that are required for action. If we use a high-level computer language construct like Print[“Hello World”], we do not need to think about how the code is compiled into assembly language by a compiler, which is a bunch of code that translates high-level constructs into assembly language, or how another piece of software called an assembler eventually translates the assembly language into machine code, which becomes direct controls for the electronic circuitry that does the actual computation. The results of the computation, which are states on the electronic circuitry, are then translated back up the hierarchy, and you receive your results in a form that is easier for you to consume.

In neural networks, we are actually doing this kind of meaning transformation by lossy compression. Another way to look at neural networks is as a bunch of filters that we use on raw data containing lots of noise or detail that is not important to a higher conception of the information. There is always some dimensionality reduction going on because the original data, let's say an image, is usually of a higher-dimensional nature than the classes we wish to obtain out of the training effort. When we use a mechanical filter to separate larger items from smaller ones, we usually shake the filter so that the small stuff passes through and the big stuff stays back; which of these is important to us is inconsequential to this discussion. The point is that we are trying to draw an analogy to the filtering process. The neural network layers say no to certain kinds of data, mostly by thresholding using the activation function, and yes to others. The perturbation of the network by the backpropagation algorithm is similar to shaking a filter so that the important stuff either stays above or goes below.

So what is an example of a detail that some neural network layer might filter out? One interesting one to consider is translational invariance; translation means moving from one place to another, and invariance means not changing. Suppose we had images of the same dog in multiple locations against an unchanging background: in one image the dog is at the upper left corner, in another it is at the lower right, in another at the centre, some inches to the left, and so on. We want the network to be able to say that a dog is a dog independent of where in the picture we find the dog. The details of the locations will be abstracted/compressed/filtered away! This is the kind of information that is filtered away by the training process occurring at a certain layer. The invariant here is the image of the dog, and that which can vary is the location in the picture. The feature of interest here is the dog itself!
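One simple way a network can throw away location is to slide a feature detector over the whole image and keep only the strongest response. The sketch below does that with a made-up 2 by 2 pattern standing in for "the dog"; the helper names are hypothetical, and real networks reach approximate translational invariance through learned convolution filters and pooling rather than a hand-written template.

```python
def place_patch(size, top, left, patch):
    """Return a size x size image of zeros with the patch pasted at (top, left)."""
    image = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(patch):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

def detector_response(image, template):
    """Slide the template over the image and keep only the strongest match."""
    th, tw = len(template), len(template[0])
    best = float("-inf")
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            score = sum(image[r + i][c + j] * template[i][j]
                        for i in range(th) for j in range(tw))
            best = max(best, score)
    return best

# A made-up 2x2 "dog" pattern standing in for a learned feature.
dog = [[1.0, 0.5],
       [0.5, 1.0]]

upper_left = place_patch(8, 0, 0, dog)
lower_right = place_patch(8, 6, 6, dog)
empty = place_patch(8, 0, 0, [[0.0]])

a = detector_response(upper_left, dog)
b = detector_response(lower_right, dog)

print(a == b)  # True: the response does not depend on where the dog is
```

Taking the maximum over all positions is exactly the step that discards the location: the output says how strongly the pattern was found, not where.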

There are other layers that can eliminate other kinds of detail to give, for example, rotational invariance: a dog is a dog whether upside down or mirror reflected. Other layers can be tasked with capturing what a dog really is, independent of breed. Some layers are for identifying features like the lines that make up a dog, etc. At the end of it all, the most generalized representation of a dog is the result of the training process, and it is on that generalization that any new image of a dog, supposing the network is well trained on large amounts of data, will be correctly inferred as a dog.

In the attempt to reach a single goal, which is identifying a particular class of objects in the image, we lose a lot of other information which of course looks irrelevant to our current task. Our brain doesn't filter out information like this.

Rather it learns everything learnable from a scene. I don't know exactly how many things it learns, but it is a lot, because all that it learns may become relevant in other thinking scenarios much later on. That is why you could walk down a street with your eyes gazing but paying no particular attention to anything, and days later hear on the news that something happened, a few minutes after you walked past, to a car that was parked on that street. You recall that you saw that car, and depending on your mental abilities you can recall some facts about the car, the weather that day, sounds you heard, who you saw walking by, etc.

While you were walking by that day your conscious mind must have been engaged in other tasks but your subconscious took notice of the environment and thus learned a lot from it. Your brain is doing much more than you're consciously aware of.

As we develop the field of artificial neural networks I think we should be moving towards these kinds of networks that learn more than what we are currently interested in so that at a later time we can query these systems for other kinds of features and they will readily adjust.

In a greyscale image, each pixel can take one of only 256 possible values (0 to 255), and we can pass that image into some deep neural network for feature extraction directly without needing a convolution step. All that is required is some flattening, which converts the square matrix that represents the image, 28 by 28 pixels in the case of MNIST, into a single 784-dimensional vector. In colour images, we usually have a 3 tensor and not just a 2D matrix, so a convolution step is normally used.
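
As a small illustration of that flattening step, the sketch below lays the rows of a 28 by 28 image end to end. The image contents are made up and `flat_index` is a hypothetical helper; the point is only that 28 times 28 pixels become one list of 784 numbers with nothing lost.

```python
ROWS, COLS = 28, 28  # MNIST-sized image, assumed here

# Hypothetical greyscale image: every pixel is an integer from 0 to 255.
image = [[(r * COLS + c) % 256 for c in range(COLS)] for r in range(ROWS)]

# Row-major flattening: the rows are laid end to end.
flat = [pixel for row in image for pixel in row]


def flat_index(row, col):
    """Position of pixel (row, col) inside the flattened vector."""
    return row * COLS + col


print(len(flat))                              # 784
print(flat[flat_index(3, 5)] == image[3][5])  # True
```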

This is because colour images are usually made up of 3 colour components in the RGB colour system, with R for red, G for green and B for blue. The image is effectively stored 3 times, once per colour component. So rather than having a single 2D matrix to deal with, we have three 2D matrices, which together are known as a 3 tensor.

Convolving an image already represented as a 3 tensor yields a simpler representation of the image with certain dimensions reduced, so as to reduce the number of parameters the network has to deal with. After the convolution step and a pooling step, the resulting structures are flattened and passed on to the fully connected layers as a one-dimensional vector so that the classic deep learning algorithms can do their work.
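
The shrinking effect of convolution followed by pooling can be sketched in a few lines of plain Python. For brevity this uses a single channel rather than a full 3 tensor, and the toy 6 by 6 input, the 3 by 3 kernel and the function names are all assumptions made for illustration.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation most CNN layers use."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + r][j + c] * kernel[r][c] for r in range(kh) for c in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + r][j * size + c] for r in range(size) for c in range(size))
            for j in range(w)
        ]
        for i in range(h)
    ]


image = [[float((r + c) % 4) for c in range(6)] for r in range(6)]  # toy 6x6 input
kernel = [[1.0, 0.0, -1.0]] * 3  # hypothetical 3x3 vertical-edge kernel

features = convolve2d(image, kernel)         # 6x6 -> 4x4
pooled = max_pool(features)                  # 4x4 -> 2x2
vector = [v for row in pooled for v in row]  # 2x2 -> 4 values for the dense layers

print(len(features), len(features[0]), len(pooled), len(pooled[0]), len(vector))
```

The 36 input values end up as a vector of 4, which is the kind of reduction the fully connected layers then work with.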

As varied and complex as all these deep learning systems are, in the final analysis they extract features, sometimes using dimensionality reduction or expansion depending on the network design, and return a vector of probabilities as the result.

Sometimes in a neural network we have to go wide before we go deep, which is another scenario that can increase the dimensionality of some input vector. This is done in some scenarios, for practical reasons, to help a network do more memorization than generalization. The insight here is that when we reduce the dimensionality in deep networks we are removing “noisy” features so we can extract the features of high importance, the ones that are general to the class we are dealing with. So if we are training a deep network to recognize cats and we have submitted all kinds of cat photos in all kinds of postures and environments, and of different breeds, we want to find a representation of cats that is independent of these varying special conditions. The question that our deep network is trying to answer is: what really composes the image of a cat?

In wide networks, we might want to memorize rather than generalize. Generalization and memorization are usually seen as opposites when designing neural networks. There is usually a scale we keep in mind when thinking about our networks: on one end is generalization, which is the goal of deep networks, and on the other end is memorization, which results from overcapacity.

Although we might never reach the highest possible level of generalization, we are usually okay with a level that produces a low loss during training and also good accuracy during inference on new data. If the network capacity is too large, it will tend to memorize more than generalize. If our goal is strictly generalization then this is of course not desirable. But there are scenarios where memorization is also a good thing, and people have found practical ways to exploit it.

So, in essence, dimensionality reduction enables us to handle data in a compact way. This reduction can be seen as compression, where some longer vector (list) is reduced to a shorter vector. In a deep learning system in order to get at the features of the data, we usually do some kind of lossy compression of the data because we are working in a limited symbol system.

What is a symbol system? This is the format we choose to represent the data in. Computers only know how to manipulate numbers, and whatever we store or manipulate on a computer is merely numbers running in some electronic circuits. Of course, if we go lower in our understanding, the “numbers” are themselves another representation of the voltage levels on the computer circuitry. So, in essence, there are no numbers, only voltage levels that we represent as numbers for ease of understanding. Up the abstraction hierarchy are the images we deal with in typical deep learning systems, but we are not limited to images; we can learn on all kinds of data, even sounds.

When we are representing images to a neural network (NN), we use floating point numbers, which are numbers with a decimal point like 2.3445. In this article I use the precision of a floating point number to mean the number of digits after the decimal point. In the number written above the precision is 4 because there are 4 digits, .3445, after the decimal point. The precision matters because it determines the granularity of detail that our network can handle. A higher precision number can capture a greater degree of detail than a lower precision number. This is analogous to the binary number system used in a computer, where the length of a bit string determines how many distinct things you can encode. A 3-bit sequence can encode 8 things, from 000 to 111. 4 bits can encode 16 things. The formula for how many things a particular bit length can encode is 2^N (^ means to the power, and N is the length of the bit string). We have something similar with floating point numbers: the greater the precision, that is the number of digits after the decimal point, the greater the number of things you can encode. Pixel values are represented as these floating point numbers, although on the computer circuitry they are still binary, that is 1 and 0.
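
The counting argument above is easy to check. The sketch below uses two hypothetical helper functions: one for the 2^N rule for bit strings, and one for the number of distinct values between 0 and 1 when precision is taken, as in this article, to mean digits after the decimal point.

```python
def bit_patterns(n_bits):
    """Number of distinct values an n-bit string can encode: 2^N."""
    return 2 ** n_bits


def decimal_levels(digits):
    """Distinct values from 0 to 1 inclusive with `digits` digits after the point."""
    return 10 ** digits + 1


print(bit_patterns(3), bit_patterns(4))  # 8 16
print(decimal_levels(4))                 # 10001 values: 0.0000, 0.0001, ..., 1.0000
```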

The NN manipulates these vectors or tensors of numbers and crunches high-dimensional stuff into some class representation in the process that we call learning! There are also weights associated with the “neurons”, which are the connection strengths between “neurons” in the neural network. This is all very abstract stuff, and the neurons we are talking about bear only a loose relation to the neurons in the human brain. They are a simplification derived from neuroscience: the brain contains neurons and synaptic connections between them, and in the paradigm known as Hebbian learning, proposed by Donald Hebb in 1949, learning happens when a group of neurons adjust their connection strengths in response to some input data so as to represent that data in a generalized format as weights.

Neural networks take the input data, perform some operations on it with randomly chosen weights, and perturb the weights until the weights that best generalize the input data are found. Also, the dimensions are squished until the data ends up as a vector of values representing the probabilities of each possible inference about the input.
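
Here is a minimal sketch of that perturbation loop, reduced to one weight so it stays readable. The perturbation rule used is plain gradient descent on a mean squared error, and the toy data, learning rate and step count are all assumptions chosen for illustration.

```python
import random

random.seed(0)

# Toy data following y = 2x, so the "right" weight is 2.0.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

weight = random.uniform(-1.0, 1.0)  # randomly chosen starting weight
learning_rate = 0.01                # assumed step size


def mean_squared_error(w):
    """Average squared gap between the prediction w * x and the target y."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


for step in range(200):
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    weight -= learning_rate * gradient  # nudge the weight downhill

print(round(weight, 3), round(mean_squared_error(weight), 6))
```

Whatever the random starting point, the weight settles very close to 2.0. A real network does the same thing for millions of weights at once, with backpropagation supplying the gradients.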

Because we are dealing in a symbol system that can change in only one “dimension”, the kinds of features we can extract can only vary in one dimension at a time. By dimension here I mean that we can either increase or reduce the magnitude of the floating point number that represents some weight in the network. This I call a single dimension of change, quite different from the use of the word dimension to refer to the length of a list (vector) of values. The precision of the floating point numbers stays the same during the training process. If the precision could increase and reduce during training, I would say that the number changes in two dimensions: the magnitude within a fixed precision, and the precision itself.

To avoid confusion, let me clear things up. The floating point values used to represent the weights in a typical neural network change in magnitude (that is one dimension) only within a strict precision boundary (that is, the length of the digit string constituting the number)! To accommodate a particular number of colours we could have values like 0.0000 representing black and 1.0000 representing white. All numbers in between, like 0.4567, could represent some other shade. The detail of the colour space that can be represented depends on how many individual values occur between 0.0000 and 1.0000.

Our neural networks compress information as they transform image data into a class through the network. At the input level we have a vector or tensor of floating point numbers that represents the image itself, but as we apply operations through all the layers we transform the pixel values to mean other things, like “features” and, in the end, the label of the image. What was just raw pixel values is transformed into a vector of class probabilities, which means something quite different from pixel values.

The limitations of a symbol system like fixed precision floating point numbers require that some “information” be dropped as we transform the data from one set of things that can be interpreted as “images” to another set of things that can be interpreted as “probabilities”.

But from practical neural network training, it has been found that just 16 bits of precision are enough to represent the features that are important in many tasks that NN researchers are interested in. During inference, 8 bits of precision have often been found to be sufficient.
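
Reduced precision at inference time is often implemented as 8-bit integer quantization. The sketch below shows one simple scheme, symmetric quantization with a single shared scale factor. The weight values and function names are hypothetical, and real frameworks use more elaborate variants.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all zeros
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print([round(abs(a - b), 4) for a, b in zip(weights, restored)])
```

The small weight 0.003 is rounded to 0, which is exactly the kind of dropped “information” discussed above, while the larger weights survive almost unchanged.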

Although we might be losing information due to our reliance on fixed precision, this is not really the point I am trying to emphasize. I said earlier that we are only capturing one dimension of information by using the magnitude of our floating point numbers to represent the features. If we are strictly interested in capturing only the visual information required to identify some object then this is all we need. But humans usually capture multiple things at once from a visual scene, and that is why we must build separate neural nets to capture separate things.

The visual information we receive contains various properties. If the goal of a neural network is to extract features for object identification, it will do just that. Although the hierarchical nature of the feature extractors, that is the layers of the neural network, filters the data so that we go from low-level features like edges to high-level features like faces, these details can only be captured in the single dimension of the magnitude of floating point numbers; other dimensions of the image must be discarded. This is very inefficient because we have to train and train again to adapt the network to other tasks, either using transfer learning or by building a new network.

If we were trying to classify textures we would build a new network that represents various textures with the magnitude of floating point numbers. Every neural network has to be built for one thing, and if we want to reuse features we usually resort to transfer learning: holding certain parts of some pretrained network fixed, grafting on another part that is specialized to our task, and training it. This is sometimes successful, but at other times we have to train an entire network architecture.

The details are thrown away in typical neural network configurations, as only the most general representation is kept for the purpose of classification, yet those details might have other uses in the way a human mind performs pattern recognition. It is my opinion that the human brain does not operate on a fixed symbol system like floating point numbers, which can be altered in only 2 dimensions, the magnitude within a fixed precision and the precision itself. Rather, the brain preserves all input information and invents a new arbitrary symbol with sufficient properties to encode the input. This arbitrary symbol is a full-fledged network data structure, not just a single number like a floating point number. And although the relationships in the network can be modelled numerically, numbers of any kind are not used directly to represent features in the brain.

All new input of a similar class will be mapped to a slightly altered representation of one encoding or another. If something new and unfamiliar is encountered, a new symbol is created internally to represent it, and if something similar to that once unfamiliar thing is later observed, the symbol that was encoded will be altered in some aspect (dimension).

In all their complexity, our neural networks represent all their weight information in 2-dimensional weight matrices (i.e. they can grow in width or height, spatially), that is, rectangular grids of numbers.

In the human brain the internal representation is not limited by a 2 dimensional weight matrix of floating point numbers but rather the symbol generated could be in any number of dimensions represented as the physical properties of the neurons beyond just the weight of the connections which is the core idea that gave rise to our artificial neural networks with weighted connections.

Each property of the neuron and its connections could represent a dimension in which an internal representation symbol is encoded in and a clique of neurons could represent many similar input structures by merely altering variables like neurotransmitter levels acting on a relatively fixed network.

SOME NOTES ON DIMENSIONS

For want of a better word, I use the word dimension to denote different things, and it is right that I clarify this usage before we move further. The explanation here will clarify a lot about the usage of the word dimension in other parts of this work.

The key to understanding the word dimension as I use it in this work is to pay attention to the context in which I am using it. It's complicated, I know, but I will try my best to do some explaining here. Hope it helps.

In mathematics, we build up dimensions starting from the point. Several points align to form a line, and this line is called the first dimension. Several lines lined up side by side form a plane, which is the second dimension, and several planes stacked upon each other form a cube, the third dimension.

Honestly, these descriptions of dimensions look arbitrary to me, but they work for most purposes. We can generalize the cube to something called Cartesian coordinates, that is the typical X, Y, Z coordinate system. The Cartesian coordinate system can be used to locate any object in space by enclosing it in an imaginary box, choosing one vertex of the box as the origin, and taking the 3 edges that go out of this vertex as the X, Y and Z directions.

By Jorge Stolfi - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=6692547

To find the object we only need to move in 3 directions, a certain distance in X, then in Y and then in Z. Theoretically, the order in which we move is irrelevant as we can arrive at the object even though we start moving in the Z directions first. We store the information of how to arrive at the object in a Vector (List) of 3 values (X, Y, Z) with each value representing how much we must move in each direction. This location information is all we need to find some object using the coordinate system.

We say that the world we see is 3 dimensional because anything in it can be pinpointed by moving in only 3 directions. We say an object is 3 dimensional because if we place it in some cartesian coordinate system like we have described above, we can locate all the surface points on the objects using the coordinate system.

When we are faced with the problem of how to represent objects in higher dimensions, we can simply generalize the vector representation. A point in 3-dimensional space is represented as a vector with 3 items giving the movements we must make in the 3 directions to reach the location of an object. For something like 4 dimensions, which is hard to visualize in 3-dimensional space, we can simply say that a point is represented as a vector with 4 items like (W, X, Y, Z). If we have a 4-dimensional object we can ideally shrink it to a point in 4D space and identify its location using movements along 4 axes. A 4-dimensional cube is called a tesseract.

By The original uploader was Tomruen at English Wikipedia. - Transferred from en.wikipedia to Commons by Jalo using CommonsHelper., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=4694970

In 4 dimensions a simple object like the cube we are used to becomes very complex. Without straining our minds to see things in 4 dimensions and above, we can simply use the abstraction of a vector of numbers representing a point in whatever space we are dealing with.

With this system, a point in 20-dimensional space could simply be represented by a vector of 20 numbers, you don't need to try to visualize it.

If you have read some book on machine learning or AI, the first thing that might cause confusion when you get to an example of training a neural network to recognize MNIST digits is seeing an image, which is like a square, and you know squares have 2 dimensions, transformed into something with 784 dimensions! I was confused myself when I first encountered this.

The main purpose of this subsection is to clarify these kinds of confusions and enable you to know exactly what someone is talking about when they mention dimensions. It is usually helpful to decode their exact meaning from the context of what they are talking about.

Some authors, for clarity's sake, will not say that an image is in 2 dimensions; rather they will say that the pixel values for the image are represented on a square grid. But changing the name doesn't help much because you know that a square grid is a matrix and a matrix is 2-dimensional.

What is usually meant when talking about something like an image with input pixels flattened to 784 dimensions is this: if every pixel value is a movement along some coordinate in some space, just like the X element of a Cartesian coordinate is a movement along the X direction, then the image itself is a point in 784-dimensional space.

Take the MNIST image recognition task, where the goal of training the network is to make sure that a handwritten image of a 9, when passed to the network, returns a list of probabilities in which the highest is the prediction that the input was a 9. Since we are dealing with a list of 10 probabilities here, we can say that we are dealing with a 10-dimensional space, where each probability is a movement along some axis and the entire list is a point in that space. The network maps the 784-dimensional vector of a handwritten image to the correct prediction in 10-dimensional space.
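
The list of 10 probabilities is commonly produced by a softmax function applied to the raw outputs of the last layer. The text does not name softmax, so treat that choice, and the made-up scores below, as assumptions for the sake of a small sketch.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer, one per digit 0..9.
scores = [0.1, -1.2, 0.3, 0.0, 1.1, -0.5, 0.2, 0.4, 0.9, 3.5]

probabilities = softmax(scores)  # one point in 10-dimensional space
predicted_digit = max(range(10), key=probabilities.__getitem__)

print(len(probabilities), predicted_digit)  # 10 9
```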

Earlier on I talked about a concept of dimension with respect to a floating point, that is the magnitude in fixed precision being one dimension while the precision itself being another. I went ahead to say that the brain, unlike our artificial neural networks, doesn't use this kind of 2-dimensional symbol system to encode individual properties of the information that it encounters.

A matrix of pixel values can also be seen as a 2-dimensional representation of an image which the human visual system can recognize as one when the pixels are turned into colours and presented on some medium, like paper or on a screen. In practice, the image is separated into 3 aspects represented by the Red, Green and Blue channels that come together to make a full-colour image. Together the 3 matrices are called a tensor, a 3 tensor which is a volume. A collection of these volumes is a 4 tensor and a collection of these 4 tensors is a 5 tensor.

No matter the degree of a tensor, it can always be flattened to a 1-dimensional vector, where each element is a coordinate in the space in which such a tensor is a single point.
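
A short recursive sketch shows that the same flattening works for a tensor of any degree. Nested Python lists stand in for tensors here, and the tiny 2 by 2 colour image is invented for the example.

```python
def flatten(tensor):
    """Recursively flatten nested lists of any depth into one flat list."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


# Hypothetical 3 tensor: 3 colour channels, each a 2x2 matrix.
rgb = [
    [[255, 0], [0, 0]],    # red channel
    [[0, 255], [0, 0]],    # green channel
    [[0, 0], [255, 255]],  # blue channel
]
batch = [rgb, rgb]  # a 4 tensor: a collection of two volumes

print(len(flatten(rgb)), len(flatten(batch)))  # 12 24
```

The 3 tensor becomes a point in 12-dimensional space and the 4 tensor a point in 24-dimensional space.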

In general, a dimension is a collection of points representing singular properties of a particular system, call it movements in certain directions of the space or whatever. This collection of points can itself be a point in a larger collection, forming nested dimensions, just like the dimension of magnitude with a fixed precision and the precision itself are two dimensions of a floating point number which when in a collection like a vector represents a point in some dimensional space.

Actually, a floating point number can be represented in one dimension as just the magnitude of the number, but once we fix the precision we can either vary the magnitude within the boundary defined by that precision or alter the precision itself.

Let us see a practical example. Each of the 5 tastes that the tongue can identify is a direction in the space of possible tastes. Anything we can taste with our tongues must be described by at least 5 numbers, each representing the magnitude of one of the five tastes: sweetness, sourness, saltiness, bitterness, and umami. The magnitude of each taste can be represented by a single floating point number or any number system of our choice, i.e. how sweet something is can be on a scale of 0 to 10 or from 0.0 to 1.0. No sweetness at all could be 0.0, sweet could be 0.5 and very sweet could be 1.0. The precision dimension gives us the degree to which we want to distinguish between different states of sweetness, or the granularity of our measurement. If we want to distinguish 100 degrees of sweetness then we could use 2 digits of precision, like 0.64.

The vector containing the individual properties of taste describes a point in a space of possible tastes, so we can describe the taste of something like an apple using 5 numbers like {0.2, 0.6, 0.3, 0.1, 0.2}. This is just an arbitrary list and has nothing to do with the measured taste properties of any actual apple, but you get the gist. The size of that space depends on the precision of the numbers used to describe the granularity of each property. If we cared about only 10 degrees of variation per property, such as saltiness, the size of that space (the maximum number of points it can contain) is 10^5 or 100000 (^ means to the power). So with 10 degrees of variation per property, we can have 100000 possible tastes.
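
The size of the taste space can be checked by brute force. With 5 properties and 10 levels each, the count works out to 10^5 = 100000. The sketch below computes it both from the formula and by enumerating every combination; the names `TASTES` and `LEVELS` are just labels chosen for this example.

```python
from itertools import product

TASTES = ("sweet", "sour", "salty", "bitter", "umami")
LEVELS = 10  # degrees of variation per taste, as in the text

# Every possible taste is one choice of level for each of the 5 properties.
space_size = LEVELS ** len(TASTES)
enumerated = sum(1 for _ in product(range(LEVELS), repeat=len(TASTES)))

print(space_size, enumerated)  # 100000 100000
```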

The brain doesn't break up an image into pixels like machines do, but rather takes in the whole scene it is observing and transforms that scene into some internal representation, which abstractly can be represented with some data structure but concretely is intimately related to the physical properties of the brain beyond just the weighted neural connections. Even if this scene is limited to a single object by blanking out all other objects in the background with something like white, the human brain takes in the whole structure of the object at once and represents it in the network structures that are responsible for visual recognition and in all the other centres that process other properties. And if we were to isolate the neural network that records a single item, it wouldn't be just weighted connections that contribute to the representation but a lot of other things, like the internal state of the neuron and all its complex metabolism, dendritic surfaces, neurotransmitter levels, etc.

Generalizing, we can say that the symbol used to represent a new observation is some expression of the state of the brain that is induced by the perception of that object. The brain recognizes the object when it encounters it again by recreating the state it was in when it encountered it the first time, and with appropriate comparison it can perform further processing leading to identification.

* * *

The next level of the computer abstraction is the level of assembly language, where we don't throw away machine code but rather wrap up some sequence of instructions and data, which is called a program, in another symbol system: mnemonics made of characters that stand for raw binary instructions. There is no loss here because, rather than transforming information in a lossy fashion to another layer as is done in artificial neural networks, we map the binary information to another layer of symbols made of word characters, where a single statement stands for one or more underlying machine instructions, which end up as voltage levels on the circuit.

This is the kind of mechanism that the brain uses as it abstracts/compresses away the “details” out of the raw information streams. It's a transformation from one symbol set into another. As we continue borrowing understanding from the abstraction hierarchy of the computer system, it will be nice to include the level of abstraction known as the high-level language level. At this level, more compression takes place. We should view compression here not as saving space in the physical universe. When a bunch of machine instructions gets converted into assembly language we have actually added more information to the physical universe, because we must maintain a mapping between machine instructions and assembly instructions somewhere. This actually uses more atoms out of the universe, so we have not succeeded in “compressing” the information itself. What we gain here is economy of action, meaning that the amount of typing and cognition that goes into forming higher concepts is reduced the higher we go up the abstraction ladder.

We can use larger chunks of knowledge to compose other larger chunks of knowledge without thinking too much about the lower mechanisms that are required for action. If we use a high-level computer language construct like Print["Hello World"], we do not need to think about how the code is compiled into assembly language by a compiler, which is a piece of software that translates high-level constructs into assembly language, or how another piece of software called an assembler then translates the assembly language into machine code, which becomes direct controls for the electronic circuitry that does the actual computation. The results of the computation, which are states of the electronic circuitry, are then translated back up the hierarchy and you receive your results in a form that is easier for you to consume.
