What is Intelligence: Beyond Images

BEYOND IMAGES

So far we have been talking about Convolutional Neural Networks as applied to problems of image classification, and I have mentioned recurrent neural networks (RNNs) in passing in previous chapters. But you may be wondering how humans learn information that is not static like images but sequential in nature: information that may be spatially (in space) or temporally (in time) related.


This is where RNNs come in. With image classification or regression tasks, we try to learn a representation of the input data so that we can make predictions about new inputs. With RNNs, we learn a representation of sequential data so that we can generate similar sequences with high confidence.

What actually is a sequence? A sequence is a vector, or a list of items, that are either related in space, like words written on paper, or related in time, like speech spoken by some entity. What is important is how we treat the vector that is input into the system.
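To make this concrete, here is a tiny sketch of my own (not taken from any library or dataset) showing the two kinds of sequences just described:

```python
# A spatially related sequence: characters written side by side on a page.
word = ["c", "a", "t"]

# A temporally related sequence: values that arrive one after another in time,
# e.g. hypothetical daily temperature readings.
temperatures = [21.5, 22.1, 19.8, 20.4]

# In both cases what matters is the order of the elements,
# not just the set of values they contain.
print(word, temperatures)
```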

In typical classification we feed in a vector containing pixel values, and although pixel values are spatially related because they occur side by side in space, we usually want to observe all the pixels at once to get the meaning of what we are observing. Given a vector of pixels that constitutes an image of a cat, we know beforehand that the vector represents a single image, so we build a neural network to extract the features as we have described previously. But what if we were given a sentence, or a bunch of sentences, and asked to build a system that “understands” them?

When we want to use text as input to some kind of neural network, we usually generate an embedding vector, which is a numeric representation of the text. The only representation our machines can use is numeric, and thus text, just like images, must first be converted to numbers before we feed it into the system. After that, we pass it through all the required layers of our model and at the other end we can say that a piece of text falls into a particular class, or even predict some kind of number in a regression task. The other scenario is when we want to generate text that has some kind of relationship to what we have input, which is where RNNs come into play.
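As a rough sketch of what this conversion looks like in practice (the toy vocabulary, the sentence, and the embedding size of 8 here are all made up for illustration), we can map each word to an integer index and then look up a learned vector for it:

```python
import torch
import torch.nn as nn

# A toy vocabulary; in a real system this would be built from the training corpus.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# An embedding layer maps each integer index to a dense numeric vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sentence = "the cat sat on the mat"
indices = torch.tensor([vocab[word] for word in sentence.split()])

vectors = embedding(indices)  # shape: (6, 8) -- one vector per word
print(vectors.shape)
```

The embedding values start out random and are adjusted during training, so the numeric representation of each word is itself something the network learns.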

Most of the time when we use the word "understanding", we are not clear on exactly what we mean. Words are muddy, and I will go into detail much later on why words and their meanings can be a great source of confusion when trying to solve problems.

Understanding in its clearest sense simply means transferring something from one representation to another that we are more comfortable with. If we draw a picture of the solar system, it is because we are choosing a representation that will give us a compact impression, which is understanding.

When some piece of information or data is transferred from one representation to another, our goal is not usually fidelity. If we are seeking perfect fidelity, then we are merely copying the data, and most of the time it is hard to get 100% fidelity between two forms of representation.

As concerns understanding, we are trying to unify several disparate pieces of information to find the underlying connection between them. This is what modern AI systems do for us. They are good at relating disparate pieces of information by representing them in a unified manner.

An AI system learns from different images of cats so that it can form some kind of unified description of cats in all kinds of scenarios; when presented with a new image, it can consult the features it has learned and determine whether the newly presented image is a cat or something else.

This is akin to the way humans understand stuff. It is as if the central goal of cognition is unifying all input data, because we are constantly finding relations between disparate pieces of information we have encountered at different points in time and space, or even in mental space, in the world of imagination.

Human consciousness seems to constantly try to find relations between things in its world, as if it knew a priori that everything was connected and it was only seeking out these connections. It is no different from a jigsaw puzzle, where we know beforehand that there is a single image and our goal is to search through the pieces and find exactly where they fit as we gradually approximate the final image.

This drive to find the underlying order within apparent randomness might be at the root of human consciousness, and might be the seed goal that the network architecture of the brain is trying to achieve while existing in a world of experience. Everything else is just a sub-goal that serves this single objective function.

We can see that one of the highest levels of human endeavour, science, is a field where humans are occupied with finding the unifying factor of the world in which they are immersed.

Scientific activity mostly involves data collection, experimentation, hypothesis generation, etc., but these are just sub-level goals. No scientist gets into the field just to do data collection; that is something very mechanical and can even be automated. The scientist, no matter the field they are working in, has as their deepest motivation the desire to make some kind of discovery, which amounts to realizing the relationship between disparate parts of the systems they are studying.

Unlike the engineer, who builds things to serve a human purpose, the scientific investigator gets into things because they somehow know beforehand that there is some underlying order. Without this preconceived notion, no one would get into science.

Most of the time the preconceived notion that there is some underlying order to experience is not even conscious in the scientist, but it is the root drive that keeps them going. They collect data, pass it through their mental pattern-recognition system, and extract features at increasingly higher levels of representation. As they come across all kinds of data from different areas of their specialization, their perception of unification increases, until at a certain point they are able to form some fundamental connection between disparate-looking things, and that is a discovery.

Scientists like Einstein who make such groundbreaking discoveries of pattern integrities in nature are those who are able to extract the highest levels of representation from the data they are observing. At such high levels, even things as disparate as solid matter and pure energy turn out to be related, which is how he was able to come up with his E = mc² equation.

With the typical classification systems we have been dealing with, we can only input pieces of information that have spatial relationships, like pixels in a picture or characters in a word. In these systems, we have to treat the input data as a whole, meaning that we are only able to recognize a single picture at a time or a single sentence at a time. The information we give these systems is contiguous and whole, meaning it has a definite end, and our classification systems simply deal with all of it at once in an end-to-end fashion.

Beyond this, there are times when we are trying to obtain a representation for a sequence that is temporally related, meaning that certain things come before certain other things. This is typically where we employ RNNs to come to our rescue.
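What lets an RNN respect this ordering is a hidden state that is carried from one step to the next. Here is a bare-bones sketch of that recurrence in plain NumPy, with the input and hidden sizes chosen arbitrarily for illustration:

```python
import numpy as np

input_size, hidden_size = 4, 8  # dimensions chosen only for illustration

# Parameters of a vanilla RNN cell.
W_xh = np.random.randn(hidden_size, input_size) * 0.01   # input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous state,
    # which is how the network keeps track of what came before.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a sequence of 5 time steps, one at a time.
sequence = [np.random.randn(input_size) for _ in range(5)]
h = np.zeros(hidden_size)
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (8,) -- a summary of everything the cell has seen so far
```

The final hidden state is a compact representation of the whole sequence, which is what gets learned and reused when the network is trained.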

I have earlier talked about RNNs as generative systems, i.e. they receive input, obtain some representation of the input as they are trained and generate output that is akin to the input they have received.

As an example, if we want to build an RNN that writes like Shakespeare, we would first train it on every available work of Shakespeare, and at inference time, depending on how well we designed the network and training procedure, we would get a system that produces text that reads like Shakespeare.
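A very rough sketch of how such a character-level model could be set up is shown below, using PyTorch's built-in LSTM. The file name, hyperparameters, and number of training steps are placeholders, not a tested recipe:

```python
import torch
import torch.nn as nn

# Placeholder corpus; a real run would load the complete works of Shakespeare.
text = open("shakespeare.txt").read()
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.embed(x), hidden)
        return self.fc(out), hidden

model = CharRNN(len(chars))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training: at each position the model learns to predict the NEXT character.
seq_len = 100
encoded = torch.tensor([char_to_idx[c] for c in text])
for step in range(1000):  # number of steps chosen arbitrarily
    i = torch.randint(0, len(encoded) - seq_len - 1, (1,)).item()
    inputs = encoded[i:i + seq_len].unsqueeze(0)          # (1, seq_len)
    targets = encoded[i + 1:i + seq_len + 1].unsqueeze(0) # shifted by one character
    logits, _ = model(inputs)
    loss = loss_fn(logits.view(-1, len(chars)), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At inference time we would feed the trained model a starting character and repeatedly sample from its output distribution, generating new text one character at a time.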

This is stunningly like what we humans would typically do when we are learning by mimicking. When we are learning from another entity we would usually observe what that entity is doing, form an internal representation of the input data we are receiving and try to produce output that is indistinguishable from what we are observing.

In the situation where we are trying to learn how to write from studying the works of Shakespeare, if we simply reproduce word for word what he has already written from memory, then we have not learnt anything. We have simply memorized his works. But to learn something from Shakespeare we have to learn his style, which is a higher-level concept than the actual words he is using. This style is one of the things we can learn with a neural network, and for scenarios where we are tasked with generating text, we would usually want to use an RNN.

Now imagine that we had learnt all the great text that is out there and wanted to write a great literary piece; we would be producing text that had within it the essence of every great author.

When it comes to making projections about what will happen in the future based on what has happened in the past, we also find that RNNs are able to learn a representation of the data that occurred in the past and, using that information, generate a possible piece of information that has a high probability of occurring in the future. We could argue that we can do this with a regular deep neural network by formulating our problem as a regression problem, but regular DNNs are not good at learning temporal relationships. They just crush their input data and produce an output, while RNNs crush their data while keeping track of temporal relationships between the pieces of that data. From a bird’s eye view, these are all neural networks. We are simply modifying what elements are used and how they are connected to go from a CNN to an RNN. The basic underlying element of representation learning persists.
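To make the contrast concrete, here is a hedged sketch of the two framings for predicting the next value of a series from the last ten values. All the sizes and layer choices are illustrative: the feedforward network sees the window as one flat vector, while the RNN consumes it one step at a time and carries a hidden state between steps.

```python
import torch
import torch.nn as nn

window = 10  # number of past values used to predict the next one

# Framing 1: regression with a feedforward network. The window is flattened
# into one vector; the order of values gets no special treatment.
ffn = nn.Sequential(nn.Linear(window, 32), nn.ReLU(), nn.Linear(32, 1))

# Framing 2: an RNN that reads the window one value per time step,
# updating a hidden state that tracks what came before.
rnn = nn.RNN(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

past = torch.randn(1, window)             # one example series of 10 past values

ffn_prediction = ffn(past)                # (1, 1)
_, h = rnn(past.unsqueeze(-1))            # RNN input shape (1, 10, 1)
rnn_prediction = head(h[-1])              # predict the next value from the final state
print(ffn_prediction.shape, rnn_prediction.shape)
```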
