Explaining the relationship between machine learning and artificial intelligence is one of the most challenging tasks I encounter when talking to people new to these topics. I don’t pretend to have the definitive answer, but I have developed a story that seems to get enough affirmative head nods that I want to share it here.
The diagram above has appeared in many introductory books and articles that I’ve seen. I have reproduced it here to highlight the challenge of talking about “subsets” of abstract concepts, none of which have widely accepted definitions. So, what does this graphic mean or imply? How is deep learning a subset of artificial intelligence? These are the questions I’m going to try to answer in the rest of this article by telling you a story I use for briefings on artificial intelligence.
Since so many people have read about and studied examples of using deep learning for image classification, that is my starting point. I am not, however, going to talk about cats and dogs, so please hang with me for a bit longer. I’m going to use an example of facial recognition. My scenario is that there is a secure area in a building that only 4 people (Angela, Chris, Lucy and Marie) are permitted to enter. We want to use facial recognition to determine if someone attempting to gain access should be allowed in. You and I can easily look at a picture and say whether it is someone we know. But how does a deep learning model do that, and how could we use the result of the model to create an artificial intelligence application?
I frequently use the picture below to discuss how deep neural networks are trained for supervised classification. Now, when looking at the network, consider that the goal of all machine learning and deep learning is to transform input data into some meaningful output. For facial recognition, the input data is a representation of the pixel intensity and color or grey scale value from a picture, and the output is the probability that the picture is Angela, Chris, Lucy or Marie. That means we are going to have to train the network using recent photos of these four people.
A highly stylized neural network representation
The picture above is a crude simplification of how a modern convolutional neural network (ConvNet) used for image recognition would be constructed. However, it is useful for highlighting many of the important elements of what we mean by transforming raw data into meaningful outputs. For example, each line or edge drawn between the neurons of adjacent layers represents a weight (parameter) that must be calculated during training. These weights are the primary mechanism used to transform the input data into something useful. Because this picture only includes 5 layers with fewer than 10 nodes per layer, it is easy to visualize how fully connected layers can quickly increase the number of weights that must be computed. The ConvNets in widespread use today typically have from 16 to 200 or more layers, although not all fully connected for the deeper designs, and can have tens of millions to hundreds of millions of weights or more.
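To make the growth in weights concrete, here is a minimal sketch that counts the connections between fully connected layers. The layer sizes are hypothetical, chosen only to match the description of 5 layers with fewer than 10 nodes each and 4 outputs; they are not read from the picture. Bias terms are left out to keep the arithmetic simple.

```python
def count_dense_weights(layer_sizes):
    """Count the connection weights between consecutive fully connected layers."""
    # Every neuron in one layer connects to every neuron in the next layer,
    # so each pair of adjacent layers contributes (size_in * size_out) weights.
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical sizes: 5 layers, fewer than 10 nodes each, 4 output neurons.
small_network = [8, 9, 9, 6, 4]
print(count_dense_weights(small_network))  # 231

# Widening only the hidden layers (again, made-up sizes) makes the count jump.
wide_network = [8, 900, 900, 600, 4]
print(count_dense_weights(wide_network))  # 1359600
```

Multiplying the hidden layer widths by 100 pushes the count from a few hundred to more than a million, which is why fully connected designs become expensive so quickly.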
We need that many weights to “meaningfully” transform the input data, since the image is broken down into many small regions of pixels (typically 3x3 or 5x5) before getting ingested by the input layer. The numerical representation of the pixel values is then transformed by the weights so that the output of the transformation indicates whether this region of pixels adds to the evidence that this is a picture of Angela or reduces the likelihood that this is Angela. If Angela has black hair and the network does not detect many regions of solid black color, then there will not be much evidence that this picture is Angela.
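As a rough illustration of how weights turn a small pixel region into a piece of evidence, the sketch below computes a weighted sum over a 3x3 region, which is a single step of a convolution. The pixel values and the weights are invented for this example; in a real ConvNet the weights are learned during training, and there are many such filters rather than one.

```python
def region_response(region, weights):
    """Weighted sum of a small pixel region: one step of a convolution."""
    return sum(
        pixel * weight
        for pixel_row, weight_row in zip(region, weights)
        for pixel, weight in zip(pixel_row, weight_row)
    )


# Hypothetical grey scale values in [0, 1], where 0 is black and 1 is white.
dark_region = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]
light_region = [[0.9, 1.0, 0.9], [1.0, 0.9, 1.0], [0.9, 1.0, 0.9]]

# Hypothetical weights: negative values penalize bright pixels, so darker
# regions produce a higher (less negative) response.
dark_detector = [[-1.0, -1.0, -1.0] for _ in range(3)]

dark_score = region_response(dark_region, dark_detector)
light_score = region_response(light_region, dark_detector)
print(dark_score > light_score)  # True
```

With these made-up weights, a mostly black region scores higher than a mostly white one, which is the kind of signal that would count toward "this could be someone with black hair".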
Finally, I want to tie everything discussed so far to an explanation of the output layer. In the picture above, there are 4 neurons in the output layer, and that is why I set up my facial recognition story to have 4 people that I am trying to recognize. During training I have a set of pictures that have been labeled with the correct name. One way to look at how I might do that is like this:
Table 1 - Representation of labeled training data
The goal during training is to come up with a single set of weights that will transform the data from every picture in the training data set into a set of four values (a vector) for each picture, where the values match the labels assigned above as closely as possible. For Picture1 the first value is 1 and the other three are zeros, and for Picture2 the first 3 values are zero and the fourth value is 1. For Picture1, we are telling the model that we are 100% sure (probability = 1) that this is a picture of Angela and certain that it is not Chris, Lucy, or Marie (probability = 0). The training process tries to find a set of weights that will transform the pixel data for Picture1 into the vector (1,0,0,0) and Picture2 into the vector (0,0,0,1), and so on for the entire data set.
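The label vectors described here are often called one-hot vectors. The sketch below builds them for Picture1 and Picture2 and measures how far a model output is from its label. The squared error used here is a simplifying assumption for illustration; the article does not say which measure is used, and classification networks are commonly trained with cross-entropy instead. The helper names are hypothetical.

```python
NAMES = ["Angela", "Chris", "Lucy", "Marie"]


def one_hot(name):
    """Return a 4-value label vector with 1 in the position of the named person."""
    return [1.0 if candidate == name else 0.0 for candidate in NAMES]


def squared_error(predicted, target):
    """Sum of squared differences; 0 means the output matches the label exactly."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


# Labels as described in the text: Picture1 is Angela, Picture2 is Marie.
labels = {"Picture1": one_hot("Angela"), "Picture2": one_hot("Marie")}
print(labels["Picture1"])  # [1.0, 0.0, 0.0, 0.0]
print(labels["Picture2"])  # [0.0, 0.0, 0.0, 1.0]

# A hypothetical model output for Picture1; training adjusts the weights
# to push this error toward zero across the whole training set.
model_output = [0.5, 0.25, 0.15, 0.1]
print(squared_error(model_output, labels["Picture1"]) > 0)  # True
```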
Of course, no deep learning model training algorithm can do that perfectly because of variations in the data, so we try to get as close as possible for each input image. The process of testing a model with known data or processing new unlabeled images is called inferencing. When we pass in unlabeled data, we get back a list of four probabilities that reflect the evidence in the data that the image is one of the four known people. For example, we might get something back like (.5, .25, .15, .1). For most classification algorithms the set of probabilities will add to 1. What does this result tell us?
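One common reason the four values add to 1 is that the output layer applies a softmax function to the raw scores from the network. That is an assumption about the model in this story, not something the article specifies, and the raw scores below are made up. A minimal sketch:

```python
import math


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    # Subtracting the maximum score avoids overflow in exp() without
    # changing the resulting probabilities.
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical raw scores for Angela, Chris, Lucy and Marie.
raw_scores = [2.0, 1.3, 0.8, 0.4]
probabilities = softmax(raw_scores)
print(probabilities)
print(abs(sum(probabilities) - 1.0) < 1e-9)  # True
```

The largest raw score always maps to the largest probability, so the ranking of the four people is preserved.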
Our model says we are most confident that the unlabeled picture is Angela, since that is the outcome with the highest probability, but it also tells us that we can only be 50% sure that it is not one of the other three people. What does it mean if we get an inference result back that is (.25, .25, .25, .25)? This result tells us the model can’t do better than a random process like picking a number between 1 and 4. This picture could be any one of our known people, or it could be a picture of a truck. The model provides us no information. How intelligent is that? This is where the connection with artificial intelligence gets interesting.
What we would like to achieve is getting back inference predictions where one value is very close to 1 and all the others are very close to zero. Then we have high confidence that the person requesting access to a restricted area is one of our authorized employees. That is rarely the case, so we must deal with uncertainty in the applications that use our trained machine learning models. If the area that we are securing is the executive dining room, then perhaps we want to open the door even if we are only 50% sure that the person requesting access is one of our known people. If the application is securing access to sensitive computer and communication equipment, then perhaps we want to set a threshold of 90% certainty before we unlock the door. The important point is that machine learning alone is usually not sufficient to build an intelligent application. Therefore, the feared scenario in which machines get smarter than people, and so are able to make “better” decisions, is still a long way off, maybe a very long way…
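The threshold idea can be sketched as a small piece of application logic wrapped around the model output. The function name and the threshold values are illustrative assumptions based on the dining room and equipment room examples, not a prescribed design.

```python
NAMES = ["Angela", "Chris", "Lucy", "Marie"]


def decide_access(probabilities, threshold):
    """Unlock only if the most likely person meets the confidence threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] >= threshold:
        return True, NAMES[best_index]
    # Not confident enough: keep the door locked and name no one.
    return False, None


prediction = [0.5, 0.25, 0.15, 0.1]

# Executive dining room: a 50% threshold is enough to open the door.
print(decide_access(prediction, 0.5))  # (True, 'Angela')

# Sensitive equipment room: a 90% threshold keeps the door locked.
print(decide_access(prediction, 0.9))  # (False, None)
```

The same model output leads to different actions depending on the threshold, and choosing that threshold is a decision made by people about risk, not something the trained model supplies.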