Introduction
Deep learning is a type of machine learning that trains computers to imitate human abilities such as speech and image recognition. For example, deep learning is the central technology behind driverless cars, enabling them to recognize stop signs, pedestrians, and traffic lights. The basic idea of deep learning is that it enables computers to recognize patterns in data observed in the past and use that knowledge to predict the future. What makes deep learning powerful is that it does not rely on any predefined rules to produce outputs. Instead, it learns hidden patterns from sample data, combines them, and builds efficient inference rules that can be applied to real data. In other words, it allows computers to automatically learn the features that characterize the data. The underlying mechanism that makes deep learning possible is the Artificial Neural Network (ANN), sometimes simply called a Neural Network (NN). ANNs are designed to imitate how humans think and learn by representing complicated concepts and relationships in data in a way that computers can understand.
Deep learning is very popular because the dynamic nature of deep learning methods presents a great opportunity for companies to leverage the power of big data, enabling them to reduce cost, increase profit, and make faster, better decisions.
There is no shortage of articles online that attempt to explain how deep learning works, but few include an example with actual numbers. In this post, I will discuss how deep learning works by focusing on Feedforward Neural Networks, with a concrete example against which readers can check their own calculations to make sure they understand it correctly. This post is the first in a series of three. A background in advanced mathematics is helpful but not required.
Feedforward Neural Networks
Artificial Neural Networks (ANNs) consist of a large number of simple elements, called neurons, which are also known as nodes, that can make simple decisions. There are many types of neural networks, such as Convolutional Neural Networks (CNNs), Modular Neural Networks (MNNs), and Feedforward Neural Networks (FNNs). They differ in the organization of their nodes and the types of problems they address. For instance, Convolutional Neural Networks use 2D convolutional layers, making that architecture well suited to processing 2D data such as images. You may read this article to learn more about the different types of neural networks. Here, however, we will focus only on FNNs because they are the easiest to understand.
A Feedforward Neural Network (FNN) is a particular type of neural network in which every node in one layer connects to the nodes in the next, and activation flows from the input layer to the output layer without back loops. The nodes are organized into layers: the collection of nodes at the same depth is said to form a layer. A node is simply a function that takes inputs and produces an output. While a single node looks simple in construct, together these nodes can provide accurate answers to complex problems, such as natural language processing, computer vision, and voice recognition. The goal of an FNN is to approximate some function f*. For example, the function f* maps an input x to a category y, where x can be images, text, or sound. A typical FNN has three types of layers. First, the input layer contains the raw data (e.g., the pixels of an image). Then, the hidden layer(s) are where the black magic happens: each layer tries to learn different aspects of the data by minimizing a cost function (see Appendix). Lastly, the output layer is the simplest, often consisting of a single node for classification problems; even so, it is still considered a layer, and it can contain multiple nodes (e.g., one per class).
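The layered flow described above can be sketched in a few lines of code. Everything here is invented for illustration: the layer sizes (3 inputs, 2 hidden nodes, 1 output), the weights, and the biases are made-up numbers, not values from any diagram in this post.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each row of `weights` holds one node's weights; each node computes
    # sigmoid(weighted sum of inputs + bias).
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, 0.1, 0.9]                                  # input layer: raw features
hidden = layer(x, [[0.2, 0.8, -0.5],
                   [0.7, -0.3, 0.1]], [0.0, 0.1])    # hidden layer: 2 nodes
output = layer(hidden, [[1.5, -2.0]], [0.3])         # output layer: 1 node
print(output)
```

Note how activation only ever flows forward: the input feeds the hidden layer, and the hidden layer feeds the output, with no loops back.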
How does a Node Work
Feedforward Neural Networks (FNNs) comprise layers of nodes, much like how the human brain is made up of neurons. The basic unit of computation in the network is the neuron, often called a node or unit. Five parts make up a node, namely input values, weights, a bias, an activation function, and an output value.
A node is basically a function that takes inputs from external sources (i.e., other nodes) and transforms them into a single output value. There is no limit on how many inputs a node can take. These inputs are real numbers that represent aspects of the data relevant to the node. In image 2, there are three input values for the node, namely X1, X2, and X3. This mechanism vaguely resembles how neurons receive signals from other neurons through synapses. For example, the inputs could be the pixels of an image to be classified, where each input corresponds to the value of a single pixel.

Each input is associated with a weight. In the above diagram, there are three weights corresponding to the three input values, namely W1, W2, and W3. Weights are real numbers that represent the relative importance of an input: a more heavily weighted input exerts more effect on the next layer of nodes. Each weight is applied to its associated input by multiplication (e.g., w1*x1). Next, each node has a bias value (the b in image 2). The bias helps to adjust the output of the activation function and provides every node with a trainable constant value.

Lastly, the node evaluates the weighted sum (z in the above diagram) of its inputs with an activation function (f(z) in the above diagram) to produce an output value (the a in image 2). The purpose of the activation function is to help determine whether the particular node should be activated ("fired") or not, based on whether the node's input is relevant for the model's prediction. Activation functions also help normalize the output of each neuron to a range such as 0 to 1 or -1 to 1. This is useful because 0 or -1 can be interpreted as false, meaning the node did not "fire," and vice versa.
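Putting the five parts together, a single node is just a few lines of arithmetic. This is a minimal sketch; the input values, weights, and bias below are made-up numbers for illustration, and sigmoid is used as the activation function.

```python
import math

def node(inputs, weights, bias):
    # Weighted sum of inputs plus bias (the z in the diagram)...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...passed through the activation function f(z), here sigmoid,
    # which squashes z into the range (0, 1).
    return 1 / (1 + math.exp(-z))

# X1, X2, X3 with weights W1, W2, W3 and a bias b (illustrative values).
a = node(inputs=[0.6, 0.2, 0.9], weights=[0.4, 0.3, 0.8], bias=-0.5)
print(a)  # the node's single output value, between 0 and 1
```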
In practice, many activation functions are available but the most common ones are Sigmoid, TanH, and ReLU because they are relatively easy to differentiate which is necessary for a learning algorithm called Backpropagation.
Popular activation functions
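The three activation functions mentioned above can be written out directly, along with their derivatives; it is the cheapness of these derivatives that makes them convenient for Backpropagation.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))   # squashes any input into (0, 1)

def tanh(z):
    return math.tanh(z)             # squashes any input into (-1, 1)

def relu(z):
    return max(0.0, z)              # 0 for negative inputs, identity otherwise

# Their derivatives, needed by Backpropagation, reuse the functions themselves:
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):
    return 1 - math.tanh(z) ** 2

def relu_prime(z):
    return 1.0 if z > 0 else 0.0
```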
Example
If the above explanation confuses you, the following example might help you better understand how a node works. In this example, we will look at a node that can recognize a Ragdoll (a breed of cat) from pictures. Be warned that neural networks for image recognition typically require far more nodes; this example is over-simplified for learning purposes.
The node has three inputs, namely X1, X2, and X3. In this case, let's say these input values came from nodes on the previous layer of the network; we can safely ignore where they came from because each node is independent of the others. Let's then assume these inputs correspond to the following ideas: X1 is a value between 0 and 1 that represents how likely the given picture is of a four-legged animal with fur, X2 is a value between 0 and 1 that represents the cuteness of the animal, and X3 is a value between 0 and 1 that represents how likely the given animal has triangular ears. These high-level ideas are called features. They are typically extracted by data scientists to help computers categorize and learn about the data. The next step is to compute the weighted sum of the input values. The formula is very simple: we multiply each input value by its corresponding weight and add the products together, producing a value known as the net value for the node. Also note that this node does not have a bias, which means b = 0. Therefore, the weighted sum for this node, given the current input values, is 0.9(2) + 0.7(1) + 1(1) = 3.5. The last step before producing an output value is to evaluate the weighted sum with the activation function. As indicated in the diagram, the activation function for this node is sigmoid. Using a calculator, we can evaluate the following equation, where 3.5 is the weighted sum that we calculated in the previous step.
As such, 0.9706 is the final output of the node. In this case, it represents how certain the node is that the given picture is a Ragdoll. Recall that activation functions help to normalize the output of the node, so this value means the node is about 97% certain that the given picture is a Ragdoll. As a side note, this node did not have a bias, but if it did, we would simply compute z = sum(weights * inputs) + bias and then evaluate the activation function on that value.
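You can reproduce the Ragdoll node's arithmetic in a couple of lines, following the same steps as the example: the weighted sum from the diagram's values, a bias of 0, and the sigmoid activation.

```python
import math

# Weighted sum from the example: 0.9(2) + 0.7(1) + 1(1), with bias b = 0.
z = 0.9 * 2 + 0.7 * 1 + 1 * 1        # = 3.5
a = 1 / (1 + math.exp(-z))           # sigmoid(3.5)
print(a)  # ≈ 0.97, matching the example's 0.9706
```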
You might be curious how the weights are assigned such that the computer knows to allocate more weight to X1 and X3 but less to X2. Being able to determine the correct weights is what enables computers to learn from data. Imagine how the weights could be adjusted to produce more accurate output if the node went through a lot of pictures (with or without a Ragdoll) and compared the output it produces to the human-generated label (whether the picture is a Ragdoll or not). If there is a match, the output is confirmed. If not, the neural network notes the error and adjusts the weights to try to give the correct answer next time. For example, it might be the case that, after training with a lot of Ragdoll pictures, the computer realizes that cuteness is more important than any other features combined; as a result, X2 would be assigned a heavier weight, perhaps something like 3. This process of finding locally optimal weights for each input can be done automatically through training with an algorithm called Backpropagation for a multi-layer neural network, a topic we cover in more depth in the next post.
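The adjust-on-error loop described above can be sketched for a single node using plain gradient descent (Backpropagation generalizes this idea to many layers). Everything here is invented for illustration: the two labeled samples, the starting weights, and the learning rate are made-up, and squared error with a sigmoid node is assumed.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

weights = [0.1, 0.1, 0.1]   # start with arbitrary weights
lr = 0.5                    # learning rate (step size)

# Each sample: (features [X1, X2, X3], human label: 1 = Ragdoll, 0 = not).
samples = [([0.9, 0.7, 1.0], 1),
           ([0.1, 0.8, 0.2], 0)]

for _ in range(1000):
    for x, y in samples:
        a = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        error = a - y       # compare the node's output to the label
        # For a sigmoid node with squared error, the slope of the error
        # w.r.t. each weight is error * a * (1 - a) * input; shift each
        # weight against that slope to reduce the error next time.
        for i in range(3):
            weights[i] -= lr * error * a * (1 - a) * x[i]

print(weights)  # the weights have shifted to separate the two samples
```

After training, the node scores the Ragdoll sample above 0.5 and the non-Ragdoll sample below it, which is exactly the "note the error and adjust" behavior described above.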
Summary
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called Artificial Neural Networks (ANNs). Neural networks are mathematical constructs that generate predictions for complex problems. The basic unit of a neural network is the neuron, and each neuron serves a specific function. Five parts make up a node, namely input values, weights, a bias, an activation function, and an output value. ANNs are composed of a large number of such nodes, organized into layers; the collection of nodes at the same depth is said to form a layer. There are three types of layers in a typical Feedforward Neural Network: input, hidden, and output. Additionally, finding the optimal weight for each input is how neural networks learn from data.
Question:
What is the meaning of ‘deep’ in deep learning?
Appendix
Cost function: In order to minimize error (i.e., false positives and false negatives), we use the concept of gradient descent. As explained here (read this if you aren't familiar with how gradient descent works), gradient descent calculates the slope of the cost function, then shifts the weights and biases along that slope toward lower loss. A popular cost function that works well is shown below, where Y_i is the expected output and O_i is the output value.
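The formula itself appears above as an image; a common cost function matching the description (Y_i expected, O_i actual output) is mean squared error, which I am assuming here as a sketch.

```python
# Mean squared error (an assumed form of the cost function shown above):
#   C = (1/n) * sum_i (Y_i - O_i)^2
# It is 0 when every output matches its label, and grows with the error.

def mse(expected, outputs):
    n = len(expected)
    return sum((y - o) ** 2 for y, o in zip(expected, outputs)) / n

print(mse([1, 0, 1], [0.97, 0.1, 0.8]))  # small, since outputs are close to labels
```

Gradient descent then computes the slope of this cost with respect to each weight and bias and nudges them in the direction that lowers it.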