Neural Networks | Fundamentals

Here is an article in which I will try to highlight some basics and essential concepts relating to artificial neural networks.

Source: Liu Zishan

Definition

An artificial neural network (ANN) is a series of algorithms that aims to recognize underlying relationships in a set of data through a process loosely inspired by the way the human brain operates. Such a system “learns” to perform tasks by analysing examples, generally without being programmed with task-specific rules.

Global Architecture

Neural networks are organized in layers:

- Input layer: the input layer neurons receive the information supposed to explain the problem to be analyzed;
- Hidden layer: the hidden layer is an intermediate layer that allows neural networks to model nonlinear phenomena. It is said to be “hidden” because it has no direct contact with the outside world. The outputs of each hidden layer are the inputs of the units of the following layer;
- Output layer: the output layer is the last layer of the network; it produces the result, the prediction.

Perceptron

The perceptron is the first and simplest neural network model, a supervised learning algorithm invented in 1957 by Frank Rosenblatt, a notable psychologist in the field of artificial intelligence.

This network is said to be simple because it has only two layers: an input layer and an output layer. This structure involves only one matrix of weights, and every unit of the input layer is connected to every unit of the output layer.

The perceptron is a linear classifier for binary predictions; in other words, it can separate the data into two categories.

Figure: 3D representation of linearly separable data

Perceptron operation

First of all, the simple perceptron takes n input values (x1, x2, …, xn) and is also defined by n+1 constants:

- n synaptic coefficients (or weights: w1, w2, …, wn);
- A bias: a unit whose output is always equal to 1.
Like the other inputs, the bias is connected to the neurons of the following layer through a weight, which plays the role of the threshold.

Each input value is then multiplied by its respective weight (wixi), and these products are added together to obtain a weighted sum. The neuron then generates one of two possible values, depending on whether the sum is lower or higher than the threshold θ.

The weighted sum can be written as the dot product of two vectors, w (weights) and x (inputs), where w⋅x = ∑wixi. The inequality w⋅x > θ can then be rearranged by moving θ (threshold) to the other side: w⋅x + b > 0, with b = -θ.

Completing all these steps produces the following architecture:

Figure: schematic representation of the simple perceptron

Once the weighted sum is obtained, an activation function must be applied. A simple perceptron uses the Heaviside step function to convert the resulting value into a binary output, classifying the input values as 0 or 1.

Figure: graphical representation of the Heaviside step function

The Heaviside step function is particularly useful in classification tasks where the input data is linearly separable.

Multi-Layer Perceptron

The Multi-Layer Perceptron (MLP) consists of an input layer, an output layer and one or more hidden layers. It therefore no longer relies on a single matrix of weights like the simple perceptron, but on several, one between each pair of consecutive layers. If the MLP has n layers, then it has n-1 weight matrices.

Theoretically, adding a sufficient number of neurons to the hidden layer may be enough to approximate any continuous non-linear function.

Multi-Layer Perceptron architecture:

Figure: graphical representation of the Multi-Layer Perceptron

A neural network generates a prediction after passing the inputs through all layers, up to the output layer. This process is called forward propagation.

Neural networks work the same way as a perceptron.
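
The perceptron operation and forward propagation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights and biases are hypothetical values chosen by hand (not learned), the step function is taken to return 1 when its input is exactly 0, and the logical AND and XOR tasks are examples introduced here, not taken from the article.

```python
def heaviside(z):
    """Heaviside step function (assumed convention: returns 1 when z >= 0)."""
    return 1 if z >= 0 else 0


def perceptron(x, w, b):
    """Simple perceptron: weighted sum w.x plus bias b, then a step activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return heaviside(z)


def forward(x, layers):
    """Forward propagation: pass x through each (weights, biases) layer in turn."""
    a = list(x)
    for weights, biases in layers:
        a = [perceptron(a, row, b) for row, b in zip(weights, biases)]
    return a


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical hand-picked parameters: threshold 1.5, so b = -1.5 (logical AND).
w_and, b_and = [1.0, 1.0], -1.5
and_outputs = [perceptron(x, w_and, b_and) for x in inputs]

# A 3-layer MLP (input, hidden, output) has 2 weight matrices.
# Hidden units compute OR and NAND; the output unit combines them with AND,
# which yields XOR, a problem that is not linearly separable.
xor_layers = [
    ([[1.0, 1.0], [-1.0, -1.0]], [-0.5, 1.5]),  # hidden layer: OR, NAND
    ([[1.0, 1.0]], [-1.5]),                      # output layer: AND
]
xor_outputs = [forward(x, xor_layers)[0] for x in inputs]

print(and_outputs)  # [0, 0, 0, 1]
print(xor_outputs)  # [0, 1, 1, 0]
```

The single perceptron is enough for AND because the two classes can be separated by a straight line, while XOR needs the hidden layer.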
So, in order to make sense of neural networks, the perceptron must be understood!

Activation Functions

The activation function, also known as the transfer function, is an essential component of the neural network. In addition to introducing non-linearity into the network, it converts the signal entering a unit (neuron) into an output signal (response).

The function’s name comes from its biological equivalent, the “action potential”: the excitation threshold which, once reached, results in a neuron response.

It is worth noting that, thanks to the bias, the activation function curve can be shifted left or right along the input axis, which gives the network more flexibility to learn.

There are two types of activation functions: linear and non-linear.

Linear Activation Function

This is a simple function of the form f(x) = x. The input passes to the output without modification; more generally, the output is simply proportional to the input.

Figure: graphical representation of a linear function

Non-Linear Activation Functions

Non-linear functions are the most commonly used and make it possible to separate data that is not linearly separable. A non-linear equation governs the correspondence between inputs and outputs.

Main non-linear functions:

Sigmoid:

The sigmoid function (or logistic function) is an “S”-shaped curve that generates an output between 0 and 1, which can be interpreted as a probability.

Sigmoid function definition: σ(z) = 1 / (1 + e^(-z))

Figure: graphical representation of the sigmoid function

Being a “smoother” version of the Heaviside step function, it is preferred to it, but it is not free of defects. Indeed, the sigmoid function is not zero-centered: its outputs are always positive, even for negative inputs. Moreover, for inputs of large magnitude the result is very close to 0 or 1, where the curve is almost flat and the gradient almost zero, which leads to the saturation of some neurons.
Finally, the exponential function makes it relatively expensive to compute.

Hyperbolic Tangent (TanH):

The hyperbolic tangent is a sigmoidal function like the previous one; however, it usually produces better results than the logistic function thanks to its symmetry. The difference is that the output of the TanH function lies between -1 and 1. It is generally preferred to the sigmoid function because it is zero-centered. This function is well suited to multilayer perceptrons, especially for hidden layers.

Hyperbolic tangent definition: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Figure: graphical representation of the TanH function

Apart from this aspect, the TanH function shares the same drawbacks as the sigmoid function.

Rectified Linear Unit (ReLU):

The ReLU function helps to solve the saturation problem of the above functions. It is the most frequently used.

ReLU function definition: f(z) = max(0, z)

Figure: graphical representation of the ReLU function

If the input is negative, the output is 0, whereas if it is positive, the output is z. This activation function significantly speeds up the convergence of the network and does not saturate for positive inputs.

However, the ReLU function is not perfect either. A neuron whose input is negative outputs 0 and has a zero gradient; if this keeps happening, its weights are no longer updated and the neuron stops learning.

Why is this activation function needed?

Without a non-linear activation function, an artificial neural network, no matter how many layers it has, will behave like a simple perceptron, because composing linear layers only results in another linear function.

Which function should be used?

Is it a regression or a classification problem? In the latter case, is it a binary classification? Ultimately, there is no single best activation function; it depends on the task to be handled.

Cost Function

To learn, the perceptron must know that it has made a mistake, as well as the answer it should have given. This is supervised learning.
To do this, it is necessary to use a cost function whose purpose is to compute the error, in other words, to quantify the gap between the prediction y_hat and the expected value y. The cost function must therefore be minimized until an optimum is reached: this is neural network training.

To define the cost function J, the mean squared error can be used:

Mean squared error (MSE): J = (1/m) ∑ (y_hat - y)², summed over the m training examples

where:

- m is the number of training examples;
- y is the expected value;
- y_hat is the predicted value.

Once the comparison between the prediction and the expected value is made, the information must be returned to the neural network, so it makes the return trip to the synapses and updates the weights. This is neither more nor less than the reverse path of the forward propagation mentioned earlier. It is called backpropagation.

As stated above, the purpose of a Machine Learning algorithm is to find a combination of weights that minimizes the cost function.

Further reading:

- Understanding Activation Functions in Neural Networks

Gradient descent

With more weights, and thus a higher-dimensional parameter space, a problem arises: the curse of dimensionality. A naive approach like brute force can no longer be used. A viable method is therefore needed to minimize the cost function: gradient descent, one of the most popular optimization algorithms.

Let’s assume that the cost function J is convex:

Figure: three-dimensional representation of a convex function

The horizontal axes represent the parameter space (weight and bias), while the cost function J is an error surface above the horizontal axes. The blue circle is the initial cost value. All that remains is to go down, but which is the best way to go from here?

To answer that question, some parameters have to be changed, namely the weights and the bias. This involves the gradient of the cost function: the gradient vector points in the direction of steepest ascent, so the steepest way down is the opposite direction.
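
To make the cost function and its gradient concrete, here is a minimal sketch for a single linear neuron y_hat = w*x + b. It assumes the mean squared error in the form J = (1/m) ∑ (y_hat - y)²; the toy data set and the function names are hypothetical and only serve the illustration.

```python
def predict(w, b, x):
    """Prediction of a single linear neuron (no activation function)."""
    return w * x + b


def mse(w, b, xs, ys):
    """Cost J: mean of the squared gaps between predictions and expected values."""
    m = len(xs)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / m


def mse_gradient(w, b, xs, ys):
    """Partial derivatives of J with respect to the weight w and the bias b."""
    m = len(xs)
    dw = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / m
    db = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / m
    return dw, db


# Hypothetical toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

cost = mse(0.0, 0.0, xs, ys)
dw, db = mse_gradient(0.0, 0.0, xs, ys)
print(cost, dw, db)  # 21.0 -17.0 -8.0
```

Both partial derivatives are negative at (w, b) = (0, 0), so the cost decreases when w and b are increased, which is consistent with the data having been generated with w = 2 and b = 1.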
It is important to know that the input values are fixed, so the weights (and bias) are the only variables that can be adjusted.

Now, imagine that a ball is dropped inside a rounded bucket (the convex function): it just has to reach the bottom. This is optimization. In the case of gradient descent, the ball has to move left or right in order to optimize its position.

Starting from an initial position, look at the tilt angle in order to draw the tangent at this point: this means computing a derivative. If the slope is negative, the ball goes to the right; if it is positive, it goes to the left.

But something important is missing, all the more so because it is a hyperparameter: the learning rate (α). The slope indicates the direction to take, but it does not tell how far the ball should go in that direction. This is the role of the learning rate, which determines the size of each step towards a minimum.

Putting everything together, gradient descent can be defined as follows:

Canonical formula for gradient descent: θ := θ - α ∇J(θ)

where:

- θ is the model’s parameters (vector of weights);
- ∇J(θ) is the gradient of the cost function J, in other words, the vector containing each of its partial derivatives. It is obtained by differentiating the function once;
- α is the learning rate (step size), set before the learning process.

Choosing the right value for the learning rate is important because it affects the learning speed on the one hand, and the chances of finding the local optimum (convergence) on the other. This value, if badly chosen, may contribute to two of the main causes of poor performance of predictive models:

- Overfitting: the algorithm fits the training data set well, too well in fact, and that is a problem because it can no longer generalize to new data. With a high learning rate, more distance can be covered at every step, but there is a risk of overshooting the lowest point because the slope is constantly changing.
Simply put, the loss function fluctuates around the minimum and may even diverge;
- Underfitting: setting a very low learning rate makes it possible to move confidently in the direction of the negative gradient. A low α is more precise, but many more steps are needed, which leads to painfully slow convergence.

Figure: left, α too small; middle, decent α; right, α too big.

With a good learning rate and after a few iterations, an appropriate minimum should be found; the ball can then no longer go down.

Finally, the best weights to optimize the network are determined.

This is what gradient descent is all about: knowing each weight’s contribution to the total error of the network and thus converging towards an optimized configuration of weights.

Stochastic gradient descent

There remains, however, a problem: gradient descent is only guaranteed to reach the global minimum if the cost function is convex. In other words, the curve lies entirely above each of its tangents and the derivative of such a function is increasing over its interval. As illustrated earlier, it takes this form: ∪. But what about a function that is not convex?

This time, let’s assume that the cost function J is non-convex:

Figure: three-dimensional representation of a non-convex function

In a case like this, it is no longer enough to follow the steepest slope. The error surface becomes visually more complex to understand and has specific features such as local minima and possible saddle points.

The risk is therefore that some iterative algorithms get stuck, drastically slowing down training, and settle on a position that is not the smallest overall value (global minimum).

To overcome this problem, it is possible to use stochastic gradient descent (SGD).

Mathematical formula for SGD: θ := θ - α ∇J(θ; x(i), y(i)), where (x(i), y(i)) is a single training example

Despite appearances, this one is faster.
It introduces more fluctuation in the weights, which can increase the chances of reaching the global minimum instead of stopping at a local minimum.

As a matter of fact, it is not necessary to load the entire data set in memory and test it all before adjusting the weights only at the end. Stochastic gradient descent adjusts them after each observation, which makes each update computationally lighter. Furthermore, standard (or batch) gradient descent is deterministic: if the starting conditions (weights) are always the same, then the result will always be the same as well.

Further reading:

- AI Notes: Parameter optimization in neural networks - deeplearning.ai
- Understanding the Mathematics behind Gradient Descent.

Neural Networks Training

1. Initialize the weights with values close to (but different from) 0;
2. Send the first observation into the input layer, with one variable per neuron;
3. Forward propagation: neurons are activated in a manner that depends on their assigned weights. Propagate the activations until the prediction y_hat is obtained;
4. Compare the prediction with the expected value and measure the error with the cost function;
5. Backpropagation: the error is propagated back through the network. Update the weights according to their responsibility for the error, with the learning rate determining the size of the update;
6. Repeat steps 2 to 5 and adjust the weights after each observation, or after a batch of observations (batch learning);
7. When the whole data set has passed through the neural network, this is called an epoch. Repeat for more epochs.

Further reading:

- AI Notes: Initializing neural networks - deeplearning.ai

Key Takeaways

- The perceptron is the first and simplest artificial neural network model.
ANNs work the same way;
- A neural network consists of an input layer, one or more hidden layers, and finally an output layer;
- The input values are fixed; the synaptic coefficients (weights) and the bias are the only parameters that can be adjusted;
- The more layers a neural network has, the deeper it is, but multiplying them can be counterproductive;
- There are other types of artificial neural networks, such as convolutional neural networks (CNN or ConvNet) and recurrent neural networks (RNN), to name a few;
- A cost function is used to quantify the gap between the prediction and the expected value. It must be minimized by finding an optimized combination of weights (neural network training);
- Gradient descent is an optimization algorithm and by far the most common way to train neural networks;
- Knowing the direction and the size of each step in that direction (learning rate) is the key to performing gradient descent. The latter has to be set carefully;
- Overfitting and underfitting are two of the main reasons for poor performance of predictive models.

Further resources:

Courses

- deeplearning.ai, Andrew Ng’s introductory deep learning course;
- CS231n: Convolutional Neural Networks for Visual Recognition, Stanford’s deep learning course;
- fast.ai, a hands-on project-based course.

Reading

- Deep Learning, an online version of the book widely considered the “Bible” of Deep Learning, authored by Ian Goodfellow, Yoshua Bengio, and Aaron Courville;
- Neural Networks and Deep Learning, a free, clear and accessible textbook by Michael Nielsen;
- Deep Learning Papers Reading Roadmap, a compilation of key papers organized by chronology and research area.

Neural Networks | Fundamentals was originally published in Towards Data Science on Medium.
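
As a closing illustration, the training steps listed under Neural Networks Training can be sketched with stochastic gradient descent on the smallest possible model, a single linear neuron y_hat = w*x + b. This is a hedged sketch rather than a reference implementation: the data set, the function name `train_sgd`, the learning rate, the number of epochs and the initialization range are all assumptions made for the example, and with a single neuron "backpropagation" reduces to one application of the chain rule.

```python
import random


def train_sgd(xs, ys, alpha=0.05, epochs=200, seed=0):
    """Train y_hat = w*x + b by updating the weights after each observation."""
    rng = random.Random(seed)
    # Step 1: initialize the weights with small values different from 0.
    w = rng.uniform(0.01, 0.1)
    b = rng.uniform(0.01, 0.1)
    data = list(zip(xs, ys))
    for _ in range(epochs):       # step 7: one full pass over the data is an epoch
        rng.shuffle(data)         # visit the observations in random order
        for x, y in data:         # steps 2 and 6: one observation at a time
            y_hat = w * x + b     # step 3: forward propagation
            error = y_hat - y     # step 4: gap between prediction and expected value
            # Step 5: gradient of the squared error (y_hat - y)^2 for this
            # observation, then an update scaled by the learning rate alpha.
            w -= alpha * 2 * error * x
            b -= alpha * 2 * error
    return w, b


# Hypothetical toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = train_sgd(xs, ys)
print(w, b)  # values close to 2 and 1
```

Fixing the seed makes the run reproducible; without it, the random visiting order is what makes the stochastic variant non-deterministic, in contrast with batch gradient descent.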
