Neural Networks, broadly defined, part-1

In this blog post, we will dig into the basics of a neural network. We will learn about logistic regression and a few related ideas. Let us get started.

Advantage of Random Forests

One of the best benefits of a random forest is that we do not need to normalise our data. All that matters to a random forest is the order of the values of a variable and where we split them; it never looks at the magnitudes themselves. But in neural networks we do need to normalise the data, because the many layered computations inside a network can otherwise produce enormous numbers, and it is always easier to deal with small values than with huge ones. Therefore we generally normalise the data for neural networks.

How to normalise the data ❔

We grab the mean and the standard deviation of our training data, subtract the mean and divide by the standard deviation, and that gives us a mean of zero and a standard deviation of one. We repeat the same on the validation dataset, but we use the mean and standard deviation of the training dataset to maintain uniformity. You need to make sure that anything you do to the training set, you do in exactly the same way to the test and validation sets. So in general, if you try something on a validation set or a test set and it performs much, much worse than on your training set, that is probably because you normalised in an inconsistent way.
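As a sketch of this idea, here is the normalisation applied to both sets, with random arrays standing in as hypothetical data (the shapes are illustrative assumptions, not from the post):

```python
import numpy as np

# Hypothetical arrays standing in for our datasets.
x_train = np.random.rand(1000, 784) * 255  # e.g. raw pixel values
x_valid = np.random.rand(200, 784) * 255

# Statistics come from the TRAINING set only.
mean, std = x_train.mean(), x_train.std()

x_train = (x_train - mean) / std
# The validation set is normalised with the *training* statistics,
# never with its own, so both sets live on the same scale.
x_valid = (x_valid - mean) / std
```

The key design point is that `mean` and `std` are computed once, from the training set, and reused everywhere else.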
So basically, it is much easier to create a neural network architecture that works for lots of different kinds of inputs if we know that they will consistently have mean zero and standard deviation one, i.e. we have a defined scale for our input values.

mean = x.mean()  (where x is our training dataset)
std = x.std()
x = (x - mean) / std

A disadvantage of Random Forests

Though random forests are magnificent at predicting from data, they are limited to predicting from data they have seen before. They cannot predict future data or, in simpler terms, a random forest cannot extrapolate. If your dataset contains a strong time-series component, then a random forest is likely the wrong choice. I am not saying that you cannot use a random forest at all, but it will not give you satisfactory results. This is where NEURAL NETWORKS come into the picture.

What are Neural Networks ❔

In the most basic terms, a neural network consists of a large number of layers, where each layer represents a mathematical function and the result at each layer is passed through another function known as the activation function. There are so many layers in a neural network that, looking from the top, we see a complex web of layers. It is this power of neural networks that lets them approximate almost any relationship arbitrarily closely. A lot of research has gone into this concept. In neural network terms, then, a network consists of a large number of layers of input and output neurons, where each layer represents a mathematical function followed by a non-linear function, which ultimately gives us the result. So the definition of a neural network is a mathematical function followed by an activation function, followed by a mathematical function followed by an activation function, and so on.

A broader sense of a neural network

We take an input to train our neural network.
We multiply the input by a matrix consisting of weights, followed by some activation function to remove the linearity, if it exists, from the result. To remove the linearity, we replace the negative values with 0. This idea of converting the linear function into a non-linear one is the result of many years of research in this field. Now repeat the above step many times over, and that is a neural network, broadly defined.

These are linear functions that we add together, but the more general term for things slightly more general than matrix multiplications is affine functions. So when you see "affine function", it just means a linear function, something very, very close to a matrix multiplication. And matrix multiplication is the most common kind of affine function, at least in deep learning.

There are basically two types of numbers in a neural network, broadly defined.

Parameters/weights — These are the numbers that our neural network learns. The more closely the parameters fit the training data, the more accurate the results on that data will be, but push this too far and the model, rather than learning to predict things in general, crams the training input and memorises it completely, which leads to overfitting.

Activations — These are the outputs we get from multiplying the input by the weight matrices, i.e. activations are numbers, but numbers that are calculated. Remember, activations don't only come out of matrix multiplications; they also come out of activation functions, or non-linear functions, like ReLU, softmax and sigmoid. Our input is also a type of activation, but one that is not calculated.

In simpler terms, if you're multiplying things together and adding them up, it's an affine function. It's linear. And therefore, if you put an affine function on top of an affine function, that's just another affine function. So you need to interleave them with some kind of non-linearity, and pretty much anything works, including replacing the negatives with zeros, which we call ReLU.
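The affine → ReLU → affine recipe can be sketched in a few lines of NumPy. The layer sizes and random weights here are illustrative assumptions, not values from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Replace the negatives with zeros: the non-linearity.
    return np.maximum(x, 0)

# Parameters/weights: the numbers a network would learn.
w1 = rng.normal(size=(784, 100))
w2 = rng.normal(size=(100, 10))

x = rng.normal(size=(1, 784))   # the input (also a kind of activation)

# affine -> ReLU -> affine: a tiny "deep" network.
a1 = relu(x @ w1)               # activations after the first layer
out = a1 @ w2                   # final affine layer, 10 outputs
```

Stacking more `relu(... @ w)` steps is exactly the "repeat many times over" described above; without the `relu` in between, the two matrix multiplications would collapse into a single affine function.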
So if you do affine, ReLU, affine, ReLU, affine, ReLU, you have a deep neural network.

What is the universal approximation theorem ❔

If you have big enough weight matrices and enough of them, this stack can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy (assuming that you can train the parameters, in terms of both time and data availability and so forth). This is the basis of every neural network we see today.

Let us take an example to understand it better. We want to predict the values of a and b in the equation of a line, ax + b, that represents a given set of points. We start with a random set of values for a and b, then increase or decrease their magnitudes to fit the points. Whether to increase or decrease depends solely on the derivative: at any point, the derivative of the function tells us in which direction to move the magnitudes to fit the set of points better.

Logistic Regression, a single neural network layer

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(n*n, m),
    nn.LogSoftmax(dim=1)
)

We are constructing a sequential neural network with a single linear layer followed by an activation, also known as logistic regression, the world's most straightforward neural network: no hidden layers. Logistic regression has LogSoftmax as its activation function. Let us understand each layer.

nn.Linear(n*n, m) — this is a linear layer whose input is of size n*n and whose output is of size m. For example, in the case of the MNIST dataset, where we want to predict a digit, n will be 28 and m will be 10. Often for classification problems (like MNIST digit classification), the final layer has the same number of outputs as there are classes; in that case, this is 10, one for each digit from 0 to 9.
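As a quick sanity check, we can pass a batch through this net and confirm the shapes. This is a sketch assuming the MNIST sizes n = 28 and m = 10, with a random batch standing in for real images:

```python
import torch
import torch.nn as nn

n, m = 28, 10                      # MNIST: 28x28 images, 10 digit classes
net = nn.Sequential(
    nn.Linear(n * n, m),
    nn.LogSoftmax(dim=1)
)

# A hypothetical batch of 64 flattened images.
x = torch.randn(64, n * n)
out = net(x)

print(out.shape)                   # torch.Size([64, 10])
# LogSoftmax outputs log-probabilities, so exponentiating a row
# and summing gives 1: a proper probability distribution per image.
print(out.exp().sum(dim=1)[:3])
```

One row of 10 log-probabilities per image, exactly the shape the loss function expects.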
We have one-hot encoded our dependent variable.

nn.LogSoftmax() — this is the non-linear conversion of the linear values, also known as the activation function.

Now that we have our neural network, we decide on the loss function and the optimiser we want, and lastly we fit our model.

loss = nn.NLLLoss()
metrics = [accuracy]
opt = optim.Adam(net.parameters())
fit(net, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)

[md: a ModelData object which wraps up training data, validation data, and optionally test data.
n_epochs: number of times to run over the data.
opt: the optimiser function.
metrics: your metric, as declared above.]

For a fastai learner, the library provides a very handy function, ImageClassifierData, which creates the model data for us in no time. It is declared as:

md = ImageClassifierData.from_arrays(path, (x, y), (x_valid, y_valid))

[x, y: training dataset.
x_valid, y_valid: validation dataset.]

What is a Loss function ❔

The loss function tells us how far off our predictions are. It is very similar to the information gain concept in decision trees, where we measure how much information we have gathered towards predicting the values, in terms of, say, root mean squared log error. In neural networks there are generally two types of loss functions:

Binary loss function: used when we have to predict binary values, i.e. 0 and 1.

Categorical loss function: used when we have to predict a loss over a few categories, for example whether an image is of a dog, cat, horse or elephant, or image classification on the MNIST dataset.

We generally use CrossEntropyLoss as our loss function, with softmax as the non-linear function at the end, for classification problems, and RMSE for regression problems.
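The pairing used above is worth verifying: NLLLoss applied to LogSoftmax outputs computes exactly what CrossEntropyLoss computes on the raw scores. A small sketch, with random scores and labels as illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(4, 10)          # raw outputs ("logits") for 4 samples
targets = torch.tensor([3, 7, 0, 9]) # true class indices

# Route 1: LogSoftmax followed by NLLLoss (as in the net above).
log_probs = nn.LogSoftmax(dim=1)(scores)
loss1 = nn.NLLLoss()(log_probs, targets)

# Route 2: CrossEntropyLoss does both steps at once on the raw scores.
loss2 = nn.CrossEntropyLoss()(scores, targets)

print(loss1.item(), loss2.item())    # the two values match
```

This is why the net ends in LogSoftmax and the fit uses NLLLoss: together they are the categorical cross-entropy loss.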
That is just from personal experience. The binary loss function can be written as:

-(y * np.log(p) + (1 - y) * np.log(1 - p))

where y is the set of actual values and p is the set of predicted values.

In the next part of this tutorial, I will focus on building logistic regression from scratch and will try to understand the terms used above in a better way.

Neural Networks, broadly defined, part-1 was originally published in Towards Data Science on Medium.

