How to tune hyperparameters for better neural network performance

With an example

SangGyu An

By now, you know that the MLP is a flexible model with many tunable parameters. In the previous post, we talked about adjusting those parameters to perform different analyses. In this post, we'll cover general approaches to tuning hyperparameters for better performance.

How to choose the number of hidden layers

One of the hyperparameters that change the fundamental structure of a neural network is the number of hidden layers, and we can divide the choice into three situations: zero, one or two, and many.

First, you won't need any hidden layers if the data set is linearly separable. In fact, you don't need a neural network at all if all you need is a linear boundary, since neural networks exist to solve problems that are too complex for linear models.

Second, if the data set isn't linearly separable, you need a hidden layer. Normally, a single hidden layer is sufficient, because the improvement you gain by adding more layers is rarely significant compared to the additional work they create. So in many practical settings, one or two hidden layers do the job.

Lastly, if you are trying to solve a complex problem such as object classification, you need multiple hidden layers, each applying a different transformation to its input. We will talk more deeply about this in future posts.

Summary of the number of hidden layers
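To make the three situations concrete, here is a minimal sketch of each kind of model. I'm assuming Keras and a hypothetical binary-classification problem with 10 input features; the layer sizes are illustrative, not prescriptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 10  # hypothetical input size

# Zero hidden layers: a purely linear decision boundary,
# enough when the data is linearly separable.
linear_model = keras.Sequential([
    layers.Dense(1, activation="sigmoid", input_shape=(n_features,)),
])

# One hidden layer: sufficient for most non-linear problems.
one_hidden = keras.Sequential([
    layers.Dense(6, activation="relu", input_shape=(n_features,)),
    layers.Dense(1, activation="sigmoid"),
])

# Many hidden layers: reserved for genuinely complex tasks
# such as object classification.
deep_model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```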

How to choose the number of neurons

The next thing to choose is the number of neurons in the hidden layer. Finding an appropriate number is critical because too few neurons can lead to underfitting, whereas too many can lead to overfitting plus longer training times. Empirically, it's best to use a number between the input and output sizes, and the exact number depends on how complex your problem is.

If the problem is simple and the input-output relationship is clear, then about ⅔ of the input size is a good starting point. But if the relationship is complex, the number can range from the input size up to just under twice the input size.

Summary of the number of neurons in a hidden layer
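To make these rules of thumb concrete, here is a tiny helper that returns the starting points just described. The cutoffs are my own reading of the heuristics above, not hard limits.

```python
def suggest_hidden_units(n_inputs: int, n_outputs: int,
                         complex_problem: bool = False):
    """Heuristic starting points for the number of hidden neurons."""
    if complex_problem:
        # Complex relationship: anywhere from the input size
        # up to just under twice the input size.
        return range(n_inputs, 2 * n_inputs)
    # Simple relationship: about two-thirds of the input size,
    # never below the output size.
    return max(n_outputs, (2 * n_inputs) // 3)

print(suggest_hidden_units(10, 1))        # -> 6
print(suggest_hidden_units(10, 1, True))  # -> range(10, 20)
```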

These heuristics seem vague, but unfortunately, there is no definite answer you can follow: neural networks are still an active research area, and every hyperparameter is specific to the problem at hand. So treat these numbers as starting points and expect some trial and error to find the values that work best for your specific problem.

How to choose the batch size, learning rate, and number of epochs

Lastly, we will look at hyperparameters that are related to training time and performance.

When the batch size increases, each batch naturally becomes more similar to the full data set because it contains more observations, which means the batches will not differ much from one another. The gradient noise therefore decreases, so it's logical to use a large learning rate for faster training. In contrast, when we use a small batch size, the noise increases, so we use a small learning rate to offset it. So which batch size should we use? This is still an open research question, but we can learn from others' experience.

Empirically, large batch sizes have been shown to hurt generalization. In contrast, with a small batch size, the noise helps the network escape local minima and often leads to higher accuracy; such a network also tends to converge to a reasonable solution faster than one trained with a large batch size. So, in general, a batch size of 32 is a good starting point, but the right number depends on your sample size, the complexity of the problem, and your computational environment, so running a grid search can also be appropriate.

For the learning rate, we usually start with 0.1 or 0.01, or we can run a grid search from 0.1 down to 1e-5. When the learning rate is small, you need more iterations to reach a minimum, and thus more epochs. But how many more?

The number of epochs needed for convergence varies with the problem and the random initialization, so there is no magic number that works in every situation. In practice, we often set the number of epochs high and use early stopping, so that the network stops training once the improvement from updating its weights no longer passes a threshold.

Summary of learning rate, epoch, batch size
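As a concrete illustration of the early-stopping setup described above, here is a minimal, runnable sketch. I'm assuming Keras; the toy data, the SGD optimizer, and the patience of 10 epochs are illustrative choices, not the only ones.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)).astype("float32")
y = rng.integers(0, 2, size=300)

model = keras.Sequential([
    layers.Dense(6, activation="relu", input_shape=(10,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy")

# Stop once 10 epochs pass without improvement and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

model.fit(X, y, validation_split=0.2,
          epochs=1000,     # set high; early stopping decides when to quit
          batch_size=32,   # a common starting point
          callbacks=[early_stop])
```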

Tuning hyperparameters example

Using these as starting points, let's revisit the algorithm that tells me which songs I would like. Back in July, I used a logistic classifier with the lasso and achieved 47.8% precision with 31.4% recall, for an F1 score of 0.379. Let's see how much we can improve on that with a neural network.

First, we need to choose how the hidden layer is structured. For this problem, one hidden layer should be sufficient: the problem isn't linearly separable, but it also isn't as complex as, say, computer vision. Six neurons is a good starting point, since 10 features enter the network and six is roughly two-thirds of the input size. For the output layer, I only need one neuron with the logistic (sigmoid) activation function, since this is a binary classification problem.

Code for the basic structure / Basic structure of the MLP
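The original code appears as an image; a minimal sketch of the structure just described might look like this in Keras. The ReLU activation in the hidden layer is my assumption, since the post doesn't state it, and the learning_rate argument anticipates the tuning in the next step.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(learning_rate=0.1):
    """One hidden layer of 6 neurons for the 10 input features,
    plus a single sigmoid output neuron for binary classification."""
    model = keras.Sequential([
        layers.Dense(6, activation="relu", input_shape=(10,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="binary_crossentropy")
    return model
```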

Now that I have the basic structure of the neural network, I need to tune the hyperparameters. I could start with the general starting point of a 0.1 learning rate, but it's better to try a grid of values and select the best one. So I used GridSearchCV from the sklearn package to test which combination of learning rate (0.1, 0.01, 1e-3, 1e-4, or 1e-5) and batch size (10, 20, 40, or 60) best suits this problem. For the epochs, instead of a fixed number, I used early stopping, which halts training when there has been no improvement for 10 iterations and restores the best model.

Code for grid search CV and early stopping
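Here is a sketch of how that grid search might be wired up, reusing the build_mlp helper and the toy X, y arrays from the sketches above. I'm assuming the legacy KerasClassifier wrapper that TensorFlow shipped at the time this post was written; newer code would use the SciKeras package instead.

```python
from sklearn.model_selection import GridSearchCV
from tensorflow import keras
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# Parameters named after build_mlp arguments are routed to the builder;
# batch_size is routed to model.fit.
param_grid = {
    "learning_rate": [0.1, 0.01, 1e-3, 1e-4, 1e-5],
    "batch_size": [10, 20, 40, 60],
}

clf = KerasClassifier(build_fn=build_mlp, epochs=1000, verbose=0)
grid = GridSearchCV(clf, param_grid, scoring="f1", cv=3)
grid.fit(X, y, validation_split=0.2, callbacks=[early_stop])
print(grid.best_params_)
```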

One thing that deserves a closer look is the loss function. Since I'm dealing with an imbalanced data set, cross-entropy isn't appropriate; it makes more sense to optimize the F1 score, which puts more weight on what I care about: true positives. Unfortunately, I can't use the F1 score directly, since it isn't differentiable, so it has to be modified into a differentiable form. The sklearn package does not provide this, so I borrowed code written by Michal Haltuf on Kaggle.

Code for f1_loss written by Michal Haltuf
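Haltuf's actual kernel is linked in the references; the sketch below shows the general idea of a "soft" F1 loss in the same spirit, computing the F1 score from predicted probabilities instead of hard labels so that it stays differentiable.

```python
import tensorflow.keras.backend as K

def f1_loss(y_true, y_pred):
    """Differentiable (soft) F1 loss: lower is better."""
    y_true = K.cast(y_true, "float32")
    # Soft counts: probabilities stand in for hard 0/1 decisions.
    tp = K.sum(y_true * y_pred)
    fp = K.sum((1 - y_true) * y_pred)
    fn = K.sum(y_true * (1 - y_pred))
    soft_f1 = 2 * tp / (2 * tp + fp + fn + K.epsilon())
    # Minimize 1 - F1 so that improving F1 lowers the loss.
    return 1 - soft_f1
```

Because the counts are computed per batch, very small batches make this estimate noisy, which is one more reason the batch size matters here.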
Best hyperparameters

As you can see from the result, the best learning rate is 0.1 with a batch size of 60. Using the functionality provided by sklearn, I can then extract the training history of the model with those hyperparameters.

Training history of a learning rate of 0.1 with a batch size of 60

As shown above, precision increases while recall and the F1 loss decrease. When I use this model for prediction, it detects 8 true positives and 3 false positives in the test set, so about 73% of the predicted positives are true positives. Compared to last time's roughly 48%, I see this as an improvement. And looking at the F1 score, this model scores 0.192 higher than the logistic model.

Test result of the best mlp network

Reflection

One thing to keep in mind is that this result changes every time I run the model.

Trial 1
Trial 2

This is because the weight initialization and the optimization process involve some randomness. One solution is to repeat the experiment several times and compute statistics over those results.

Code for 30 repetitions / average statistics of the 30 repetitions
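A sketch of that repetition loop, reusing build_mlp, f1_loss, and early_stop from the sketches above, and assuming hypothetical X_train/y_train, X_val/y_val, and X_test/y_test splits:

```python
import numpy as np
from tensorflow import keras
from sklearn.metrics import precision_score, recall_score, f1_score

precisions, recalls, f1s = [], [], []
for _ in range(30):
    # Rebuilding the model gives fresh random weights on every trial.
    model = build_mlp(learning_rate=0.1)  # best hyperparameters found above
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
                  loss=f1_loss)           # train on the differentiable F1 loss
    model.fit(X_train, y_train, epochs=1000, batch_size=60,
              validation_data=(X_val, y_val),
              callbacks=[early_stop], verbose=0)
    y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
    precisions.append(precision_score(y_test, y_pred))
    recalls.append(recall_score(y_test, y_pred))
    f1s.append(f1_score(y_test, y_pred))

print(f"precision {np.mean(precisions):.3f} | "
      f"recall {np.mean(recalls):.3f} | f1 {np.mean(f1s):.3f}")
```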

After repeating the run 30 times, the statistics show that precision and recall average about 52.3% and 69.4%, respectively. This gives an F1 score of about 0.596, roughly a 0.217 improvement over the logistic classifier.

So the neural network does show some improvement, but why can't we get an even higher F1 score?

I believe one of the reasons is the small sample size. Compared to other neural network projects, 300 observations is an extremely small data set, and out of it I had to carve training, validation, and test sets, so the network had little opportunity to learn. However, it isn't always possible to collect a large amount of data by myself. In that case, I can take a network built by others and apply transfer learning, which will be the topic of the next post.

Reference

[1] Bengio, Y. (2012, September 16). Practical recommendations for gradient-based training of deep architectures. arXiv.org. Retrieved November 12, 2021, from https://arxiv.org/abs/1206.5533.

[2] Brownlee, J. (2019, August 6). How to configure the learning rate when training deep learning neural networks. Machine Learning Mastery. Retrieved November 12, 2021, from https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/.

[3] Dernoncourt, F. (2016, September 22). What is the trade-off between batch size and number of iterations to train a neural network? Cross Validated. Retrieved November 12, 2021, from https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu.

[4] doug. (2010, August 2). How to choose the number of hidden layers and nodes in a feedforward neural network? Cross Validated. Retrieved November 12, 2021, from https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw.

[5] Haltuf, M. (2018, October 19). Best loss function for F1-score metric. Kaggle. Retrieved November 12, 2021, from https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric.

[6] Heaton, J. (2009). Introduction to neural networks with Java. Heaton Research.

[7] Keim, R. (2020, January 31). How many hidden layers and hidden nodes does a neural network need? — technical articles. All About Circuits. Retrieved November 12, 2021, from https://www.allaboutcircuits.com/technical-articles/how-many-hidden-layers-and-hidden-nodes-does-a-neural-network-need/.

[8] Michaus, M. (2017, November 5). Visualizing learning rate vs batch size. Learning on Machine Learning — My 2 cents. Retrieved November 12, 2021, from https://miguel-data-sc.github.io/2017-11-05-first/.

[9] Stewart, M. (2019, July 9). Simple guide to hyperparameter tuning in Neural Networks. Medium. Retrieved November 12, 2021, from https://towardsdatascience.com/simple-guide-to-hyperparameter-tuning-in-neural-networks-3fe03dad8594.
