SGD, the ancient optimizer
Have you ever wondered why SGD, an ancient optimizer, is still in use today despite the existence of more powerful optimizers such as Adam, RMSprop, and AdaGrad? In this article, I’ll explain why.
Before I provide the answer, though, I’ll briefly go over Gradient Descent; if you already know this, just scroll down to the end for the answer.
Before continuing, I should note that this article is based on my domain knowledge and may be inaccurate in some ways, but its goal is simply to provide a fun fact.
SGD and things you should know
You should be familiar with the term ‘Gradient Descent’, an iterative optimization algorithm commonly used in machine learning and other optimization problems. Based on Gradient Descent, there are three different versions.
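To make the idea concrete, here is a minimal sketch of a single Gradient Descent update, assuming a toy linear model y = w * x + b trained with mean-squared-error loss; the names (w, b, lr) and the helper function are my own illustration, not taken from any particular library.

```python
import numpy as np

# One gradient descent step on a toy linear model y = w * x + b
# with mean-squared-error loss. Names and data are illustrative only.
def gradient_step(w, b, x, y, lr=0.01):
    y_pred = w * x + b                  # model prediction
    error = y_pred - y                  # prediction error
    grad_w = 2 * np.mean(error * x)     # dLoss/dw
    grad_b = 2 * np.mean(error)         # dLoss/db
    # Move the parameters a small step against the gradient
    return w - lr * grad_w, b - lr * grad_b
```

The three versions below differ only in how much of the data is used to compute that gradient before each update.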
Batch Gradient Descent
Batch gradient descent sums the error for each point in the training set and updates the model only after all training examples have been evaluated. This process is referred to as a training epoch.
This means that if we have lots of data points, each epoch will require a lot of time and memory, which is a huge disadvantage for large datasets. In addition, we can’t use this optimizer for online learning, because it only updates after seeing the entire dataset.
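As a rough sketch (same toy linear model as above, names are illustrative), a Batch GD loop looks like this: every epoch runs over the entire dataset before a single parameter update is made.

```python
import numpy as np

# Hypothetical Batch GD loop: each epoch touches ALL data points,
# and only then is one parameter update performed.
def batch_gd(x, y, epochs=100, lr=0.01):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        y_pred = w * x + b
        error = y_pred - y
        grad_w = 2 * np.mean(error * x)   # gradient averaged over the full dataset
        grad_b = 2 * np.mean(error)
        w -= lr * grad_w                  # one update per epoch
        b -= lr * grad_b
    return w, b
```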
Stochastic Gradient Descent
In contrast to Batch GD, Stochastic GD updates the model’s parameters one training example at a time: within each epoch, it performs a separate update for every example in the dataset.
To help you understand, let me give an example comparing Batch GD and Stochastic GD. Assume we have 100 data points. In Batch GD, we would feed all 100 data points into an epoch, update the model once, and repeat epoch after epoch until the loss is as low as possible.
Stochastic GD, on the other hand, updates the model after the first data point, then after the second, and so on up to the 100th data point, all within a single epoch.
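A hypothetical Stochastic GD loop for the same toy model might look like the sketch below: with 100 data points, each epoch performs 100 separate parameter updates, one per example.

```python
import numpy as np

# Hypothetical Stochastic GD loop: one parameter update per data point.
def stochastic_gd(x, y, epochs=100, lr=0.01):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        for i in np.random.permutation(n):   # shuffle, then update per point
            error = (w * x[i] + b) - y[i]
            w -= lr * 2 * error * x[i]
            b -= lr * 2 * error
    return w, b
```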
Both of these optimizers are still costly in different ways: Batch GD needs a lot of time and memory per update, while Stochastic GD’s one-example updates are noisy and slow to work through a large dataset. That is why we have a third one.
Mini-Batch Gradient Descent
Mini-batch gradient descent combines both Batch GD and Stochastic GD. It splits the training dataset into small batch sizes and performs updates on each of those batches. This approach strikes a balance between the computational efficiency of Batch GD and the speed of Stochastic GD.
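Here is a rough sketch of Mini-Batch GD under the same assumptions as before: the dataset is shuffled, split into small batches (say 32 points each), and one update is made per batch.

```python
import numpy as np

# Hypothetical Mini-Batch GD loop: one parameter update per small batch.
def minibatch_gd(x, y, epochs=100, lr=0.01, batch_size=32):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # indices of the current batch
            error = (w * x[batch] + b) - y[batch]
            w -= lr * 2 * np.mean(error * x[batch])
            b -= lr * 2 * np.mean(error)
    return w, b
```

The batch size is the knob that trades off between the two extremes: batch_size = n gives Batch GD, batch_size = 1 gives Stochastic GD.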
Note that when papers or people say ‘SGD’, they frequently mean Mini-Batch GD; the two terms are often used interchangeably.
But SGD was mainly useful in the past and is mostly associated with older models; today, many optimizers perform a great deal better than SGD, and you could say that SGD deserves to be mentioned only in books and articles.
So why is SGD still used? One important reason is to compare a new model (one we develop today) against an old model; this process is known as A/B testing and measures whether the new one is really better than the old one. If we trained the new model with a different optimizer, we couldn’t be sure whether it performs well because of the model itself or simply because of the stronger optimizer.
Finally, thank you for reading. If you enjoyed it, please click “like”, and if you spot any mistakes, please let me know.