Stochastic Gradient Descent (SGD) in Machine Learning

In machine learning, we have a training set, and we train the model by passing over that training set multiple times. Each pass is known as a training cycle or epoch.

Training a model is very compute-intensive.

In large-scale production models, there can be thousands of features, and training data sets often contain billions or even hundreds of billions of examples. A batch is the set of examples used to compute the gradient in a single iteration, so a very large batch may cause even one iteration to take a very long time to compute.

Furthermore, a large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
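The diminishing return from larger batches can be illustrated with a small experiment. The following is only a sketch under stated assumptions: the data is synthetic, the model is a linear regression with mean squared error, and the helper names (mse_gradient, batch_gradient_error) are made up for this example. It measures how far a gradient computed on a random batch is from the gradient computed on the full data set.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def batch_gradient_error(w, X, y, batch_size, rng, trials=200):
    """Mean distance between a random-batch gradient and the full gradient."""
    full = mse_gradient(w, X, y)
    errors = []
    for _ in range(trials):
        # Draw a fresh random batch for every trial.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(mse_gradient(w, X[idx], y[idx]) - full))
    return float(np.mean(errors))

# Synthetic data (an assumption for illustration, not from the article).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)
w = np.zeros(5)

errors = {b: batch_gradient_error(w, X, y, b, rng) for b in (1, 10, 100, 1000)}
for b, e in errors.items():
    print(f"batch size {b:>5}: mean gradient error {e:.3f}")
```

In a setup like this, the error tends to shrink roughly with the square root of the batch size, so each tenfold increase in compute buys a smaller improvement in the gradient estimate.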

Reducing the number of examples per iteration:

Computing the gradient on a small data sample often works well, as long as every step draws a new random sample from the training set. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
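A minimal sketch of SGD with a batch size of 1 might look like the following. It assumes a linear regression model trained with squared error on synthetic data. The function name sgd_linear_regression, the learning rate, and the step count are illustrative choices, not values taken from the article.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, steps=5000, seed=0):
    """Fit y ~ X @ w with SGD, using one randomly chosen example per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # "Stochastic": pick a single training example at random.
        i = rng.integers(len(y))
        residual = X[i] @ w - y[i]
        # Gradient of the squared error on this one example.
        gradient = 2.0 * residual * X[i]
        w -= learning_rate * gradient
    return w

# Synthetic data (an assumption for illustration).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + data_rng.normal(scale=0.1, size=1000)

learned_w = sgd_linear_regression(X, y)
print("true weights:   ", true_w)
print("learned weights:", np.round(learned_w, 2))
```

Each step touches only one example, so it is cheap, but the individual updates are noisy. The weights here end up close to the true values only because many such steps are taken.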