
Description

Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems.

Summary

Full gradient descent uses all of the data in each step.
The stochastic method uses a minibatch of data (often just 1 sample!).
Each step is much faster, and the descent starts well.
Later the points bounce around, and it is time to stop!
This method is the favorite for training the weights in deep learning.

Related section in textbook: VI.5

Instructor: Prof. Suvrit Sra
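
The contrast the summary draws between full gradient descent and the minibatch method can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the 1-D least-squares problem, the synthetic data, and the names `make_data`, `gradient`, `full_gradient_descent`, and `sgd` are all assumptions chosen for the example, as are the step size and the seeds.

```python
import random


def make_data(n=200, true_w=3.0, noise=0.1, seed=0):
    """Hypothetical synthetic data: y = true_w * x + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def gradient(w, batch):
    """Gradient in w of the mean of 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def full_gradient_descent(data, w0=0.0, lr=0.5, steps=100):
    """Every step uses all of the data."""
    w = w0
    for _ in range(steps):
        w -= lr * gradient(w, data)
    return w


def sgd(data, w0=0.0, lr=0.5, steps=300, batch_size=1, seed=1):
    """Every step uses a random minibatch (batch_size=1 is a single sample)."""
    rng = random.Random(seed)
    w = w0
    path = [w]  # record every iterate so the late-stage bouncing is visible
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        w -= lr * gradient(w, batch)
        path.append(w)
    return path


if __name__ == "__main__":
    data = make_data()
    print("full gradient descent:", full_gradient_descent(data))
    path = sgd(data)
    print("SGD after 20 steps:   ", path[20])
    print("SGD last 5 iterates:  ", path[-5:])
```

With a constant step size, the early SGD iterates move quickly toward the true slope, while the last few keep fluctuating instead of settling. This is the behavior the summary describes as the points bouncing around. Each SGD step here touches one sample rather than all 200, which is where the per-step speedup comes from.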