
Description

In this lecture, Professor Strang presents Professor Sra’s theorem, which proves the convergence of stochastic gradient descent (SGD). He then reviews backpropagation, a method that uses the chain rule to compute derivatives quickly.

Summary

Computational graph: Each step in computing \(F(x)\) from the weights
Derivative of each step + chain rule gives gradient of \(F\).
Reverse mode: Backwards from output to input
The key step to optimizing weights is backprop + stochastic gradient descent.

Related section in textbook: VII.3

Instructor: Prof. Gilbert Strang
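
The summary points can be illustrated with a small sketch. It is not taken from the lecture: it assumes a hypothetical one-neuron model \(F = w_2 \tanh(w_1 x + b_1)\) with a squared-error loss, and the names `forward`, `loss_and_grad`, `numeric_grad`, `sgd`, and `mean_loss` are made up for illustration. The forward pass records each step of the computational graph, the backward pass multiplies the derivative of each step from output to input (reverse mode), and the resulting gradient drives stochastic gradient descent on one randomly chosen sample at a time.

```python
import math
import random


def forward(w, x):
    """Compute F step by step, keeping every intermediate value of the graph."""
    w1, b1, w2 = w
    z = w1 * x + b1       # step 1: affine map
    a = math.tanh(z)      # step 2: nonlinearity
    y = w2 * a            # step 3: output F
    return y, (z, a)


def loss_and_grad(w, x, target):
    """Squared-error loss and its gradient w.r.t. (w1, b1, w2) by reverse mode."""
    w1, b1, w2 = w
    y, (z, a) = forward(w, x)
    r = y - target
    loss = 0.5 * r * r
    # Backward pass: start at the output and apply the chain rule step by step.
    dy = r                      # dL/dy
    dw2 = dy * a                # y = w2 * a
    da = dy * w2
    dz = da * (1.0 - a * a)     # d tanh(z)/dz = 1 - tanh(z)^2
    dw1 = dz * x                # z = w1 * x + b1
    db1 = dz
    return loss, [dw1, db1, dw2]


def numeric_grad(w, x, target, h=1e-6):
    """Central finite differences, used only to check the backward pass."""
    grads = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h
        wm[i] -= h
        lp, _ = loss_and_grad(wp, x, target)
        lm, _ = loss_and_grad(wm, x, target)
        grads.append((lp - lm) / (2.0 * h))
    return grads


def sgd(w, data, lr=0.1, steps=2000, seed=0):
    """Plain SGD: each step uses the gradient from one randomly chosen sample."""
    rng = random.Random(seed)
    w = list(w)
    for _ in range(steps):
        x, t = rng.choice(data)
        _, g = loss_and_grad(w, x, t)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def mean_loss(w, data):
    return sum(loss_and_grad(w, x, t)[0] for x, t in data) / len(data)


# Hypothetical training data generated from known "true" weights.
w_true = [1.5, -0.5, 2.0]
data = [(x, forward(w_true, x)[0]) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w0 = [0.5, 0.0, 1.0]

print("backprop gradient:", loss_and_grad(w0, 1.0, data[-1][1])[1])
print("numeric gradient: ", numeric_grad(w0, 1.0, data[-1][1]))
print("mean loss before SGD:", mean_loss(w0, data))
print("mean loss after SGD: ", mean_loss(sgd(w0, data), data))
```

Comparing the backward pass against finite differences is a common sanity check. Note that the backward pass costs about as much as one forward pass, whereas the finite-difference check needs two extra loss evaluations per weight.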