Problem Statement

Most machine learning and deep learning algorithms involve some form of optimization: adjusting $x$ to minimize a loss function $f(x)$.

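Mini-batch stochastic gradient descent is the first optimization method introduced in Dive into Deep Learning. Section 3.2, on implementing linear regression from scratch, asks the reader to write an sgd function that optimizes the parameters. These notes record my understanding of gradient descent.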

Gradient Descent

Gradient descent comes in three main variants.

Batch Gradient Descent (BGD)

First, assume a loss function over $m$ training samples:

$\begin{equation} J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2 \end{equation}$

where $h_{\theta}(x)$ is the predicted value, given by:

$\begin{equation} h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n \end{equation}$
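To make the two definitions concrete, here is a minimal sketch in plain Python. The function names `predict` and `loss` and the toy data are assumptions made for illustration; they are not taken from the book.

```python
def predict(theta, x):
    """h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def loss(theta, X, y):
    """J(theta) = 1/2 * sum over samples of (h_theta(x) - y)^2."""
    return 0.5 * sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))


# Toy data (made up for illustration), generated by y = 1 + 2 * x_1
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]

print(loss([1.0, 2.0], X, y))  # 0.0, the true parameters fit exactly
print(loss([0.0, 0.0], X, y))  # 17.5 = 0.5 * (1 + 9 + 25)
```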

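From calculus, the directional derivative of a function is largest along its gradient, so the function decreases fastest along the negative gradient. Gradient descent therefore updates the weights by stepping against the gradient. For a convex loss such as the squared loss above, and with a suitably small step size, this converges to the global optimum. For a single sample $(x, y)$, the update of each parameter $\theta_j$ is: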

$\begin{equation} \begin{aligned} \theta_j'&=\theta_j-\alpha \frac{\partial}{\partial\theta_j}J(\theta)\\ &= \theta_j-\alpha(h_\theta(x)-y)x_j \end{aligned} \end{equation}$

where $\alpha$ is the step size of each update against the gradient, i.e. the hyperparameter known as the learning rate.
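The step from the first line of the update to the second is the chain rule. As a sketch for one sample, writing $x_0=1$ (a convention assumed here so that the intercept $\theta_0$ fits the same formula):

$\begin{equation} \begin{aligned} \frac{\partial}{\partial\theta_j}\frac{1}{2}(h_\theta(x)-y)^2 &= (h_\theta(x)-y)\frac{\partial}{\partial\theta_j}h_\theta(x)\\ &= (h_\theta(x)-y)\frac{\partial}{\partial\theta_j}\sum_{k=0}^{n}\theta_kx_k\\ &= (h_\theta(x)-y)x_j \end{aligned} \end{equation}$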

The formula above is the update for a single sample. Over all $m$ samples, the per-sample gradients are averaged:

$\begin{equation} \theta_j'=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{equation}$

BGD uses every sample for each parameter update, so each step becomes expensive when the dataset is very large.
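The averaged update can be sketched in plain Python as follows. The helper names `predict` and `bgd_step`, the toy data, and the learning rate are assumptions chosen for illustration.

```python
def predict(theta, x):
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def bgd_step(theta, X, y, alpha):
    """One BGD step: average the gradient over all m samples, then update."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(theta, xi) - yi
        grad[0] += err  # x_0 = 1 pairs with the intercept theta_0
        for j, xij in enumerate(xi, start=1):
            grad[j] += err * xij
    return [t - alpha * g / m for t, g in zip(theta, grad)]


# Toy data (made up for illustration), generated by y = 1 + 2 * x_1
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = bgd_step(theta, X, y, alpha=0.1)
print(theta)  # close to [1.0, 2.0]
```

Every call to `bgd_step` loops over the whole dataset, which is exactly the cost that grows with the number of samples.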

Stochastic Gradient Descent (SGD)

Stochastic gradient descent was proposed to address the way BGD slows down as the number of samples grows. In each iteration, we draw a sample index $i\in \{1,2,3,...,m\}$ uniformly at random and use the gradient of that single sample's loss to update the parameter $\theta_j$:

$\begin{equation} \theta_j'=\theta_j-\alpha(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{equation}$

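Because the index is sampled uniformly, the stochastic gradient is an unbiased estimate of the full gradient: its expectation equals the full gradient, so on average it is a good estimate.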
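The unbiasedness can be written out in one line. Since each index $i$ is drawn with probability $\frac{1}{m}$, the expectation of the single-sample gradient is the averaged gradient used by BGD:

$\begin{equation} \mathbb{E}_i\left[(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\right]=\sum_{i=1}^m\frac{1}{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{equation}$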

The trade-off is that SGD is noisier than BGD: not every iteration moves toward the optimum, but the overall trend does, and the iterates usually end up near the global optimum.
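A minimal sketch of the single-sample update in plain Python follows. The names `predict` and `sgd_step`, the toy data, the seed, and the learning rate are all assumptions made for illustration. The toy data is noise-free, so the iterates settle very close to the true parameters; with noisy data they would keep fluctuating around the optimum.

```python
import random


def predict(theta, x):
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def sgd_step(theta, x, y, alpha):
    """Update theta using the gradient of a single sample (x, y)."""
    err = predict(theta, x) - y
    x_ext = [1.0] + list(x)  # x_0 = 1 pairs with the intercept theta_0
    return [t - alpha * err * xj for t, xj in zip(theta, x_ext)]


# Toy data (made up for illustration), generated by y = 1 + 2 * x_1
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta = [0.0, 0.0]
for _ in range(5000):
    i = rng.randrange(len(X))  # uniformly sampled index
    theta = sgd_step(theta, X[i], y[i], alpha=0.05)
print(theta)  # should end up near [1.0, 2.0]
```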

Mini-batch Stochastic Gradient Descent (MBGD)

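This method is a compromise between the two above: each iteration draws a random mini-batch $\mathcal{B}$ of batch_size samples instead of a single sample:

$\begin{equation} \theta_j'=\theta_j-\alpha\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{equation}$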

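This keeps each update cheap, so training stays fast, while averaging over the batch reduces the noise in the gradient estimate, so the final parameters remain accurate.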

Example code:

import torch

# Assumes data_generation, model, squared_loss, features, labels, w, b,
# lr, batch_size and epochs are defined earlier (not shown in these notes).
for epoch in range(epochs):
    # mini-batch stochastic gradient descent
    for X, y in data_generation(batch_size, features, labels):
        l = squared_loss(model(X, w, b), y)  # per-sample loss as a function of w, b
        l.sum().backward()  # gradient of the loss summed over the mini-batch
        with torch.no_grad():  # parameter updates must not be tracked by autograd
            for param in [w, b]:
                param -= lr * param.grad / batch_size  # average over the batch
                param.grad.zero_()  # clear the accumulated gradient
    with torch.no_grad():
        train_l = squared_loss(model(features, w, b), labels)
        print(f"epoch {epoch + 1}, loss {float(train_l.mean()):f}")
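The loop above relies on a batch generator that is not shown in these notes. As a rough sketch of what such a helper might look like, here is a plain-Python version working on lists rather than tensors. The name `minibatches` and its `rng` parameter are hypothetical and are not the book's implementation.

```python
import random


def minibatches(batch_size, features, labels, rng=random):
    """Yield (X, y) mini-batches that cover a shuffled pass over the data once."""
    indices = list(range(len(features)))
    rng.shuffle(indices)  # random order, so each batch is a random subset
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [features[i] for i in batch], [labels[i] for i in batch]


# Toy data (made up for illustration)
features = [[float(i)] for i in range(10)]
labels = [2.0 * i + 1.0 for i in range(10)]

sizes = [len(X) for X, _ in minibatches(4, features, labels, random.Random(0))]
print(sizes)  # [4, 4, 2]
```

Note that the last batch can be smaller than batch_size. If the generator used with the training loop behaves the same way, dividing by batch_size slightly shrinks the step taken on that final batch.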