矩阵求导
定义:分子布局与分母布局
令$x,y$表示标量,$\mathbf {x,y}$表示向量,$\mathbf {X,Y}$表示矩阵。
| 分子布局 | 分母布局 | |
|---|---|---|
| 标量$y\in \mathbb R$对向量$\mathbf x\in \mathbb R^M$求导 | $\frac{\partial y}{\partial\mathbf x}=[\frac{\partial y}{\partial x_1},\cdots,\frac{\partial y}{\partial x_M}]$ | $\frac{\partial y}{\partial\mathbf x}=[\frac{\partial y}{\partial x_1},\cdots,\frac{\partial y}{\partial x_M}]^\mathrm T$ |
| 向量$\mathbf y\in\mathbb R^N$对标量$x\in\mathbb R$求导 | $\frac{\partial\mathbf y}{\partial x}=[\frac{\partial y_1}{\partial x},\cdots,\frac{\partial y_N}{\partial x}]^\mathrm T$ | $\frac{\partial\mathbf y}{\partial x}=[\frac{\partial y_1}{\partial x},\cdots,\frac{\partial y_N}{\partial x}]$ |
| 向量$\mathbf y\in\mathbb R^N$对向量$\mathbf x\in\mathbb R^M$求导 | $\frac{\partial \mathbf y}{\partial \mathbf x}=\begin{bmatrix}\frac{\partial y_1}{\partial x_1}&\cdots&\frac{\partial y_1}{\partial x_M}\ \vdots&\ddots&\vdots\ \frac{\partial y_N}{\partial x_1}&\cdots&\frac{\partial y_N}{\partial x_M}\end{bmatrix}$ | $\frac{\partial \mathbf y}{\partial \mathbf x}=\begin{bmatrix}\frac{\partial y_1}{\partial x_1}&\cdots&\frac{\partial y_N}{\partial x_1}\ \vdots&\ddots&\vdots\ \frac{\partial y_1}{\partial x_M}&\cdots&\frac{\partial y_N}{\partial x_M}\end{bmatrix}$ |
| 矩阵$\mathbf Y\in\mathbb R^{M\times N}$对标量$x\in\mathbb R$求导 | $\frac{\partial \mathbf Y}{\partial x}=\begin{bmatrix}\frac{\partial y_{11}}{\partial x}&\cdots&\frac{\partial y_{1N}}{\partial x}\ \vdots&\ddots&\vdots\ \frac{\partial y_{M1}}{\partial x}&\cdots&\frac{\partial y_{NM}}{\partial x}\end{bmatrix}$ | $\frac{\partial \mathbf Y}{\partial x}=\begin{bmatrix}\frac{\partial y_{11}}{\partial x}&\cdots&\frac{\partial y_{M1}}{\partial x}\ \vdots&\ddots&\vdots\ \frac{\partial y_{1N}}{\partial x}&\cdots&\frac{\partial y_{NM}}{\partial x}\end{bmatrix}$ |
| 标量$x\in\mathbb R$对矩阵$\mathbf X\in\mathbb R^{M\times N}$求导 | $\frac{\partial y}{\partial \mathbf X}=\begin{bmatrix}\frac{\partial y}{\partial x_{11}}&\cdots&\frac{\partial y}{\partial x_{M1}}\ \vdots&\ddots&\vdots\ \frac{\partial y}{\partial x_{1N}}&\cdots&\frac{\partial y}{\partial x_{NM}}\end{bmatrix}$ | $\frac{\partial y}{\partial \mathbf X}=\begin{bmatrix}\frac{\partial y}{\partial x_{11}}&\cdots&\frac{\partial y}{\partial x_{1N}}\ \vdots&\ddots&\vdots\ \frac{\partial y}{\partial x_{M1}}&\cdots&\frac{\partial y}{\partial x_{NM}}\end{bmatrix}$ |
在本文中,向量默认是列向量,并且使用分母布局。
- 分母布局下:
- 当有矩阵时,导数的维度由分母矩阵的维度或分子矩阵的转置维度决定。
- 当有向量时,求导无所谓行/列向量,只关注维度,导数的维度是分母向量维度$\times$分子向量维度。
导数法则
-
加减法则 \(\frac{\partial \mathbf {(y+z)}}{\partial \mathbf x}=\frac{\partial \mathbf y}{\partial \mathbf x}+\frac{\partial\mathbf z}{\partial\mathbf x}\)
-
乘法法则 \(\begin{align} \frac{\partial\mathbf y^\mathrm T\mathbf z}{\partial \mathbf x}=\frac{\partial\mathbf y}{\partial\mathbf x}\mathbf z+\frac{\partial \mathbf z}{\partial\mathbf x}\mathbf y\\ \frac{\partial\mathbf y^\mathrm T\mathbf {Az}}{\partial\mathbf x}=\frac{\partial\mathbf y}{\partial\mathbf x}\mathbf {Az}+\frac{\partial \mathbf z}{\partial \mathbf x}\mathbf A^\mathrm T\mathbf y\\ \frac{\partial y\mathbf z}{\partial \mathbf x}=y\frac{\partial \mathbf z}{\partial \mathbf x}+\frac{\partial y}{\partial \mathbf x}\mathbf z^\mathrm T \end{align}\)
-
链式法则(分母布局下是左乘,跟普通的链式法则书写习惯可能不一致) \(\frac{\partial \mathbf z}{\partial x}=\frac{\partial\mathbf y}{\partial x}\frac{\partial \mathbf z}{\partial \mathbf y}\\ \frac{\partial \mathbf z}{\partial \mathbf x}=\frac{\partial\mathbf y}{\partial\mathbf x}\frac{\partial\mathbf z}{\partial\mathbf y}\\ \frac{\partial z}{\partial \mathbf X}=\Big[\frac{\partial \mathbf y}{\partial x_{ij}}\frac{\partial z}{\partial\mathbf y}\Big]\)
常见公式
\[\frac{\partial \mathbf x}{\partial \mathbf x}=\mathbf I\\ \frac{\partial \parallel\mathbf x\parallel^2}{\partial \mathbf x}=2\mathbf x\\ \frac{\partial\mathbf {Ax}}{\partial\mathbf x}=\mathbf A^\mathrm T\\ \frac{\partial\mathbf x^\mathrm T\mathbf A}{\partial \mathbf x}=\mathbf A\]按位计算的向量函数
-
对$\mathbf z=[f(x_1),\cdots,f(x_N)]^\mathrm T$,有
\[\begin{align} \frac{\partial\mathbf z}{\partial\mathbf x}=&\begin{bmatrix}\frac{\partial z_j}{\partial x_i}\end{bmatrix}\\ =&\begin{bmatrix} \frac{\partial f(x_1)}{\partial x_1}&0&\cdots&0\\ 0&\frac{\partial f(x_2)}{\partial x_2}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{\partial f(x_N)}{\partial x_N} \end{bmatrix}\\ =&\mathrm{diag}\big(f'(\mathbf x)\big) \end{align}\] -
例:$\sigma(x)=\frac{1}{1+\exp(-x)}$
- \[\sigma'(x)=\sigma(x)\big(1-\sigma(x)\big)\\ \sigma'(\mathbf x)=\mathrm{diag}\Big(\sigma(\mathbf x)\odot\big(1-\sigma(\mathbf x)\big)\Big)\]
-
例:$\mathrm{softmax}(x_k)=\frac{\exp(x_k)}{\sum_{i=1}^K\exp(x_i)}$
- \[\begin{align} \mathrm{softmax}(\mathbf x)=&\frac{1}{\sum_{k=1}^K\exp(x_k)}\begin{bmatrix}\exp(x_1)\\\vdots\\\exp(x_K)\end{bmatrix}\\ =&\frac{\exp(\mathbf x)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)} \end{align}\]
- \[\begin{align} \frac{\partial\mathrm{softmax}(\mathbf x)}{\partial \mathbf x}=&\frac{\partial\Big(\frac{\exp(\mathbf x)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)}{\partial\mathbf x}\\ =&\frac{1}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\frac{\partial\exp(\mathbf x)}{\partial\mathbf x}+\frac{\partial\Big(\frac{1}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)}{\partial\mathbf x}\big(\exp(\mathbf x)\big)^\mathrm T\\ =&\frac{\mathrm{diag}\big(\exp(\mathbf x)\big)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}-\Big(\frac{1}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)^2\frac{\partial\big(\mathbf 1_K^\mathrm T\exp(\mathbf x)\big)}{\partial\mathbf x}\big(\exp(\mathbf x)\big)^\mathrm T\\ =&\frac{\mathrm{diag}\big(\exp(\mathbf x)\big)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}-\Big(\frac{1}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)^2\mathrm{diag}\big(\exp(\mathbf x)\big)\mathbf 1_K\big(\exp(\mathbf x)\big)^\mathrm T\\ =&\frac{\mathrm{diag}\big(\exp(\mathbf x)\big)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}-\Big(\frac{1}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)^2\exp(\mathbf x)\big(\exp(\mathbf x)\big)^\mathrm T\\ =&\mathrm{diag}\Big(\frac{\exp(\mathbf x)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\Big)-\frac{\exp(\mathbf x)}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\frac{\big(\exp(\mathbf x)\big)^\mathrm T}{\mathbf 1_K^\mathrm T\exp(\mathbf x)}\\ =&\mathrm{diag}\big(\mathrm{softmax}(\mathbf x)\big)-\mathrm{softmax}(\mathbf x)\mathrm{softmax}(\mathbf x)^\mathrm T \end{align}\]