Softmax function and its derivative
December 9, 2025
It is easy to find an explanation of the derivative of the softmax function for a single sample with $n$ features, but an explanation for multiple samples with $n$ features each is much harder to find. Here you will find that derivative and its vectorized version, which makes its computation efficient.
$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}}
\Rightarrow \cdots \Rightarrow
\frac{\mathrm{d} \mathbf{\Sigma}}{\mathrm{d} \mathbf{Z}}$$

from autograd import jacobian, numpy as np
from platform import python_version

python_version()

We are going to use autograd to compare our from-scratch implementation against the automatic differentiation implementation.
Mean absolute percentage error
# This cell imports numpy_mape
# if you are running this notebook locally
# or from Google Colab.
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.numpy_metrics import np_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess
    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/numpy_metrics.py'
    local_file = 'numpy_metrics.py'
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from numpy_metrics import np_mape as mape  # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)

M: int = 1000      # number of samples
CLASSES: int = 5   # number of classes

Z = np.random.randint(-30, 31, (M, CLASSES)) / 2
Z.shape, Z.dtype

((1000, 5), dtype('float64'))
For a single sample with $n_{1}$ features,

$$\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$$

We will represent $\text{softmax}$ as $\sigma$. Then

$$\sigma(\mathbf{z})_{q} =
\frac{\exp (z_{q})}
{\sum_{k=1}^{n_{1}} \exp (z_{k})}
\in \mathbb{R}^{+}$$

where $n_{1}$ is the number of classes, and

$$\sigma(\mathbf{z}) = \begin{bmatrix}
\sigma(\mathbf{z})_1 & \sigma(\mathbf{z})_2 & \cdots & \sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}
\in \mathbb{R}^{1 \times n_{1}}$$

from scipy.special import softmax

soft_out_1 = softmax(Z[:1])
soft_out_1, soft_out_1.shape

(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01,
         8.39176173e-13]]),
 (1, 5))
# our softmax function
def my_softmax_1(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp)

my_soft_out_1 = my_softmax_1(Z[:1])
my_soft_out_1, my_soft_out_1.shape

(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01,
         8.39176173e-13]]),
 (1, 5))
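The comparisons below rely on np_mape from tools/numpy_metrics.py. If that module is unavailable, a minimal stand-in might look like the following sketch (this assumes the usual mean absolute percentage error in percent; the real np_mape may handle edge cases differently):

def mape_sketch(pred: np.ndarray, target: np.ndarray) -> float:
    # mean absolute percentage error, in percent; assumes target has no zeros
    return float(np.mean(np.abs((pred - target) / target)) * 100)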
mape(
    my_soft_out_1,
    soft_out_1
)

For an input with $m$ samples and $n_{1}$ features,

$$\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$$

$$\mathbf{Z} = \begin{bmatrix}
\mathbf{z}_{1,:} \\
\mathbf{z}_{2,:} \\
\vdots \\
\mathbf{z}_{m,:}
\end{bmatrix}$$

then the softmax over each sample is

$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix}
\sigma(\mathbf{z}_{1,:}) \\
\sigma(\mathbf{z}_{2,:}) \\
\vdots \\
\sigma(\mathbf{z}_{m,:})
\end{bmatrix}
\in \mathbb{R}^{m \times n_{1}}$$

Note: We use $\mathbf{\Sigma}(\mathbf{Z})$ to denote that it is a matrix.
soft_out_2 = softmax(Z, axis=1)
soft_out_2.shape

def my_softmax_2(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp, axis=1, keepdims=True)

my_soft_out_2 = my_softmax_2(Z)
my_soft_out_2.shape

mape(
    my_soft_out_2,
    soft_out_2
)

derivative of a softmax with respect to a feature

n_selected: int = 3  # select a feature to differentiate with respect to

def my_softmax_0(z: np.ndarray, j_feature: int) -> float:
    exp = np.exp(z)
    return exp[0, j_feature] / np.sum(exp)

grad_0 = jacobian(my_softmax_0, 0)(
    Z[:1],
    n_selected
)
grad_0

array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02,
         1.94245121e-01, -6.17738318e-13]])
$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} \in \mathbb{R}^{1 \times n_{1}}$$

because $\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$ and $\sigma(\mathbf{z})_{q} \in \mathbb{R}$.

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\begin{bmatrix}
\frac{\partial \sigma_{q}}{\partial z_{1}} &
\frac{\partial \sigma_{q}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{q}}{\partial z_{n_{1}}}
\end{bmatrix}$$

There are two different types of derivatives:

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}}
\qquad \text{and} \qquad
\frac{\partial \sigma_{q}}{\partial z_{r\neq q}}$$

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}} =
\sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q})$$

$$\frac{\partial \sigma_{q}}{\partial z_{r\neq q}} =
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}$$

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\begin{bmatrix}
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{1} &
\cdots &
\sigma(\mathbf{z})_{q}(1 - \sigma(\mathbf{z})_{q}) &
\cdots &
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}$$

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\sigma(\mathbf{z})_{q}
\begin{bmatrix}
-\sigma(\mathbf{z})_{1} &
\cdots &
1 - \sigma(\mathbf{z})_{q} &
\cdots &
-\sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}$$

def my_der_softmax_0(z: np.ndarray, j_feature: int) -> np.ndarray:
    soft = my_softmax_1(z)
    soft_j = soft[0, j_feature]
    soft *= -1
    soft[0, j_feature] += 1
    return soft_j * soft

my_grad_0 = my_der_softmax_0(Z[:1], n_selected)
my_grad_0

array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02,
         1.94245121e-01, -6.17738318e-13]])
mape(
    my_grad_0,
    grad_0
)

derivative of a softmax with respect to a sample

grad_1 = jacobian(my_softmax_1, 0)(Z[:1])
grad_1.shape

$$\frac{\partial \sigma}{\partial \mathbf{z}} \in
\mathbb{R}^{(1 \times n_{1}) \times (1 \times n_{1})}$$

To simplify the derivative, we will ignore the axes of size 1 for now:

$$\frac{\partial \sigma}{\partial \mathbf{z}} \in
\mathbb{R}^{n_{1} \times n_{1}}$$
$$\begin{align*}
\frac{\partial \sigma}{\partial \mathbf{z}} &=
\begin{bmatrix}
\frac{\partial \sigma_{1}}{\partial \mathbf{z}} \\
\frac{\partial \sigma_{2}}{\partial \mathbf{z}} \\
\vdots \\
\frac{\partial \sigma_{n_{1}}}{\partial \mathbf{z}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{1}}{\partial z_{1}} &
\frac{\partial \sigma_{1}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{1}}{\partial z_{n_{1}}} \\
\frac{\partial \sigma_{2}}{\partial z_{1}} &
\frac{\partial \sigma_{2}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{2}}{\partial z_{n_{1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \sigma_{n_{1}}}{\partial z_{1}} &
\frac{\partial \sigma_{n_{1}}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{n_{1}}}{\partial z_{n_{1}}}
\end{bmatrix}
\end{align*}$$
$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r=q}} &=
\sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q}) \\
\frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &=
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}
\end{align*}$$
$$\frac{\partial \sigma}{\partial \mathbf{z}} =
\begin{bmatrix}
\sigma(\mathbf{z})_{1} (1 - \sigma(\mathbf{z})_{1}) &
-\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{2} &
\cdots &
-\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{n_{1}} \\
-\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{1} &
\sigma(\mathbf{z})_{2} (1 - \sigma(\mathbf{z})_{2}) &
\cdots &
-\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
-\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{1} &
-\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{2} &
\cdots &
\sigma(\mathbf{z})_{n_{1}} (1 - \sigma(\mathbf{z})_{n_{1}})
\end{bmatrix}$$

$$\frac{\partial \sigma}{\partial \mathbf{z}} =
\text{diag}(\sigma(\mathbf{z})) - \sigma(\mathbf{z})^\top \sigma(\mathbf{z})$$
def my_der_softmax_1(z: np.ndarray) -> np.ndarray:
    soft = my_softmax_1(z).squeeze()  # squeeze is necessary for np.diag and np.outer to work
    return np.diag(soft) - np.outer(soft, soft)

my_grad_1 = my_der_softmax_1(Z[:1])
my_grad_1.shape

mape(
    my_grad_1,
    grad_1.squeeze()
)

derivative of multiple softmaxes with respect to multiple samples

$\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$, where $m$ is the number of samples.
$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix}
\sigma(\mathbf{z}_{1,:}) \\
\sigma(\mathbf{z}_{2,:}) \\
\vdots \\
\sigma(\mathbf{z}_{m,:})
\end{bmatrix} \in \mathbb{R}^{m \times n_{1}}$$

$$\mathbf{z}_{i,:} = \begin{bmatrix}
z_{i1} & z_{i2} & \cdots & z_{in_{1}}
\end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

for all $i = 1, \ldots, m$.

$$\sigma(\mathbf{z}_{i,:}) = \begin{bmatrix}
\sigma(\mathbf{z}_{i,:})_{1} &
\sigma(\mathbf{z}_{i,:})_{2} &
\cdots &
\sigma(\mathbf{z}_{i,:})_{n_{1}}
\end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

grad_2 = jacobian(my_softmax_2, 0)(Z)
grad_2.shape

$$\frac{\mathrm{d} {\color{Cyan} {\mathbf{\Sigma}}}}
{\mathrm{d} {\color{Orange} {\mathbf{Z}}}} \in
\mathbb{R}^{{\color{Cyan} {(m \times n_{1})}}
\times {\color{Orange} {(m \times n_{1})}}}$$

$$\frac{\partial {\color{Cyan} {\mathbf{\Sigma}_{pq}}}}
{\partial {\color{Orange} {\mathbf{Z}_{ij}}}} \in
\mathbb{R}^{{\color{Cyan} {(1 \times 1)}}
\times {\color{Orange} {(1 \times 1)}}}$$

Therefore, this elementwise derivative is

$$\frac{\partial \mathbf{\Sigma}_{pq}}
{\partial \mathbf{Z}_{ij}} =
\begin{cases}
\sigma(\mathbf{Z})_{pq}(1 - \sigma(\mathbf{Z})_{ij}) & \text{if } p=i, q=j \\
-\sigma(\mathbf{Z})_{pq} \sigma(\mathbf{Z})_{ij} & \text{if } p=i, q\neq j \\
0 & \text{if } p\neq i
\end{cases}$$

for all $p, i = 1, \ldots, m$ and $q, j = 1, \ldots, n_{1}$.

Note: the first two cases look similar to $\frac{\partial \sigma_{q}}{\partial \mathbf{z}}$.
def my_der_softmax_low_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    soft = my_softmax_2(z)
    der = np.zeros((m, classes, m, classes), dtype=soft.dtype)
    for i in range(m):
        for q in range(classes):
            for j in range(classes):
                if q == j:
                    der[i, q, i, j] = soft[i, q] * (1 - soft[i, q])
                else:
                    der[i, q, i, j] = -soft[i, q] * soft[i, j]
    return der

my_grad_low_2 = my_der_softmax_low_2(Z)
my_grad_low_2.shape

mape(
    my_grad_low_2,
    grad_2
)

This solution is too slow: it runs $\Theta(m n_{1}^{2})$ Python-level loop iterations. But we can observe some similarities between this derivative and a previous one:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{p,:}}
\approx \frac{\partial \sigma}
{\partial \mathbf{z}}$$

$$\frac{\partial {\color{Cyan} {\mathbf{\Sigma}_{p,:}}}}
{\partial {\color{Orange} {\mathbf{Z}_{i,:}}}} \in
\mathbb{R}^{{\color{Cyan} {(1 \times n_{1})}}
\times {\color{Orange} {(1 \times n_{1})}}}$$

Remark: yes, here we do need the axes of size 1.
There are two cases:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i=p,:}}
\qquad \text{and} \qquad
\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i\neq p,:}}$$

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i=p,:}} =
\text{diag}(\sigma(\mathbf{Z}_{p,:}))
- \sigma(\mathbf{Z}_{p,:})^\top \sigma(\mathbf{Z}_{p,:})$$

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i \neq p,:}} = \mathbf{0}$$
def my_der_softmax_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    der = np.zeros((m, classes, m, classes), dtype=z.dtype)
    for i in range(m):
        der[i, :, i, :] = my_der_softmax_1(z[np.newaxis, i])
    return der

my_grad_2 = my_der_softmax_2(Z)
my_grad_2.shape

mape(
    my_grad_2,
    grad_2
)

Gradient using loss function

We can often use properties of the loss function to optimize our gradients.
For any loss function

$$L: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}$$

we can compute the derivative using the chain rule

$$\frac{\partial L}{\partial z_{pq}} =
\sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{ij}}
\frac{\partial \sigma_{ij}}{\partial z_{pq}}$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.
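As a numerical illustration of this chain rule (not a cell from the original notebook), the double sum can be evaluated directly against the full Jacobian grad_2 computed earlier. The loss here is the hypothetical $L = \sum_{i,j} \Sigma_{ij}$, whose upstream gradient $\partial L / \partial \mathbf{\Sigma}$ is a matrix of ones:

# hypothetical upstream gradient dL/dSigma for L = sum of all entries of Sigma
ones_grad = np.ones((M, CLASSES))

# chain rule: dL/dZ_{pq} = sum_{i,j} (dL/dSigma_{ij}) * (dSigma_{ij}/dZ_{pq})
dL_dZ = np.einsum('ij,ijpq->pq', ones_grad, grad_2)

# each softmax row sums to 1, so this particular L is constant and its gradient is ~0
np.abs(dL_dZ).max()

The result being numerically zero is expected for this toy loss; the point is only to show how $\partial L / \partial \sigma_{ij}$ contracts with $\partial \sigma_{ij} / \partial z_{pq}$.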
Remark: We are going to focus on computing $\frac{\partial \sigma_{ij}}{\partial z_{pq}}$.
$$\frac{\partial \sigma_{ij}}{\partial z_{pq}} =
\begin{cases}
\sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } i=p, j=q \\
-\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{i,:})_{j} & \text{if } i=p, j \neq q \\
0 & \text{otherwise}
\end{cases}$$
Since all terms with $i \neq p$ vanish, the double sum collapses to a single sum over $j$:

$$\begin{align*}
\frac{\partial L}{\partial z_{pq}} &=
\sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\frac{\partial \sigma_{pj}}{\partial z_{pq}} \\
&= \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\begin{cases}
\sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } j=q \\
-\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{p,:})_{j} & \text{if } j \neq q
\end{cases} \\
&= \sigma(\mathbf{z}_{p,:})_{q} \left(
\frac{\partial L}{\partial \sigma_{pq}}
- \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\sigma(\mathbf{z}_{p,:})_{j}
\right)
\end{align*}$$
$$\frac{\partial L}{\partial \mathbf{Z}} =
\mathbf{\Sigma} \odot \left(
\frac{\partial L}{\partial \mathbf{\Sigma}}
- \left(
\frac{\partial L}{\partial \mathbf{\Sigma}}
\odot \mathbf{\Sigma}
\right) \mathbf{1}
\right)$$

where $\mathbf{1} \in \mathbb{R}^{n_{1} \times n_{1}}$ is the matrix of ones.
def loss_function(a: np.ndarray) -> np.ndarray:
    return np.sum(a ** 2)

loss_soft_grad = jacobian(lambda z: loss_function(my_softmax_2(z)))(Z)
loss_soft_grad.shape

loss_grad = jacobian(loss_function)(soft_out_2)
loss_grad.shape
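The cell that builds the analytic gradient my_loss_soft_grad (compared against autograd below) is not shown above; a reconstruction from the boxed formula might look like this sketch, where the row-wise sum plays the role of right-multiplying by the ones matrix $\mathbf{1}$:

# sketch: dL/dZ = Sigma ⊙ (dL/dSigma - (dL/dSigma ⊙ Sigma) 1), from the formula above
# the keepdims row-sum stands in for multiplication by the ones matrix
my_loss_soft_grad = soft_out_2 * (
    loss_grad - np.sum(loss_grad * soft_out_2, axis=1, keepdims=True)
)
my_loss_soft_grad.shape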
mape(
    my_loss_soft_grad,
    loss_soft_grad
)

Now let us scale up the inputs and see what happens numerically:

z_flow = Z[:5] * 100
z_flow.shape

Evaluating my_softmax_2 on z_flow now overflows:

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/autograd/tracer.py:54: RuntimeWarning: overflow encountered in exp
  return f_raw(*args, **kwargs)
/tmp/ipykernel_2426/706199401.py:3: RuntimeWarning: invalid value encountered in divide
  return exp / np.sum(exp, axis=1, keepdims=True)

array([[ 0., nan, nan, nan,  0.],
       [ 0.,  0.,  0.,  0., nan],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0., nan,  0.,  0.,  0.],
       [ 0., nan,  0., nan,  0.]])
If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined.

Reference: Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, p. 81.

To fix this, we can subtract the maximum before exponentiating:

$$\sigma(\mathbf{z}) = \sigma(\mathbf{z} - \max_{i} \mathbf{z}_{i})$$

Analytically nothing changes, because the softmax of $\mathbf{z}$ is equal to the softmax of $\mathbf{z} - \max_i \mathbf{z}_i$.
my_softmax_2(z_flow - np.max(z_flow, axis=1, keepdims=True))

array([[0.00000000e+000, 7.17509597e-066, 1.38389653e-087,
        1.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 1.00000000e+000],
       [1.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000,
        3.69388307e-196, 0.00000000e+000]])
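If we prefer to bake the trick into the function itself, a stable variant of my_softmax_2 might look like the sketch below (my_softmax_stable is not a name from the original notebook):

def my_softmax_stable(z: np.ndarray) -> np.ndarray:
    # subtract the row-wise maximum before exponentiating to avoid overflow
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

# same values as my_softmax_2 where it does not overflow, but finite on z_flow
my_softmax_stable(z_flow)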
For reference, here are the full derivations of the two cases, starting from the quotient rule:

$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r=q}} &=
\frac{\exp(z_{q})\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right) - \exp(z_{q})^{2}}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= \frac{\exp(z_{q})\left(\sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q})\right)}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\left(
\frac{\sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q})}
{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= \sigma(\mathbf{z})_{q} \left(
\frac{\sum_{k=1}^{n_{1}}\exp(z_{k})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} -
\frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= \sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q})
\end{align*}$$
$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &=
- \frac{\exp(z_{q}) \exp(z_{r})}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= - \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \left(
\frac{\exp(z_{r})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= - \sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}
\end{align*}$$
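As a final sanity check (not a cell from the original notebook), the closed-form Jacobian for a single sample can also be compared against a simple finite-difference approximation; the maximum difference should be tiny, limited by the step eps:

# finite-difference check of the closed-form cases on a single sample
eps = 1e-6
z_small = Z[:1]
analytic = my_der_softmax_1(z_small)     # analytic[q, r] = dσ_q / dz_r
numeric = np.zeros_like(analytic)
for r in range(CLASSES):
    z_plus = z_small.copy()
    z_plus[0, r] += eps
    # column r: how every σ_q moves when z_r is perturbed
    numeric[:, r] = (my_softmax_1(z_plus) - my_softmax_1(z_small)).squeeze() / eps

np.abs(analytic - numeric).max()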