Softmax function and its derivative
December 9, 2025
It is easy to find an explanation of the derivative of the softmax function for a single sample with $n$ features, but an explanation for multiple samples with $n$ features each is much harder to find. Here you will find that derivative and its vectorized version, which makes its computation efficient.
$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}}
\Rightarrow \cdots \Rightarrow
\frac{\mathrm{d} \mathbf{\Sigma}}{\mathrm{d} \mathbf{Z}}$$

from autograd import jacobian, numpy as np
from platform import python_version

python_version()

We are going to use autograd to compare our from-scratch implementation against the automatic differentiation implementation.
Mean absolute percentage error
# This cell imports numpy_mape
# if you are running this notebook locally
# or from Google Colab.
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.numpy_metrics import np_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess
    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/numpy_metrics.py'
    local_file = 'numpy_metrics.py'
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from numpy_metrics import np_mape as mape  # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)

M: int = 1000      # number of samples
CLASSES: int = 5   # number of classes

Z = np.random.randint(-30, 31, (M, CLASSES)) / 2
Z.shape, Z.dtype

((1000, 5), dtype('float64'))
For a single sample with $n_{1}$ features,

$$\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$$

We will represent $\text{softmax}$ as $\sigma$. Then

$$\sigma(\mathbf{z})_{q} =
\frac{\exp (z_{q})}
{\sum_{k=1}^{n_{1}} \exp (z_{k})}
\in \mathbb{R}^{+}$$

where $n_{1}$ is the number of classes, and

$$\sigma(\mathbf{z}) = \begin{bmatrix}
\sigma(\mathbf{z})_1 & \sigma(\mathbf{z})_2 & \cdots & \sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}
\in \mathbb{R}^{1 \times n_{1}}$$

from scipy.special import softmax

soft_out_1 = softmax(Z[:1])
soft_out_1, soft_out_1.shape

(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01,
         8.39176173e-13]]),
 (1, 5))
# our softmax function
def my_softmax_1(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp)

my_soft_out_1 = my_softmax_1(Z[:1])
my_soft_out_1, my_soft_out_1.shape

(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01,
         8.39176173e-13]]),
 (1, 5))
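The comparisons below rely on np_mape from tools/numpy_metrics.py. If that module is unavailable, a minimal stand-in might look like the following sketch (this assumes the usual mean absolute percentage error in percent; the real np_mape may handle edge cases differently):

def mape_sketch(pred: np.ndarray, target: np.ndarray) -> float:
    # mean absolute percentage error, in percent; assumes target has no zeros
    return float(np.mean(np.abs((pred - target) / target)) * 100)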
mape(
    my_soft_out_1,
    soft_out_1
)

For an input with $m$ samples and $n_{1}$ features,

$$\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$$

$$\mathbf{Z} = \begin{bmatrix}
\mathbf{z}_{1,:} \\
\mathbf{z}_{2,:} \\
\vdots \\
\mathbf{z}_{m,:}
\end{bmatrix}$$

then the softmax over each sample is

$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix}
\sigma(\mathbf{z}_{1,:}) \\
\sigma(\mathbf{z}_{2,:}) \\
\vdots \\
\sigma(\mathbf{z}_{m,:})
\end{bmatrix}
\in \mathbb{R}^{m \times n_{1}}$$

Note: We use $\mathbf{\Sigma}(\mathbf{Z})$ to denote that it is a matrix.
soft_out_2 = softmax(Z, axis=1)
soft_out_2.shape

def my_softmax_2(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp, axis=1, keepdims=True)

my_soft_out_2 = my_softmax_2(Z)
my_soft_out_2.shape

mape(
    my_soft_out_2,
    soft_out_2
)

derivative of a softmax with respect to a feature

n_selected: int = 3  # select a feature to differentiate with respect to

def my_softmax_0(z: np.ndarray, j_feature: int) -> float:
    exp = np.exp(z)
    return exp[0, j_feature] / np.sum(exp)

grad_0 = jacobian(my_softmax_0, 0)(
    Z[:1],
    n_selected
)
grad_0

array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02,
         1.94245121e-01, -6.17738318e-13]])
$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} \in \mathbb{R}^{1 \times n_{1}}$$

because $\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$ and $\sigma(\mathbf{z})_{q} \in \mathbb{R}$.

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\begin{bmatrix}
\frac{\partial \sigma_{q}}{\partial z_{1}} &
\frac{\partial \sigma_{q}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{q}}{\partial z_{n_{1}}}
\end{bmatrix}$$

There are two different types of derivatives:

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}}
\qquad \text{and} \qquad
\frac{\partial \sigma_{q}}{\partial z_{r\neq q}}$$

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}} =
\sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q})$$

$$\frac{\partial \sigma_{q}}{\partial z_{r\neq q}} =
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}$$

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\begin{bmatrix}
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{1} &
\cdots &
\sigma(\mathbf{z})_{q}(1 - \sigma(\mathbf{z})_{q}) &
\cdots &
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}$$

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} =
\sigma(\mathbf{z})_{q}
\begin{bmatrix}
-\sigma(\mathbf{z})_{1} &
\cdots &
1 - \sigma(\mathbf{z})_{q} &
\cdots &
-\sigma(\mathbf{z})_{n_{1}}
\end{bmatrix}$$

def my_der_softmax_0(z: np.ndarray, j_feature: int) -> np.ndarray:
    soft = my_softmax_1(z)
    soft_j = soft[0, j_feature]
    soft *= -1
    soft[0, j_feature] += 1
    return soft_j * soft

my_grad_0 = my_der_softmax_0(Z[:1], n_selected)
my_grad_0

array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02,
         1.94245121e-01, -6.17738318e-13]])
mape(
    my_grad_0,
    grad_0
)

derivative of a softmax with respect to a sample

grad_1 = jacobian(my_softmax_1, 0)(Z[:1])
grad_1.shape

$$\frac{\partial \sigma}{\partial \mathbf{z}} \in
\mathbb{R}^{(1 \times n_{1}) \times (1 \times n_{1})}$$

To simplify the derivative, we will ignore the axes of size 1 for now:

$$\frac{\partial \sigma}{\partial \mathbf{z}} \in
\mathbb{R}^{n_{1} \times n_{1}}$$
$$\begin{align*}
\frac{\partial \sigma}{\partial \mathbf{z}} &=
\begin{bmatrix}
\frac{\partial \sigma_{1}}{\partial \mathbf{z}} \\
\frac{\partial \sigma_{2}}{\partial \mathbf{z}} \\
\vdots \\
\frac{\partial \sigma_{n_{1}}}{\partial \mathbf{z}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{\partial \sigma_{1}}{\partial z_{1}} &
\frac{\partial \sigma_{1}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{1}}{\partial z_{n_{1}}} \\
\frac{\partial \sigma_{2}}{\partial z_{1}} &
\frac{\partial \sigma_{2}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{2}}{\partial z_{n_{1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \sigma_{n_{1}}}{\partial z_{1}} &
\frac{\partial \sigma_{n_{1}}}{\partial z_{2}} &
\cdots &
\frac{\partial \sigma_{n_{1}}}{\partial z_{n_{1}}}
\end{bmatrix}
\end{align*}$$
$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r=q}} &=
\sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q}) \\
\frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &=
-\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}
\end{align*}$$
$$\frac{\partial \sigma}{\partial \mathbf{z}} =
\begin{bmatrix}
\sigma(\mathbf{z})_{1} (1 - \sigma(\mathbf{z})_{1}) &
-\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{2} &
\cdots &
-\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{n_{1}} \\
-\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{1} &
\sigma(\mathbf{z})_{2} (1 - \sigma(\mathbf{z})_{2}) &
\cdots &
-\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
-\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{1} &
-\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{2} &
\cdots &
\sigma(\mathbf{z})_{n_{1}} (1 - \sigma(\mathbf{z})_{n_{1}})
\end{bmatrix}$$

$$\frac{\partial \sigma}{\partial \mathbf{z}} =
\text{diag}(\sigma(\mathbf{z})) - \sigma(\mathbf{z})^\top \sigma(\mathbf{z})$$
def my_der_softmax_1(z: np.ndarray) -> np.ndarray:
    soft = my_softmax_1(z).squeeze()  # squeeze is necessary for np.diag and np.outer to work
    return np.diag(soft) - np.outer(soft, soft)

my_grad_1 = my_der_softmax_1(Z[:1])
my_grad_1.shape

mape(
    my_grad_1,
    grad_1.squeeze()
)

derivative of multiple softmaxes with respect to multiple samples

$\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$, where $m$ is the number of samples.
$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix}
\sigma(\mathbf{z}_{1,:}) \\
\sigma(\mathbf{z}_{2,:}) \\
\vdots \\
\sigma(\mathbf{z}_{m,:})
\end{bmatrix} \in \mathbb{R}^{m \times n_{1}}$$

$$\mathbf{z}_{i,:} = \begin{bmatrix}
z_{i1} & z_{i2} & \cdots & z_{in_{1}}
\end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

for all $i = 1, \ldots, m$.

$$\sigma(\mathbf{z}_{i,:}) = \begin{bmatrix}
\sigma(\mathbf{z}_{i,:})_{1} &
\sigma(\mathbf{z}_{i,:})_{2} &
\cdots &
\sigma(\mathbf{z}_{i,:})_{n_{1}}
\end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

grad_2 = jacobian(my_softmax_2, 0)(Z)
grad_2.shape

$$\frac{\mathrm{d} {\color{Cyan} {\mathbf{\Sigma}}}}
{\mathrm{d} {\color{Orange} {\mathbf{Z}}}} \in
\mathbb{R}^{{\color{Cyan} {(m \times n_{1})}}
\times {\color{Orange} {(m \times n_{1})}}}$$

$$\frac{\partial {\color{Cyan} {\mathbf{\Sigma}_{pq}}}}
{\partial {\color{Orange} {\mathbf{Z}_{ij}}}} \in
\mathbb{R}^{{\color{Cyan} {(1 \times 1)}}
\times {\color{Orange} {(1 \times 1)}}}$$

Therefore, this elementwise derivative is

$$\frac{\partial \mathbf{\Sigma}_{pq}}
{\partial \mathbf{Z}_{ij}} =
\begin{cases}
\sigma(\mathbf{Z})_{pq}(1 - \sigma(\mathbf{Z})_{ij}) & \text{if } p=i, q=j \\
-\sigma(\mathbf{Z})_{pq} \sigma(\mathbf{Z})_{ij} & \text{if } p=i, q\neq j \\
0 & \text{if } p\neq i
\end{cases}$$

for all $p, i = 1, \ldots, m$ and $q, j = 1, \ldots, n_{1}$.

Note: the first two cases look similar to $\frac{\partial \sigma_{q}}{\partial \mathbf{z}}$.
def my_der_softmax_low_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    soft = my_softmax_2(z)
    der = np.zeros((m, classes, m, classes), dtype=soft.dtype)
    for i in range(m):
        for q in range(classes):
            for j in range(classes):
                if q == j:
                    der[i, q, i, j] = soft[i, q] * (1 - soft[i, q])
                else:
                    der[i, q, i, j] = -soft[i, q] * soft[i, j]
    return der

my_grad_low_2 = my_der_softmax_low_2(Z)
my_grad_low_2.shape

mape(
    my_grad_low_2,
    grad_2
)

This solution is too slow: it runs $\Theta(m n_{1}^{2})$ Python-level loop iterations. But we can observe some similarities between this derivative and a previous one:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{p,:}}
\approx \frac{\partial \sigma}
{\partial \mathbf{z}}$$

$$\frac{\partial {\color{Cyan} {\mathbf{\Sigma}_{p,:}}}}
{\partial {\color{Orange} {\mathbf{Z}_{i,:}}}} \in
\mathbb{R}^{{\color{Cyan} {(1 \times n_{1})}}
\times {\color{Orange} {(1 \times n_{1})}}}$$

Remark: yes, here we do need the axes of size 1.
There are two cases:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i=p,:}}
\qquad \text{and} \qquad
\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i\neq p,:}}$$

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i=p,:}} =
\text{diag}(\sigma(\mathbf{Z}_{p,:}))
- \sigma(\mathbf{Z}_{p,:})^\top \sigma(\mathbf{Z}_{p,:})$$

$$\frac{\partial \mathbf{\Sigma}_{p,:}}
{\partial \mathbf{Z}_{i \neq p,:}} = \mathbf{0}$$
def my_der_softmax_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    der = np.zeros((m, classes, m, classes), dtype=z.dtype)
    for i in range(m):
        der[i, :, i, :] = my_der_softmax_1(z[np.newaxis, i])
    return der

my_grad_2 = my_der_softmax_2(Z)
my_grad_2.shape

mape(
    my_grad_2,
    grad_2
)

Gradient using loss function

We can often use properties of the loss function to optimize our gradients.
For any loss function

$$L: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}$$

we can compute the derivative using the chain rule

$$\frac{\partial L}{\partial z_{pq}} =
\sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{ij}}
\frac{\partial \sigma_{ij}}{\partial z_{pq}}$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.
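As a numerical illustration of this chain rule (not a cell from the original notebook), the double sum can be evaluated directly against the full Jacobian grad_2 computed earlier. The loss here is the hypothetical $L = \sum_{i,j} \Sigma_{ij}$, whose upstream gradient $\partial L / \partial \mathbf{\Sigma}$ is a matrix of ones:

# hypothetical upstream gradient dL/dSigma for L = sum of all entries of Sigma
ones_grad = np.ones((M, CLASSES))

# chain rule: dL/dZ_{pq} = sum_{i,j} (dL/dSigma_{ij}) * (dSigma_{ij}/dZ_{pq})
dL_dZ = np.einsum('ij,ijpq->pq', ones_grad, grad_2)

# each softmax row sums to 1, so this particular L is constant and its gradient is ~0
np.abs(dL_dZ).max()

The result being numerically zero is expected for this toy loss; the point is only to show how $\partial L / \partial \sigma_{ij}$ contracts with $\partial \sigma_{ij} / \partial z_{pq}$.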
Remark: We are going to focus on computing $\frac{\partial \sigma_{ij}}{\partial z_{pq}}$.
$$\frac{\partial \sigma_{ij}}{\partial z_{pq}} =
\begin{cases}
\sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } i=p, j=q \\
-\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{i,:})_{j} & \text{if } i=p, j \neq q \\
0 & \text{otherwise}
\end{cases}$$
Since all terms with $i \neq p$ vanish, the double sum collapses to a single sum over $j$:

$$\begin{align*}
\frac{\partial L}{\partial z_{pq}} &=
\sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\frac{\partial \sigma_{pj}}{\partial z_{pq}} \\
&= \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\begin{cases}
\sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } j=q \\
-\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{p,:})_{j} & \text{if } j \neq q
\end{cases} \\
&= \sigma(\mathbf{z}_{p,:})_{q} \left(
\frac{\partial L}{\partial \sigma_{pq}}
- \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \sigma_{pj}}
\sigma(\mathbf{z}_{p,:})_{j}
\right)
\end{align*}$$
$$\frac{\partial L}{\partial \mathbf{Z}} =
\mathbf{\Sigma} \odot \left(
\frac{\partial L}{\partial \mathbf{\Sigma}}
- \left(
\frac{\partial L}{\partial \mathbf{\Sigma}}
\odot \mathbf{\Sigma}
\right) \mathbf{1}
\right)$$

where $\mathbf{1} \in \mathbb{R}^{n_{1} \times n_{1}}$ is the matrix of ones.
def loss_function(a: np.ndarray) -> np.ndarray:
    return np.sum(a ** 2)

loss_soft_grad = jacobian(lambda z: loss_function(my_softmax_2(z)))(Z)
loss_soft_grad.shape

loss_grad = jacobian(loss_function)(soft_out_2)
loss_grad.shape
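The cell that builds the analytic gradient my_loss_soft_grad (compared against autograd below) is not shown above; a reconstruction from the boxed formula might look like this sketch, where the row-wise sum plays the role of right-multiplying by the ones matrix $\mathbf{1}$:

# sketch: dL/dZ = Sigma ⊙ (dL/dSigma - (dL/dSigma ⊙ Sigma) 1), from the formula above
# the keepdims row-sum stands in for multiplication by the ones matrix
my_loss_soft_grad = soft_out_2 * (
    loss_grad - np.sum(loss_grad * soft_out_2, axis=1, keepdims=True)
)
my_loss_soft_grad.shape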
mape(
    my_loss_soft_grad,
    loss_soft_grad
)

Now let us scale up the inputs and see what happens numerically:

z_flow = Z[:5] * 100
z_flow.shape

Evaluating my_softmax_2 on z_flow now overflows:

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/autograd/tracer.py:54: RuntimeWarning: overflow encountered in exp
  return f_raw(*args, **kwargs)
/tmp/ipykernel_2426/706199401.py:3: RuntimeWarning: invalid value encountered in divide
  return exp / np.sum(exp, axis=1, keepdims=True)

array([[ 0., nan, nan, nan,  0.],
       [ 0.,  0.,  0.,  0., nan],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0., nan,  0.,  0.,  0.],
       [ 0., nan,  0., nan,  0.]])
If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined.

Reference: Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, p. 81.

To fix this, we can subtract the maximum before exponentiating:

$$\sigma(\mathbf{z}) = \sigma(\mathbf{z} - \max_{i} \mathbf{z}_{i})$$

Analytically nothing changes, because the softmax of $\mathbf{z}$ is equal to the softmax of $\mathbf{z} - \max_i \mathbf{z}_i$.
my_softmax_2(z_flow - np.max(z_flow, axis=1, keepdims=True))

array([[0.00000000e+000, 7.17509597e-066, 1.38389653e-087,
        1.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 1.00000000e+000],
       [1.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000,
        3.69388307e-196, 0.00000000e+000]])
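If we prefer to bake the trick into the function itself, a stable variant of my_softmax_2 might look like the sketch below (my_softmax_stable is not a name from the original notebook):

def my_softmax_stable(z: np.ndarray) -> np.ndarray:
    # subtract the row-wise maximum before exponentiating to avoid overflow
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

# same values as my_softmax_2 where it does not overflow, but finite on z_flow
my_softmax_stable(z_flow)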
For reference, here are the full derivations of the two cases, starting from the quotient rule:

$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r=q}} &=
\frac{\exp(z_{q})\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right) - \exp(z_{q})^{2}}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= \frac{\exp(z_{q})\left(\sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q})\right)}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\left(
\frac{\sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q})}
{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= \sigma(\mathbf{z})_{q} \left(
\frac{\sum_{k=1}^{n_{1}}\exp(z_{k})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} -
\frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= \sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q})
\end{align*}$$
$$\begin{align*}
\frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &=
- \frac{\exp(z_{q}) \exp(z_{r})}
{\left(\sum_{k=1}^{n_{1}}\exp(z_{k})\right)^{2}} \\
&= - \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \left(
\frac{\exp(z_{r})}{\sum_{k=1}^{n_{1}}\exp(z_{k})}
\right) \\
&= - \sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}
\end{align*}$$
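As a final sanity check (not a cell from the original notebook), the closed-form Jacobian for a single sample can also be compared against a simple finite-difference approximation; the maximum difference should be tiny, limited by the step eps:

# finite-difference check of the closed-form cases on a single sample
eps = 1e-6
z_small = Z[:1]
analytic = my_der_softmax_1(z_small)     # analytic[q, r] = dσ_q / dz_r
numeric = np.zeros_like(analytic)
for r in range(CLASSES):
    z_plus = z_small.copy()
    z_plus[0, r] += eps
    # column r: how every σ_q moves when z_r is perturbed
    numeric[:, r] = (my_softmax_1(z_plus) - my_softmax_1(z_small)).squeeze() / eps

np.abs(analytic - numeric).max()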