
Softmax function and its derivative

It is easy to find an explanation of the derivative of the Softmax function for a single sample with $n$ features, but finding one for multiple samples with $n$ features is harder. Here you will find the derivative and its vectorized version, which optimizes its computation.

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} \Rightarrow \cdots \Rightarrow \frac{\mathrm{d} \mathbf{\Sigma}}{\mathrm{d} \mathbf{Z}}$$
from autograd import jacobian, numpy as np

from platform import python_version
python_version()
'3.12.12'

We are going to use ${\color{Cyan}{\text{autograd}}}$ to compare our from-scratch implementation against the automatic differentiation implementation.
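As a quick illustration of the autograd API we rely on (a toy sketch, not part of the notebook's comparison), jacobian(f) returns a function that evaluates the Jacobian of f at a given point:

# toy sketch: the Jacobian of an elementwise function is diagonal
f = lambda x: np.tanh(x)
jacobian(f)(np.array([0.0, 1.0, 2.0])).shape  # (3, 3)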

Mean absolute percentage error

# This cell imports np_mape (aliased as mape),
# whether you are running this notebook locally
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.numpy_metrics import np_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/numpy_metrics.py'
    local_file = 'numpy_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from numpy_metrics import np_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.
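If neither import succeeds, a minimal fallback in the spirit of np_mape could look like the sketch below. This is an assumption about what np_mape computes (mean absolute percentage error between a prediction and a reference), not the actual tools.numpy_metrics implementation.

# hypothetical fallback, assuming np_mape returns mean(|pred - true| / |true|) as a percentage
def mape_fallback(pred: np.ndarray, true: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - true) / np.abs(true)) * 100)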

Softmax function

M: int = 1000 # number of samples
CLASSES: int = 5 # number of classes
Z = np.random.randint(-30, 31, (M, CLASSES)) / 2
Z.shape, Z.dtype
((1000, 5), dtype('float64'))

one example

For a single sample with $n_{1}$ features

$$\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$$

We will represent $\text{softmax}$ as $\sigma$. Then

$$\sigma(\mathbf{z})_{q} = \frac{\exp (z_{q})} {\sum_{k=1}^{n_{1}} \exp (z_{k})} \in \mathbb{R}^{+}$$

where $n_{1}$ is the number of classes

$$\sigma(\mathbf{z}) = \begin{bmatrix} \sigma(\mathbf{z})_1 & \sigma(\mathbf{z})_2 & \cdots & \sigma(\mathbf{z})_{n_{1}} \end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$
from scipy.special import softmax

soft_out_1 = softmax(Z[:1])
soft_out_1, soft_out_1.shape
(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01, 8.39176173e-13]]), (1, 5))
# our softmax function
def my_softmax_1(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp)

my_soft_out_1 = my_softmax_1(Z[:1])
my_soft_out_1, my_soft_out_1.shape
(array([[1.84840853e-08, 1.64251625e-01, 9.96236462e-02, 7.36124711e-01, 8.39176173e-13]]), (1, 5))
mape(
    my_soft_out_1,
    soft_out_1
)
9.976117231529676e-15

multiple examples

For an input with $m$ samples and $n_{1}$ features, $\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$

$$\mathbf{Z} = \begin{bmatrix} \mathbf{z}_{1,:} \\ \mathbf{z}_{2,:} \\ \vdots \\ \mathbf{z}_{m,:} \end{bmatrix}$$

then, the softmax over each sample is

$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix} \sigma(\mathbf{z}_{1,:}) \\ \sigma(\mathbf{z}_{2,:}) \\ \vdots \\ \sigma(\mathbf{z}_{m,:}) \end{bmatrix} \in \mathbb{R}^{m \times n_{1}}$$

Note: We use $\mathbf{\Sigma}(\mathbf{Z})$ to denote that it is a matrix.

soft_out_2 = softmax(Z, axis=1)
soft_out_2.shape
(1000, 5)
def my_softmax_2(z: np.ndarray) -> np.ndarray:
    exp = np.exp(z)
    return exp / np.sum(exp, axis=1, keepdims=True)

my_soft_out_2 = my_softmax_2(Z)
my_soft_out_2.shape
(1000, 5)
mape(
    my_soft_out_2,
    soft_out_2
)
1.1287899276384107e-14

Gradient

derivative of a softmax feature with respect to a sample

n_selected: int = 3 # output feature q to differentiate

def my_softmax_0(z: np.ndarray, j_feature: int) -> float:
    exp = np.exp(z)
    return exp[0, j_feature] / np.sum(exp)
grad_0 = jacobian(my_softmax_0, 0)(
    Z[:1],
    n_selected
)
grad_0
array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02, 1.94245121e-01, -6.17738318e-13]])
$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} \in \mathbb{R}^{1 \times n_{1}}$$

because $\mathbf{z} \in \mathbb{R}^{1 \times n_{1}}$ and $\sigma(\mathbf{z})_{q} \in \mathbb{R}$.

Then its Jacobian is

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} = \begin{bmatrix} \frac{\partial \sigma_{q}}{\partial z_{1}} & \frac{\partial \sigma_{q}}{\partial z_{2}} & \cdots & \frac{\partial \sigma_{q}}{\partial z_{n_{1}}} \end{bmatrix}$$

there are two different cases for these derivatives:

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}}$$

and

$$\frac{\partial \sigma_{q}}{\partial z_{r\neq q}}$$

First case:

$$\frac{\partial \sigma_{q}}{\partial z_{r=q}} = \sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q})$$

Second case:

$$\frac{\partial \sigma_{q}}{\partial z_{r\neq q}} = -\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r}$$

Therefore

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} = \begin{bmatrix} -\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{1} & \cdots & \sigma(\mathbf{z})_{q}(1 - \sigma(\mathbf{z})_{q}) & \cdots & -\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{n_{1}} \end{bmatrix}$$

or in vectorized form

$$\frac{\partial \sigma_{q}}{\partial \mathbf{z}} = \sigma(\mathbf{z})_{q} \begin{bmatrix} -\sigma(\mathbf{z})_{1} & \cdots & 1 - \sigma(\mathbf{z})_{q} & \cdots & -\sigma(\mathbf{z})_{n_{1}} \end{bmatrix}$$
def my_der_softmax_0(z: np.ndarray, j_feature: int) -> np.ndarray:
    soft = my_softmax_1(z)
    soft_j = soft[0, j_feature]  # sigma(z)_q
    soft *= -1                   # -sigma(z)_r for every r
    soft[0, j_feature] += 1      # 1 - sigma(z)_q at position q
    return soft_j * soft

my_grad_0 = my_der_softmax_0(Z[:1], n_selected)
my_grad_0
array([[-1.36065919e-08, -1.20909680e-01, -7.33354278e-02, 1.94245121e-01, -6.17738318e-13]])
mape(
    my_grad_0,
    grad_0
)
7.585056954670048e-15

derivative of a softmax with respect to a sample

grad_1 = jacobian(my_softmax_1, 0)(Z[:1])
grad_1.shape
(1, 5, 1, 5)
$$\frac{\partial \sigma}{\partial \mathbf{z}} \in \mathbb{R}^{(1 \times n_{1}) \times (1 \times n_{1})}$$

to simplify the derivative, we will ignore the axes of size 1 for now

$$\frac{\partial \sigma}{\partial \mathbf{z}} \in \mathbb{R}^{n_{1} \times n_{1}}$$

this derivative is

$$\begin{align*} \frac{\partial \sigma}{\partial \mathbf{z}} &= \begin{bmatrix} \frac{\partial \sigma_{1}}{\partial \mathbf{z}} \\ \frac{\partial \sigma_{2}}{\partial \mathbf{z}} \\ \vdots \\ \frac{\partial \sigma_{n_{1}}}{\partial \mathbf{z}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial \sigma_{1}}{\partial z_{1}} & \frac{\partial \sigma_{1}}{\partial z_{2}} & \cdots & \frac{\partial \sigma_{1}}{\partial z_{n_{1}}} \\ \frac{\partial \sigma_{2}}{\partial z_{1}} & \frac{\partial \sigma_{2}}{\partial z_{2}} & \cdots & \frac{\partial \sigma_{2}}{\partial z_{n_{1}}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \sigma_{n_{1}}}{\partial z_{1}} & \frac{\partial \sigma_{n_{1}}}{\partial z_{2}} & \cdots & \frac{\partial \sigma_{n_{1}}}{\partial z_{n_{1}}} \end{bmatrix} \end{align*}$$

where

$$\begin{align*} \frac{\partial \sigma_{q}}{\partial z_{r=q}} &= \sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q}) \\ \frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &= -\sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r} \end{align*}$$

therefore

$$\frac{\partial \sigma}{\partial \mathbf{z}} = \begin{bmatrix} \sigma(\mathbf{z})_{1} (1 - \sigma(\mathbf{z})_{1}) & -\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{2} & \cdots & -\sigma(\mathbf{z})_{1} \sigma(\mathbf{z})_{n_{1}} \\ -\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{1} & \sigma(\mathbf{z})_{2} (1 - \sigma(\mathbf{z})_{2}) & \cdots & -\sigma(\mathbf{z})_{2} \sigma(\mathbf{z})_{n_{1}} \\ \vdots & \vdots & \ddots & \vdots \\ -\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{1} & -\sigma(\mathbf{z})_{n_{1}} \sigma(\mathbf{z})_{2} & \cdots & \sigma(\mathbf{z})_{n_{1}} (1 - \sigma(\mathbf{z})_{n_{1}}) \end{bmatrix}$$

or in vectorized form

$$\frac{\partial \sigma}{\partial \mathbf{z}} = \text{diag}(\sigma(\mathbf{z})) - \sigma(\mathbf{z}) \sigma(\mathbf{z})^\top$$
def my_der_softmax_1(z: np.ndarray) -> np.ndarray:
    soft = my_softmax_1(z).squeeze() # squeeze is necessary for np.diag and np.outer to work
    return np.diag(soft) - np.outer(soft, soft)

my_grad_1 = my_der_softmax_1(Z[:1])
my_grad_1.shape
(5, 5)
mape(
    my_grad_1,
    grad_1.squeeze()
)
7.273094613002547e-15

derivative of multiple softmaxes with respect to multiple samples

Problem statement

$$\mathbf{Z} \in \mathbb{R}^{m \times n_{1}}$$

where $m$ is the number of samples.

Then the softmax function is

$$\mathbf{\Sigma}(\mathbf{Z}) = \begin{bmatrix} \sigma(\mathbf{z}_{1,:}) \\ \sigma(\mathbf{z}_{2,:}) \\ \vdots \\ \sigma(\mathbf{z}_{m,:}) \end{bmatrix} \in \mathbb{R}^{m \times n_{1}}$$

where

$$\mathbf{z}_{i,:} = \begin{bmatrix} z_{i1} & z_{i2} & \cdots & z_{in_{1}} \end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

for all $i = 1, \ldots, m$.

therefore

$$\sigma(\mathbf{z}_{i,:}) = \begin{bmatrix} \sigma(\mathbf{z}_{i,:})_{1} & \sigma(\mathbf{z}_{i,:})_{2} & \cdots & \sigma(\mathbf{z}_{i,:})_{n_{1}} \end{bmatrix} \in \mathbb{R}^{1 \times n_{1}}$$

derivative

grad_2 = jacobian(my_softmax_2, 0)(Z)
grad_2.shape
(1000, 5, 1000, 5)

we want to compute

$$\frac{\mathrm{d} {\color{Cyan}{\mathbf{\Sigma}}}}{\mathrm{d} {\color{Orange}{\mathbf{Z}}}} \in \mathbb{R}^{{\color{Cyan}{(m \times n_{1})}} \times {\color{Orange}{(m \times n_{1})}}}$$

where

$$\frac{\partial {\color{Cyan}{\mathbf{\Sigma}_{pq}}}}{\partial {\color{Orange}{\mathbf{Z}_{ij}}}} \in \mathbb{R}^{{\color{Cyan}{(1 \times 1)}} \times {\color{Orange}{(1 \times 1)}}}$$

therefore, this elementwise derivative is

$$\frac{\partial \mathbf{\Sigma}_{pq}}{\partial \mathbf{Z}_{ij}} = \begin{cases} \sigma(\mathbf{Z})_{pq}(1 - \sigma(\mathbf{Z})_{ij}) & \text{if } p=i, q=j \\ -\sigma(\mathbf{Z})_{pq} \sigma(\mathbf{Z})_{ij} & \text{if } p=i, q\neq j \\ 0 & \text{if } p\neq i \end{cases}$$

for all $p, i = 1, \ldots, m$ and $q, j = 1, \ldots, n_{1}$.

Note: the first two cases look similar to $\frac{\partial \sigma_{q}}{\partial \mathbf{z}}$.

def my_der_softmax_low_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    soft = my_softmax_2(z)

    der = np.zeros((m, classes, m, classes), dtype=soft.dtype)

    for i in range(m):
        for q in range(classes):
            for j in range(classes):
                if q == j:
                    der[i, q, i, j] = soft[i, q] * (1 - soft[i, q])
                else:
                    der[i, q, i, j] = -soft[i, q] * soft[i, j]
    return der

my_grad_low_2 = my_der_softmax_low_2(Z)
my_grad_low_2.shape
(1000, 5, 1000, 5)
mape(
    my_grad_low_2,
    grad_2
)
5.863933837043159e-11

This solution is too slow: its nested Python loops run $\Theta(m n_{1}^{2})$ iterations. However, we can observe a similarity between this derivative and a previous one

$$\frac{\partial \mathbf{\Sigma}_{p,:}}{\partial \mathbf{Z}_{p,:}} \approx \frac{\partial \sigma}{\partial \mathbf{z}}$$

where

$$\frac{\partial {\color{Cyan}{\mathbf{\Sigma}_{p,:}}}}{\partial {\color{Orange}{\mathbf{Z}_{i,:}}}} \in \mathbb{R}^{{\color{Cyan}{(1 \times n_{1})}} \times {\color{Orange}{(1 \times n_{1})}}}$$

Remark: yes, here we do need the axes of size 1.

Then we have 2 cases:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}{\partial \mathbf{Z}_{i=p,:}}$$

and

$$\frac{\partial \mathbf{\Sigma}_{p,:}}{\partial \mathbf{Z}_{i\neq p,:}}$$

First case:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}{\partial \mathbf{Z}_{i=p,:}} = \text{diag}(\sigma(\mathbf{Z}_{p,:})) - \sigma(\mathbf{Z}_{p,:}) \sigma(\mathbf{Z}_{p,:})^\top$$

Second case:

$$\frac{\partial \mathbf{\Sigma}_{p,:}}{\partial \mathbf{Z}_{i \neq p,:}} = \mathbf{0}$$
def my_der_softmax_2(z: np.ndarray) -> np.ndarray:
    m, classes = z.shape
    der = np.zeros((m, classes, m, classes), dtype=z.dtype)

    for i in range(m):
        der[i, :, i, :] = my_der_softmax_1(z[np.newaxis, i])
    return der

my_grad_2 = my_der_softmax_2(Z)
my_grad_2.shape
(1000, 5, 1000, 5)
mape(
    my_grad_2,
    grad_2
)
5.861268904258177e-11
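Since the cross-sample blocks are zero, we do not really need to materialize the full $(m \times n_{1}) \times (m \times n_{1})$ tensor. The sketch below (not part of the original notebook) keeps only the $m$ per-sample Jacobians in an array of shape $(m, n_{1}, n_{1})$:

def my_der_softmax_blocks(z: np.ndarray) -> np.ndarray:
    soft = my_softmax_2(z)                                    # (m, n1)
    diag = np.einsum('pq,qj->pqj', soft, np.eye(z.shape[1]))  # diag(sigma(z_p,:)) per sample
    outer = np.einsum('pq,pj->pqj', soft, soft)               # outer product per sample
    return diag - outer                                       # (m, n1, n1)

# each block equals the corresponding diagonal block of the full Jacobian
mape(my_der_softmax_blocks(Z)[0], my_grad_2[0, :, 0, :])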

Gradient using a loss function

We can often use properties of the loss function to optimize our gradients.
For any loss function

$$L: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}$$

we can compute the derivative using the chain rule

$$\frac{\partial L}{\partial z_{pq}} = \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \sigma_{ij}} \frac{\partial \sigma_{ij}}{\partial z_{pq}}$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.

Remark: We are going to focus on computing $\frac{\partial \sigma_{ij}}{\partial z_{pq}}$.

where

$$\frac{\partial \sigma_{ij}}{\partial z_{pq}} = \begin{cases} \sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } i=p, j=q \\ -\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{i,:})_{j} & \text{if } i=p, j \neq q \\ 0 & \text{otherwise} \end{cases}$$

therefore, since all terms with $i \neq p$ vanish,

$$\begin{align*} \frac{\partial L}{\partial z_{pq}} &= \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \sigma_{pj}} \frac{\partial \sigma_{pj}}{\partial z_{pq}} \\ &= \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \sigma_{pj}} \begin{cases} \sigma(\mathbf{z}_{p,:})_{q}(1 - \sigma(\mathbf{z}_{p,:})_{q}) & \text{if } j=q \\ -\sigma(\mathbf{z}_{p,:})_{q} \sigma(\mathbf{z}_{p,:})_{j} & \text{if } j \neq q \end{cases} \\ &= \sigma(\mathbf{z}_{p,:})_{q} \left( \frac{\partial L}{\partial \sigma_{pq}} - \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \sigma_{pj}} \sigma(\mathbf{z}_{p,:})_{j} \right) \end{align*}$$

in general

$$\frac{\partial L}{\partial \mathbf{Z}} = \mathbf{\Sigma} \odot \left( \frac{\partial L}{\partial \mathbf{\Sigma}} - \left( \frac{\partial L}{\partial \mathbf{\Sigma}} \odot \mathbf{\Sigma} \right) \mathbf{1} \right)$$

where $\mathbf{1} \in \mathbb{R}^{n_{1} \times n_{1}}$ is the matrix of ones.

example loss function

def loss_function(a: np.ndarray) -> np.ndarray:
    return np.sum(a ** 2)
loss_soft_grad = jacobian(lambda z: loss_function(my_softmax_2(z)))(Z)
loss_soft_grad.shape
(1000, 5)
loss_grad = jacobian(loss_function)(soft_out_2)
loss_grad.shape
(1000, 5)
def my_loss_soft_der(loss_grad: np.ndarray, soft: np.ndarray) -> np.ndarray:
    return soft * (loss_grad - np.sum(loss_grad * soft, axis=1, keepdims=True))

my_loss_soft_grad = my_loss_soft_der(loss_grad, soft_out_2)
my_loss_soft_grad.shape
(1000, 5)
mape(
    my_loss_soft_grad,
    loss_soft_grad
)
6.197734362268896e-07
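As a cross-check (a sketch, not part of the original notebook), we can apply the ones-matrix identity above literally, and also evaluate the chain-rule double sum as a tensor contraction of loss_grad with the full Jacobian grad_2 computed earlier:

ones = np.ones((CLASSES, CLASSES))
literal = soft_out_2 * (loss_grad - (loss_grad * soft_out_2) @ ones)
print(mape(literal, my_loss_soft_grad))

# the double sum over i and j is a contraction with the full Jacobian
chain = np.einsum('ij,ijpq->pq', loss_grad, grad_2)
print(mape(chain, loss_soft_grad))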

Underflow and Overflow

z_flow = Z[:5] * 100
z_flow.shape
(5, 5)
my_softmax_2(z_flow)
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/autograd/tracer.py:54: RuntimeWarning: overflow encountered in exp
  return f_raw(*args, **kwargs)
/tmp/ipykernel_2426/706199401.py:3: RuntimeWarning: invalid value encountered in divide
  return exp / np.sum(exp, axis=1, keepdims=True)
array([[ 0., nan, nan, nan,  0.],
       [ 0.,  0.,  0.,  0., nan],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0., nan,  0.,  0.,  0.],
       [ 0., nan,  0., nan,  0.]])

If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined.

Reference: Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, p. 81.

To fix this, we can subtract the maximum entry before applying the softmax.

$$\sigma(\mathbf{z}) = \sigma(\mathbf{z} - \max_{i} z_{i})$$

Analytically this does not change the result, because the softmax of $\mathbf{z}$ is equal to the softmax of $\mathbf{z} - \max_{i} z_{i}$.

my_softmax_2(z_flow - np.max(z_flow, axis=1, keepdims=True))
array([[0.00000000e+000, 7.17509597e-066, 1.38389653e-087, 1.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.00000000e+000],
       [1.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 0.00000000e+000, 3.69388307e-196, 0.00000000e+000]])
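We can also fold the shift into the softmax itself so callers do not need to remember it. Here is a sketch (not in the original notebook) of a numerically stable variant of my_softmax_2:

def my_softmax_stable(z: np.ndarray) -> np.ndarray:
    z_shifted = z - np.max(z, axis=1, keepdims=True)  # largest exponent becomes 0
    exp = np.exp(z_shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

mape(my_softmax_stable(Z), soft_out_2)  # agrees with the unshifted version on well-scaled inputs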

Appendix

when $r = q$:

$$\begin{align*} \frac{\partial \sigma_{q}}{\partial z_{r=q}} &= \frac{\exp(z_{q}) \left( \sum_{k=1}^{n_{1}}\exp(z_{k}) \right) - \exp(z_{q})^{2}}{\left( \sum_{k=1}^{n_{1}}\exp(z_{k}) \right)^{2}} \\ &= \frac{\exp(z_{q}) \left( \sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q}) \right)}{\left( \sum_{k=1}^{n_{1}}\exp(z_{k}) \right)^{2}} \\ &= \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \left( \frac{\sum_{k=1}^{n_{1}}\exp(z_{k}) - \exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \right) \\ &= \sigma(\mathbf{z})_{q} \left( \frac{\sum_{k=1}^{n_{1}}\exp(z_{k})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} - \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \right) \\ &= \sigma(\mathbf{z})_{q} (1 - \sigma(\mathbf{z})_{q}) \end{align*}$$

when $r \neq q$:

$$\begin{align*} \frac{\partial \sigma_{q}}{\partial z_{r \neq q}} &= - \frac{\exp(z_{q}) \exp(z_{r})}{\left( \sum_{k=1}^{n_{1}}\exp(z_{k}) \right)^{2}} \\ &= - \frac{\exp(z_{q})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \left( \frac{\exp(z_{r})}{\sum_{k=1}^{n_{1}}\exp(z_{k})} \right) \\ &= - \sigma(\mathbf{z})_{q} \sigma(\mathbf{z})_{r} \end{align*}$$
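As a quick numeric sanity check of these two cases (a sketch, not part of the original appendix), we can compare the closed forms with the autograd Jacobian grad_1 computed earlier:

s = my_softmax_1(Z[:1]).squeeze()
J = grad_1.squeeze()  # (n1, n1) Jacobian of the first sample
q, r = 1, 3           # arbitrary indices with r != q
print(np.isclose(J[q, q], s[q] * (1 - s[q])))  # r = q case
print(np.isclose(J[q, r], -s[q] * s[r]))       # r != q case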