
Gradients and activation functions

In this notebook we explore the derivatives of different activation functions such as ReLU, Sigmoid and Tanh. We also explore the derivatives of the dense layer: some of them are already known, but one new derivative still needs to be worked out.

from autograd import jacobian, numpy as np

from platform import python_version
python_version()
'3.12.12'
# This cell imports np_mape (MAPE metric),
# whether you are running this notebook locally
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.numpy_metrics import np_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/numpy_metrics.py'
    local_file = 'numpy_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from numpy_metrics import np_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.

forward propagation

M: int = 100 # number of samples
N: int = 8 # number of input features
NO: int = 4 # number of output features
X = np.random.randn(M, N)
X.shape
(100, 8)

dense

parameters

$$
\begin{align*}
\mathbf{W}^{(k)} &\in \mathbb{R}^{n_{k-1} \times n_{k}} \\
\mathbf{b}^{(k)} &\in \mathbb{R}^{n_{k}}
\end{align*}
$$
bias: np.ndarray = np.random.randn(NO)
bias.shape
(4,)
weight: np.ndarray = np.random.randn(N, NO)
weight.shape
(8, 4)

weighted sum

$$
\mathbf{Z}^{(k)} (\mathbf{A}^{(k-1)}) = \mathbf{A}^{(k-1)} \mathbf{W}^{(k)} + \mathbf{b}^{(k)} \\
\mathbf{Z}^{(k)} : \mathbb{R}^{m \times n_{k-1}} \rightarrow \mathbb{R}^{m \times n_{k}}
$$
def weighted_sum(input: np.ndarray, weight: np.ndarray, 
                 bias: np.ndarray) -> np.ndarray:
    return np.matmul(input, weight) + bias

z = weighted_sum(X, weight, bias)
z.shape
(100, 4)

activation functions

For any activation function

$$
\mathbf{A}^{(k)} (\mathbf{Z}^{(k)}) = f(\mathbf{Z}^{(k)}) \\
\mathbf{A}^{(k)}: \mathbb{R}^{m \times n_{k}} \rightarrow \mathbb{R}^{m \times n_{k}}
$$

ReLU

$$
\text{ReLU}(z) = \max(z, 0) \in \mathbb{R}
$$

where $z \in \mathbb{R}$.

$$
\text{ReLU} (\mathbf{Z}^{(k)}) = \begin{bmatrix} \text{ReLU}(z_{11}^{(k)}) & \cdots & \text{ReLU}(z_{1n_{k}}^{(k)}) \\ \vdots & \ddots & \vdots \\ \text{ReLU}(z_{m1}^{(k)}) & \cdots & \text{ReLU}(z_{mn_{k}}^{(k)}) \end{bmatrix}
$$
def relu(z: np.ndarray) -> np.ndarray:
    return z * (z > 0)

relu_pred = relu(z)
relu_pred.shape
(100, 4)

Sigmoid

$$
\text{Sigmoid}(z) = \frac{1}{1 + \exp(-z)}
$$

$$
\text{Sigmoid} (\mathbf{Z}^{(k)}) = \begin{bmatrix} \text{Sigmoid}(z_{11}^{(k)}) & \cdots & \text{Sigmoid}(z_{1n_{k}}^{(k)}) \\ \vdots & \ddots & \vdots \\ \text{Sigmoid}(z_{m1}^{(k)}) & \cdots & \text{Sigmoid}(z_{mn_{k}}^{(k)}) \end{bmatrix}
$$
def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))

sigmoid_pred = sigmoid(z)
sigmoid_pred.shape
(100, 4)

Tanh

$$
\tanh(z) = \frac{1 - \exp(-2 z)}{1 + \exp(-2 z)}
$$

$$
\tanh (\mathbf{Z}^{(k)}) = \begin{bmatrix} \tanh(z_{11}^{(k)}) & \cdots & \tanh(z_{1n_{k}}^{(k)}) \\ \vdots & \ddots & \vdots \\ \tanh(z_{m1}^{(k)}) & \cdots & \tanh(z_{mn_{k}}^{(k)}) \end{bmatrix}
$$
def tanh(z: np.ndarray) -> np.ndarray:
    exp = np.exp(-2 * z)
    return (1 - exp) / (1 + exp)

tanh_pred = tanh(z)
tanh_pred.shape
(100, 4)

gradients

activation function gradients

For any activation function $\mathbf{A}$, its derivative with respect to the weighted sum input is

$$
\frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}} \in \mathbb{R}^{(m \times n_{k}) \times (m \times n_{k})}
$$

ReLU derivative

relu_grad = jacobian(relu)(z)
relu_grad.shape
(100, 4, 100, 4)
$$
\frac{\mathrm{d} \max(z,0)}{\mathrm{d} z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}
$$

$$
\frac{\partial a_{pq}^{(k)} }{\partial z_{ij}^{(k)}} = \begin{cases} \frac{\partial a_{ij}^{(k)}} {\partial z_{ij}^{(k)}} & \text{if } p=i, q=j \\ 0 & \text{otherwise} \end{cases}
$$

for all $p, i = 1, \ldots, m$ and $q, j = 1, \ldots, n_{k}$.

$$
\frac{\partial a_{ij}^{(k)}} {\partial z_{ij}^{(k)}} = \begin{cases} 1 & \text{if } z_{ij}^{(k)} > 0 \\ 0 & \text{if } z_{ij}^{(k)} \leq 0 \end{cases}
$$

Since the output of ReLU is positive exactly when its input is positive, we can express this in terms of the activation output itself:

$$
\frac{\partial a_{ij}^{(k)}} {\partial z_{ij}^{(k)}} = \begin{cases} 1 & \text{if } a_{ij}^{(k)} > 0 \\ 0 & \text{if } a_{ij}^{(k)} \leq 0 \end{cases}
$$
def relu_jacobian(relu_out: np.ndarray) -> np.ndarray:
    m, n = relu_out.shape # m samples, n features
    out = np.zeros((m, n, m, n))
    for i in range(m):
        for j in range(n):
            out[i, j, i, j] = 1 if relu_out[i, j] > 0 else 0
    return out

my_relu_grad = relu_jacobian(relu_pred)
my_relu_grad.shape
(100, 4, 100, 4)
mape(
    my_relu_grad,
    relu_grad
)
0.0

Sigmoid gradient

sigmoid_grad = jacobian(sigmoid)(z)
sigmoid_grad.shape
(100, 4, 100, 4)
$$
\frac{\mathrm{d} \text{Sigmoid}}{\mathrm{d} z} = \text{Sigmoid}(z) \left( 1 - \text{Sigmoid}(z) \right)
$$

see Appendix.

$$
\frac{\partial a_{pq}^{(k)}} {\partial z_{ij}^{(k)}} = \begin{cases} a_{ij}^{(k)} (1 - a_{ij}^{(k)}) & \text{if } p = i, q = j \\ 0 & \text{otherwise} \end{cases}
$$
def sigmoid_jacobian(sigm_out: np.ndarray) -> np.ndarray:
    m, n = sigm_out.shape # m samples, n features
    out = np.zeros((m, n, m, n))
    for i in range(m):
        for j in range(n):
            out[i, j, i, j] = sigm_out[i, j] * (1 - sigm_out[i, j])
    return out

my_sigmoid_grad = sigmoid_jacobian(sigmoid_pred)
my_sigmoid_grad.shape
(100, 4, 100, 4)
mape(
    my_sigmoid_grad,
    sigmoid_grad
)
1.491424107779037e-15

Tanh gradient

$$
\frac{\mathrm{d} \tanh}{\mathrm{d} z} = 1 - \tanh^{2}(z)
$$

$$
\frac{\partial a_{pq}^{(k)}} {\partial z_{ij}^{(k)}} = \begin{cases} 1 - (a_{ij}^{(k)})^{2} & \text{if } p = i, q = j \\ 0 & \text{otherwise} \end{cases}
$$
tanh_grad = jacobian(tanh)(z)
tanh_grad.shape
(100, 4, 100, 4)
def tanh_jacobian(tanh_out: np.ndarray) -> np.ndarray:
    m, n = tanh_out.shape # m samples, n features
    out = np.zeros((m, n, m, n))
    for i in range(m):
        for j in range(n):
            out[i, j, i, j] = 1 - tanh_out[i, j] ** 2
    return out

my_tanh_grad = tanh_jacobian(tanh_pred)
my_tanh_grad.shape
(100, 4, 100, 4)
mape(
    my_tanh_grad,
    tanh_grad
)
3.487132308931419e-12

gradients with loss function

Our goal is to compute

$$
\frac{\partial L}{\partial \mathbf{A}^{(k)}}, \quad \frac{\partial L}{\partial \mathbf{W}^{(k)}}, \quad \frac{\partial L}{\partial \mathbf{b}^{(k)}}
$$

loss function

For any loss function $L$, for example

$$
L(\hat{\mathbf{Y}}) = \sum_{i=1}^{m} \sum_{j=1}^{n_{\text{o}}} \hat{y}_{ij}^{2}
$$
def loss(y_pred: np.ndarray) -> float:
    return np.sum(y_pred ** 2)
$$
\frac{\partial L} {\partial \hat{\mathbf{Y}}} = 2 \hat{\mathbf{Y}}
$$
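As a quick check (a minimal sketch, using relu_pred from above as a stand-in for $\hat{\mathbf{Y}}$), the autograd gradient of loss should match $2 \hat{\mathbf{Y}}$ elementwise:

# sketch: compare the autograd gradient of the loss with 2 * Y_hat
delta_check = jacobian(loss)(relu_pred)
mape(delta_check, 2 * relu_pred)  # expected to be ~0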

activation function gradients

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}} &= {\color{Orange} {\frac{\partial L}{\partial \mathbf{A}^{(k)}}}} {\color{Cyan} {\frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}}}} \\
&= {\color{Orange} {\mathbf{\Delta}}} {\color{Cyan} {\frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}}}}
\end{align*}
$$

ReLU derivative

loss_relu_grad = jacobian(lambda z: loss(relu(z)))(z)
loss_relu_grad.shape
(100, 4)
delta_relu = jacobian(loss)(relu_pred)
delta_relu.shape
(100, 4)
$$
\frac{\partial L}{\partial z^{(k)}_{pq}} = \sum_{i=1}^{m} \sum_{j=1}^{n_{k}} \frac{\partial L}{\partial a_{ij}^{(k)}} \frac{\partial a_{ij}^{(k)}}{\partial z_{pq}^{(k)}}
$$

For the case $i=p$ and $j=q$:

$$
\frac{\partial a_{pq}^{(k)}} {\partial z_{pq}^{(k)}} = \begin{cases} 1 & \text{if } z_{pq}^{(k)} > 0 \\ 0 & \text{if } z_{pq}^{(k)} \leq 0 \end{cases}
$$

For the case $i \neq p$ or $j \neq q$:

$$
\frac{\partial a_{ij}^{(k)}} {\partial z_{pq}^{(k)}} = 0
$$

therefore

$$
\begin{align*}
\frac{\partial L}{\partial z^{(k)}_{pq}} &= \frac{\partial L}{\partial a_{pq}^{(k)}} \frac{\partial a_{pq}^{(k)}}{\partial z_{pq}^{(k)}} \\
&= \delta_{pq} \begin{cases} 1 & \text{if } a_{pq}^{(k)} > 0 \\ 0 & \text{if } a_{pq}^{(k)} \leq 0 \end{cases}
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{k}$.

in general

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}} &= \frac{\partial L}{\partial \mathbf{A}^{(k)}} \odot \frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}} \\
&= \mathbf{\Delta} \odot \left( \mathbf{A}^{(k)} > 0 \right)
\end{align*}
$$

Note: remember that we can reuse the activation output already computed in forward propagation for backpropagation.

def loss_relu_der(delta: np.ndarray, relu_out: np.ndarray) -> np.ndarray:
    return np.multiply(delta, 1 * (relu_out > 0))

my_loss_relu_grad = loss_relu_der(delta_relu, relu_pred)
my_loss_relu_grad.shape
(100, 4)
mape(
    my_loss_relu_grad,
    loss_relu_grad
)
0.0
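As a cross-check on the general chain rule above (a sketch, not part of the original derivation), contracting $\mathbf{\Delta}$ with the full $(m \times n_{k}) \times (m \times n_{k})$ ReLU Jacobian computed earlier gives the same matrix as the elementwise form:

# sketch: dL/dZ as an explicit tensor contraction of delta with the full Jacobian
contracted = np.tensordot(delta_relu, relu_grad, axes=([0, 1], [0, 1]))
mape(contracted, loss_relu_grad)  # expected to be ~0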

Sigmoid

loss_sigmoid_grad = jacobian(lambda z: loss(sigmoid(z)))(z)
loss_sigmoid_grad.shape
(100, 4)
delta_sigmoid = jacobian(loss)(sigmoid_pred)
delta_sigmoid.shape
(100, 4)
$$
\frac{\partial L}{\partial z^{(k)}_{pq}} = \sum_{i=1}^{m} \sum_{j=1}^{n_{k}} \frac{\partial L}{\partial a_{ij}^{(k)}} \frac{\partial a_{ij}^{(k)}}{\partial z_{pq}^{(k)}}
$$

For the case $i=p$ and $j=q$:

$$
\frac{\partial a_{pq}^{(k)}} {\partial z_{pq}^{(k)}} = a_{pq}^{(k)} \left( 1 - a_{pq}^{(k)} \right)
$$

For the case $i \neq p$ or $j \neq q$:

$$
\frac{\partial a_{ij}^{(k)}} {\partial z_{pq}^{(k)}} = 0
$$

therefore

$$
\begin{align*}
\frac{\partial L}{\partial z^{(k)}_{pq}} &= \frac{\partial L}{\partial a_{pq}^{(k)}} \frac{\partial a_{pq}^{(k)}}{\partial z_{pq}^{(k)}} \\
&= \delta_{pq} \left( a_{pq}^{(k)} \left( 1 - a_{pq}^{(k)} \right) \right)
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{k}$.

in general

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}} &= \frac{\partial L}{\partial \mathbf{A}^{(k)}} \odot \frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}} \\
&= \mathbf{\Delta} \odot \mathbf{A}^{(k)} \odot \left( \mathbf{1} - \mathbf{A}^{(k)} \right)
\end{align*}
$$
def loss_sigmoid_der(delta: np.ndarray, sigm_out: np.ndarray) -> np.ndarray:
    return np.multiply(delta, sigm_out * (1 - sigm_out))

my_loss_sigmoid_grad = loss_sigmoid_der(delta_sigmoid, sigmoid_pred)
my_loss_sigmoid_grad.shape
(100, 4)
mape(
    my_loss_sigmoid_grad,
    loss_sigmoid_grad
)
5.967547612307269e-13

Tanh

loss_tanh_grad = jacobian(lambda z: loss(tanh(z)))(z)
loss_tanh_grad.shape
(100, 4)
delta_tanh = jacobian(loss)(tanh_pred)
delta_tanh.shape
(100, 4)
$$
\frac{\partial L}{\partial z^{(k)}_{pq}} = \sum_{i=1}^{m} \sum_{j=1}^{n_{k}} \frac{\partial L}{\partial a_{ij}^{(k)}} \frac{\partial a_{ij}^{(k)}}{\partial z_{pq}^{(k)}}
$$

For the case $i=p$ and $j=q$:

$$
\frac{\partial a_{pq}^{(k)}} {\partial z_{pq}^{(k)}} = 1 - (a_{pq}^{(k)})^{2}
$$

For the case $i \neq p$ or $j \neq q$:

$$
\frac{\partial a_{ij}^{(k)}} {\partial z_{pq}^{(k)}} = 0
$$

therefore

$$
\begin{align*}
\frac{\partial L}{\partial z^{(k)}_{pq}} &= \frac{\partial L}{\partial a_{pq}^{(k)}} \frac{\partial a_{pq}^{(k)}}{\partial z_{pq}^{(k)}} \\
&= \delta_{pq} \left( 1 - (a_{pq}^{(k)})^{2} \right)
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{k}$.

in general

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}} &= \frac{\partial L}{\partial \mathbf{A}^{(k)}} \odot \frac{\partial \mathbf{A}^{(k)}} {\partial \mathbf{Z}^{(k)}} \\
&= \mathbf{\Delta} \odot \left( \mathbf{1} - \mathbf{A}^{(k)} \odot \mathbf{A}^{(k)} \right)
\end{align*}
$$
def loss_tanh_der(delta: np.ndarray, tanh_out: np.ndarray) -> np.ndarray:
    return np.multiply(delta, (1 - tanh_out ** 2))

my_loss_tanh_grad = loss_tanh_der(delta_tanh, tanh_pred)
my_loss_tanh_grad.shape
(100, 4)
mape(
    my_loss_tanh_grad,
    loss_tanh_grad
)
2.3315927023923187e-09

dense gradients

In this section we only use the ReLU function, to avoid repeating the same code for each activation function; besides, Sigmoid and Tanh suffer from vanishing gradients.

From here on, $\mathbf{\Delta}$ denotes the gradient of the loss with respect to the weighted sum:

$$
{\color{Cyan} {\frac{\partial L} {\partial \mathbf{Z}^{(k)}}}} = {\color{Cyan} {\mathbf{\Delta}}}
$$

bias gradient

bias_relu_grad = jacobian(lambda b: loss(relu(weighted_sum(X, weight, b))))(bias)
bias_relu_grad.shape
(4,)
$$
\frac{\partial L}{\partial \mathbf{b}^{(k)}} = \mathbf{1}^{\top} \mathbf{\Delta}
$$

where $\mathbf{1} \in \mathbb{R}^{m}$ is a vector of ones.

my_bias_relu_grad = np.sum(my_loss_relu_grad, axis=0)
my_bias_relu_grad.shape
(4,)
mape(
    my_bias_relu_grad,
    bias_relu_grad
)
0.0
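Equivalently (a small sketch), computing the matrix form $\mathbf{1}^{\top} \mathbf{\Delta}$ with an explicit vector of ones gives the same result as the axis-0 sum:

# sketch: explicit ones-vector form of the bias gradient
ones_m = np.ones(M)
mape(ones_m @ my_loss_relu_grad, bias_relu_grad)  # expected to be ~0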

weight gradient

weight_relu_grad = jacobian(lambda w: loss(relu(weighted_sum(X, w, bias))))(weight)
weight_relu_grad.shape
(8, 4)
$$
\frac{\partial L}{\partial \mathbf{W}^{(k)}} = (\mathbf{A}^{(k-1)})^{\top} \mathbf{\Delta}
$$

where $\mathbf{A}^{(k-1)} \in \mathbb{R}^{m \times n_{k-1}}$ is the layer input (here $\mathbf{X}$).

my_weight_relu_grad = X.T @ my_loss_relu_grad
my_weight_relu_grad.shape
(8, 4)
mape(
    my_weight_relu_grad,
    weight_relu_grad
)
0.0

input gradient

input_relu_grad = jacobian(lambda x: loss(relu(weighted_sum(x, weight, bias))))(X)
input_relu_grad.shape
(100, 8)
$$
\frac{\partial L}{\partial a_{pq}^{(k-1)}} = \sum_{i=1}^{m} \sum_{j=1}^{n_{k}} \frac{\partial L}{\partial z_{ij}^{(k)}} \frac{\partial z_{ij}^{(k)}}{\partial a_{pq}^{(k-1)}}
$$

$$
\frac{\partial z_{ij}^{(k)}}{\partial a_{pq}^{(k-1)}} = \begin{cases} w_{qj}^{(k)} & \text{if } i=p \\ 0 & \text{if } i \neq p \end{cases}
$$

therefore

$$
\begin{align*}
\frac{\partial L}{\partial a_{pq}^{(k-1)}} &= \sum_{j=1}^{n_{k}} \frac{\partial L}{\partial z_{pj}^{(k)}} \frac{\partial z_{pj}^{(k)}}{\partial a_{pq}^{(k-1)}} \\
&= \delta_{p,:} \left( w_{q,:}^{(k)} \right)^{\top}
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{k-1}$.

in general

$$
\frac{\partial L}{\partial \mathbf{A}^{(k-1)}} = \mathbf{\Delta} \left( \mathbf{W}^{(k)} \right)^{\top}
$$
my_input_relu_grad = my_loss_relu_grad @ weight.T
my_input_relu_grad.shape
(100, 8)
mape(
    my_input_relu_grad,
    input_relu_grad
)
0.0

Vanishing Gradients

The vanishing gradient problem is a phenomenon that occurs during training: the gradients used to update the network become very small, making it difficult for the network to update its weights effectively. This is a numerical stability problem.
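For intuition (a minimal sketch, not part of the plots below; stacked_sigmoid is a hypothetical helper), composing the sigmoid defined above several times shows how the gradient shrinks geometrically with depth, since the sigmoid derivative is at most 0.25:

# sketch: the gradient through a chain of sigmoids shrinks geometrically with depth
def stacked_sigmoid(z, depth):
    for _ in range(depth):
        z = sigmoid(z)
    return np.sum(z)

for depth in (1, 3, 6, 9):
    grad = jacobian(lambda v: stacked_sigmoid(v, depth))(np.zeros(1))
    print(depth, grad)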

from matplotlib import pyplot as plt
x = np.arange(-8.0, 8.1, 0.1)
x.shape
(161,)
ones = np.ones_like(x)

ReLU

relu_out = relu(x)
relu_der = loss_relu_der(ones, relu_out)
plt.plot(x, relu_out, label='ReLU')
plt.plot(x, relu_der, label='gradient')
plt.grid(True)
plt.legend()
plt.show()
<Figure size 640x480 with 1 Axes>

Sigmoid

soft_out = sigmoid(x)
soft_der = loss_sigmoid_der(ones, soft_out)
plt.plot(x, soft_out, label='sigmoid')
plt.plot(x, soft_der, label='gradient')
plt.grid(True)
plt.legend()
plt.show()
<Figure size 640x480 with 1 Axes>

Tanh

tanh_out = tanh(x)
tanh_der = loss_tanh_der(ones, tanh_out)
plt.plot(x, tanh_out, label='tanh')
plt.plot(x, tanh_der, label='gradient')
plt.grid(True)
plt.legend()
plt.show()
<Figure size 640x480 with 1 Axes>

We can observe that the Sigmoid and Tanh gradients tend towards 0 when the input is large in magnitude, i.e. the functions saturate at the edges of the range. Of these functions, ReLU is the most numerically stable.
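Numerically (a quick sketch at the edges of the plotted range; the values in the comments are approximate):

# sketch: gradient magnitude at z = -8 and z = 8 for each activation
edges = np.array([-8.0, 8.0])
print(loss_relu_der(np.ones(2), relu(edges)))        # [0. 1.]
print(loss_sigmoid_der(np.ones(2), sigmoid(edges)))  # both about 3.4e-04
print(loss_tanh_der(np.ones(2), tanh(edges)))        # both about 4.5e-07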

Appendix

$$
\begin{align*}
\frac{\mathrm{d} \text{Sigmoid}}{\mathrm{d} z} &= \frac{\exp(-z)}{\left(1 + \exp(-z) \right)^{2}} \\
&= \frac{1}{1 + \exp(-z)} \left( \frac{\exp(-z)}{1 + \exp(-z)} \right) \\
&= \frac{1}{1 + \exp(-z)} \left( \frac{1 + \exp(-z)}{1 + \exp(-z)} - \frac{1}{1 + \exp(-z)} \right) \\
&= \frac{1}{1 + \exp(-z)} \left( 1 - \frac{1}{1 + \exp(-z)} \right) \\
&= \text{Sigmoid}(z) \left(1 - \text{Sigmoid}(z) \right)
\end{align*}
$$
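As a quick numerical check of this identity (a sketch, reusing the functions defined earlier):

# sketch: verify Sigmoid'(z) = Sigmoid(z)(1 - Sigmoid(z)) at a few points
zs = np.array([-2.0, 0.0, 3.0])
analytic = sigmoid(zs) * (1 - sigmoid(zs))
numeric = jacobian(lambda v: np.sum(sigmoid(v)))(zs)
mape(numeric, analytic)  # expected to be ~0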