
1.4 - Weight Decay

Let’s continue from multioutput linear regression. Now let’s incorporate $\ell_2$ regularization ($L_2$) into our model. $\ell_2$ regularization is a technique that prevents models from overfitting by penalizing large weight values.

Purpose of this Notebook:

  1. Create a dataset

  2. Incorporate $\ell_2$ regularization into our Perceptron from scratch

  3. Train our Perceptron

  4. Compare our Perceptron to the one prebuilt by PyTorch

🚨 This notebook is a copy of 1.3 - Multioutput Linear Regression. Only the parameter update is modified; the rest is unchanged.

Setup

print('Start package installation...')
Start package installation...
%%capture
%pip install torch
%pip install scikit-learn
print('Packages installed successfully!')
Packages installed successfully!
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__
('3.12.12', '2.9.0+cu128')
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device
'cpu'
torch.set_default_dtype(torch.float64)
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper
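The decorator above lets us attach methods to a class after it has been defined, which is how the `update` method is added later in this notebook. A minimal sketch of the pattern, using a hypothetical `Counter` class for illustration:

```python
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class Counter:
    def __init__(self):
        self.n = 0

# attach a new method to Counter after its definition
@add_to_class(Counter)
def increment(self):
    self.n += 1

c = Counter()
c.increment()
print(c.n)  # 1
```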

Dataset

create dataset

$$
\begin{align*}
\mathbf{X} &\in \mathbb{R}^{m \times n} \\
\mathbf{Y} &\in \mathbb{R}^{m \times n_{o}}
\end{align*}
$$
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)
(10100, 6)
(10100, 3)

split dataset

X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape
(torch.Size([100, 6]), torch.Size([100, 3]))
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape
(torch.Size([10000, 6]), torch.Size([10000, 3]))

delete raw dataset

del X
del Y

Scratch model

The only thing we are going to modify is the way in which the model weights are updated. The rest, such as parameter initialization and model training, remains unchanged.

Linear Regression model

class LinearRegression:
    def __init__(self, n_features: int, out_features: int, lambd: float):
        self.w = torch.randn(n_features, out_features, device=device)
        self.b = torch.randn(out_features, device=device)
        self.lambd = lambd

    def copy_params(self, torch_layer: torch.nn.modules.linear.Linear):
        """
        Copy the parameters from a module.linear to this model.

        Args:
            torch_layer: Pytorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight.T.detach().clone())

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """
        Predict the output for input x

        Args:
            x: Input tensor of shape (n_samples, n_features).

        Returns:
            y_pred: Predicted output tensor of shape (n_samples, out_features).
        """
        return torch.matmul(x, self.w) + self.b

    def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
        """
        MSE loss function between target y_true and y_pred.

        Args:
            y_true: Target tensor of shape (n_samples, out_features).
            y_pred: Predicted tensor of shape (n_samples, out_features).

        Returns:
            loss: MSE loss between predictions and true values.
        """
        return ((y_pred - y_true)**2).mean().item()

    def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
        """
        Evaluate the model on input x and target y_true using MSE.

        Args:
            x: Input tensor of shape (n_samples, n_features).
            y_true: Target tensor of shape (n_samples, out_features).

        Returns:
            loss: MSE loss between predictions and true values.
        """
        y_pred = self.predict(x)
        return self.mse_loss(y_true, y_pred)

    def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
        """
        Fit the model using gradient descent.
        
        Args:
            x_train: Input tensor of shape (n_samples, n_features).
            y_train: Target tensor of shape (n_samples, out_features).
            epochs: Number of epochs to fit.
            lr: learning rate.
            batch_size: Number of samples per batch.
            x_valid: Input tensor of shape (n_valid_samples, n_features).
            y_valid: Target tensor of shape (n_valid_samples, out_features)
        """
        for epoch in range(epochs):
            loss = []
            for batch in range(0, len(y_train), batch_size):
                end_batch = batch + batch_size

                y_pred = self.predict(x_train[batch:end_batch])

                loss.append(self.mse_loss(
                    y_train[batch:end_batch], 
                    y_pred
                ))

                self.update(
                    x_train[batch:end_batch], 
                    y_train[batch:end_batch], 
                    y_pred, 
                    lr
                )

            loss = round(sum(loss) / len(loss), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

Parameters update

objective function

Now, instead of training the model with the loss function $L$ alone, we are going to use the objective function $J$. Typically our objective function is as follows:

$$
J(\hat{\mathbf{Y}}, \mathbf{\theta}) = L(\hat{\mathbf{Y}}) + \text{regularization}(\mathbf{\theta})
$$

where $\mathbf{\theta}$ is an arbitrary parameter.

Note: Do not use the objective function to evaluate the model.
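A minimal sketch of this split, using hypothetical random tensors: the objective $J$ (data loss plus penalty) drives training, while evaluation reports the data loss $L$ alone.

```python
import torch

torch.manual_seed(0)
lambd = 0.01

y_true = torch.randn(5, 3)
y_pred = torch.randn(5, 3)
w = torch.randn(6, 3)

# training objective: data loss plus the regularization term
data_loss = ((y_pred - y_true) ** 2).mean()
objective = data_loss + lambd / 2 * (w ** 2).sum()

# evaluation uses the data loss alone; the penalty only inflates it
print(objective.item() >= data_loss.item())  # True
```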

L2 regularization

As a weight decay technique, we will use $\ell_2$ or $L_2$:

$$
\ell_2(\mathbf{\theta}) = \frac{\lambda}{2} \left\| \mathbf{\theta} \right\|^{2}_{2}
$$

where commonly $\mathbf{\theta} \in \mathbb{R}^{n}$.
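A quick numerical sketch of this penalty; the vector `theta` and the value of `lambd` here are arbitrary illustrations:

```python
import torch

lambd = 0.01
theta = torch.tensor([1.0, -2.0, 3.0])

# lambda/2 * ||theta||_2^2, written two equivalent ways
penalty_norm = lambd / 2 * theta.norm(p=2) ** 2
penalty_sum = lambd / 2 * (theta ** 2).sum()

# both evaluate to (0.01 / 2) * 14 = 0.07 (up to float rounding)
print(penalty_norm.item(), penalty_sum.item())
```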

🚨 Technically, weight decay and $\ell_{2}$ regularization differ in some special scenarios (e.g., with adaptive optimizers such as Adam), but for standard GD (gradient descent) they are mathematically equivalent.

Note: $\lambda \in \mathbb{R}$ is a hyperparameter, because it is a parameter set by the developer (you), not by the model.

But we have $\mathbf{W} \in \mathbb{R}^{n \times n_{o}}$, so we need an equivalent operation:

$$
\begin{align*}
\ell_2(\mathbf{W}) &= \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{j=1}^{n_{o}} w_{ij}^{2} \\
&= \frac{\lambda}{2} \text{sum} \left( \mathbf{W}^{2} \right)
\end{align*}
$$

where $\mathbf{A}^2$ is the element-wise power, $\mathbf{A}^2 = \mathbf{A} \odot \mathbf{A}$.
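We can sanity-check this equivalence numerically. The shape below mirrors our $\mathbf{W} \in \mathbb{R}^{6 \times 3}$, but the values are random:

```python
import torch

torch.manual_seed(0)
lambd = 0.01
W = torch.randn(6, 3)  # same shape as our weight matrix

# explicit double sum over entries vs. element-wise square then sum
pen_loops = lambd / 2 * sum(W[i, j] ** 2 for i in range(6) for j in range(3))
pen_sum = lambd / 2 * (W ** 2).sum()

print(torch.allclose(pen_loops, pen_sum))  # True
```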

Note: Typically, weight decay only affects the weights, not the bias.

objective function derivative

$$
\frac{\partial J}{\partial w_{rs}} = \frac{\partial L}{\partial w_{rs}} + \frac{\partial \ell_2}{\partial w_{rs}}
$$

$$
\begin{align*}
\frac{\partial \ell_2}{\partial w_{rs}} &= \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{j=1}^{n_{o}} \frac{\partial}{\partial w_{rs}} \left( w_{ij}^{2} \right) \\
&= \lambda \sum_{i=1}^{n} \sum_{j=1}^{n_{o}} w_{ij} \frac{\partial w_{ij}}{\partial w_{rs}} \\
&= \lambda \sum_{i=1}^{n} \sum_{j=1}^{n_{o}} w_{ij} \delta_{ir} \delta_{js} \\
&= \lambda \sum_{j=1}^{n_{o}} w_{rj} \delta_{js} \\
&= \lambda w_{rs}
\end{align*}
$$

for $r = 1, \ldots, n$ and $s = 1, \ldots, n_{o}$.

Vectorized form

$$
\frac{\partial \ell_2}{\partial \mathbf{W}} = \lambda \mathbf{W}
$$

Remark: $\nabla_{\mathbf{W}}\ell_2 \in \mathbb{R}^{n \times n_{o}}$.

$$
\begin{align*}
\frac{\partial J}{\partial \mathbf{W}} &= {\color{Orange} \frac{\partial L}{\partial \mathbf{W}}} + {\color{Cyan} \frac{\partial \ell_2}{\partial \mathbf{W}}} \\
&= {\color{Orange} \nabla_{\mathbf{W}}L} + {\color{Cyan} \lambda \mathbf{W}}
\end{align*}
$$
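As a sanity check of $\nabla_{\mathbf{W}}\ell_2 = \lambda \mathbf{W}$, we can compare the hand-derived gradient against autograd on a random matrix:

```python
import torch

torch.manual_seed(0)
lambd = 0.01
W = torch.randn(6, 3, requires_grad=True)

# l2 penalty and its gradient via autograd
penalty = lambd / 2 * (W ** 2).sum()
penalty.backward()

# autograd gradient should equal lambda * W
print(torch.allclose(W.grad, lambd * W.detach()))  # True
```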
@add_to_class(LinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters with L2 regularization.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples, out_features).
       y_pred: Predicted output tensor of shape (n_samples, out_features).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / y_true.numel()
    self.b -= lr * delta.sum(axis=0)
    self.w -= lr * (torch.matmul(x.T, delta) + self.lambd * self.w) # L2 regularization
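A minimal sketch verifying that this manual update matches one `torch.optim.SGD` step with `weight_decay`; the tensors and the values of `lr` and `lambd` below are arbitrary toy choices:

```python
import torch

torch.manual_seed(0)
lambd, lr = 0.01, 0.1

w_manual = torch.randn(4, 2)
w_torch = torch.nn.Parameter(w_manual.clone())

grad = torch.randn(4, 2)  # stand-in for the data-loss gradient

# manual step: w <- w - lr * (grad + lambda * w)
w_manual = w_manual - lr * (grad + lambd * w_manual)

# torch step: SGD's weight_decay adds lambda * w to the gradient
opt = torch.optim.SGD([w_torch], lr=lr, weight_decay=lambd)
w_torch.grad = grad.clone()
opt.step()

print(torch.allclose(w_manual, w_torch.detach()))  # True
```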

Scratch vs Torch.nn

Torch.nn model

class TorchLinearRegression(nn.Module):
    def __init__(self, n_features, n_out_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, n_out_features, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid, weight_decay):
        optimizer = torch.optim.SGD([
            {'params': self.layer.weight, 'weight_decay': weight_decay},
            {'params': self.layer.bias} # no weight_decay here: weight decay should not affect the bias.
        ], lr=lr)

        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
torch_model = TorchLinearRegression(N, NO)

scratch model

LAMBD: float = 0.01

model = LinearRegression(N, NO, LAMBD)
model.lambd
0.01

evals

import modified MAPE

# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.

prediction

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
50.58237220072394

copy parameters

model.copy_params(torch_model.layer)
parameters = (model.b.clone(), model.w.clone())

prediction after copy parameters

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
0.0

loss

mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid, Y_valid)
)
0.0

train

LR: float = 0.01 # learning rate
EPOCHS: int = 16 # number of epochs
BATCH: int = len(X_train) // 3 # batch size
torch_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid,
    LAMBD
)
epoch: 0 - MSE: 11756.956 - MSE_v: 14593.4043
epoch: 1 - MSE: 11206.5149 - MSE_v: 14001.1391
epoch: 2 - MSE: 10690.0031 - MSE_v: 13436.4135
epoch: 3 - MSE: 10204.3701 - MSE_v: 12897.5703
epoch: 4 - MSE: 9746.9354 - MSE_v: 12383.0972
epoch: 5 - MSE: 9315.3374 - MSE_v: 11891.609
epoch: 6 - MSE: 8907.4882 - MSE_v: 11421.833
epoch: 7 - MSE: 8521.5364 - MSE_v: 10972.5963
epoch: 8 - MSE: 8155.8345 - MSE_v: 10542.8148
epoch: 9 - MSE: 7808.9119 - MSE_v: 10131.4834
epoch: 10 - MSE: 7479.451 - MSE_v: 9737.6682
epoch: 11 - MSE: 7166.2671 - MSE_v: 9360.4988
epoch: 12 - MSE: 6868.2916 - MSE_v: 8999.1624
epoch: 13 - MSE: 6584.5572 - MSE_v: 8652.8982
epoch: 14 - MSE: 6314.185 - MSE_v: 8320.9925
epoch: 15 - MSE: 6056.374 - MSE_v: 8002.7747
model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 11756.956 - MSE_v: 14593.4043
epoch: 1 - MSE: 11206.5149 - MSE_v: 14001.1391
epoch: 2 - MSE: 10690.0031 - MSE_v: 13436.4135
epoch: 3 - MSE: 10204.3701 - MSE_v: 12897.5703
epoch: 4 - MSE: 9746.9354 - MSE_v: 12383.0972
epoch: 5 - MSE: 9315.3374 - MSE_v: 11891.609
epoch: 6 - MSE: 8907.4882 - MSE_v: 11421.833
epoch: 7 - MSE: 8521.5364 - MSE_v: 10972.5963
epoch: 8 - MSE: 8155.8345 - MSE_v: 10542.8148
epoch: 9 - MSE: 7808.9119 - MSE_v: 10131.4834
epoch: 10 - MSE: 7479.451 - MSE_v: 9737.6682
epoch: 11 - MSE: 7166.2671 - MSE_v: 9360.4988
epoch: 12 - MSE: 6868.2916 - MSE_v: 8999.1624
epoch: 13 - MSE: 6584.5572 - MSE_v: 8652.8982
epoch: 14 - MSE: 6314.185 - MSE_v: 8320.9925
epoch: 15 - MSE: 6056.374 - MSE_v: 8002.7747

predict after training

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
5.408162243536969e-16

bias

mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)
1.464570748352136e-16

weight

mape(
    model.w.clone(),
    torch_model.layer.weight.detach().T
)
1.151145916965978e-16