
1.2 - Multivariate Linear Regression

Now we are going to increase the complexity: instead of the perceptron having a single output, it will now have multiple outputs. The word “multivariate” usually means that the model receives multiple inputs, but here we use it to mean that the perceptron has multiple outputs.

We can think of the multivariate perceptron as a layer of simple perceptrons, where each perceptron's output corresponds to one output feature.

Purpose of this Notebook:

In this notebook we will:

  1. Create a dataset for a multivariate linear regression task

  2. Create our own Multivariate Perceptron class from scratch

  3. Compute the gradients for gradient descent from scratch

  4. Train our Multivariate Perceptron

  5. Compare our Perceptron to the one prebuilt by PyTorch

  6. [Extra] Compute the gradients another way (using Einstein summation)

import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__
('3.12.12', '2.9.0+cu128')
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device
'cpu'
torch.set_default_dtype(torch.float64)
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper
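
As a quick illustration (not part of the pipeline below), here is how the decorator is used. `Demo` and `greet` are hypothetical names chosen just for this sketch.

class Demo:
    pass

@add_to_class(Demo)
def greet(self):
    # after decoration, greet is registered as a method of Demo
    return 'hello'

Demo().greet()  # -> 'hello'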

Dataset

create dataset

$$
\begin{align*}
\mathbf{X} &\in \mathbb{R}^{m \times n} \\
\mathbf{Y} &\in \mathbb{R}^{m \times n_{1}}
\end{align*}
$$

where $n_{1}$ is the number of output features.

$$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
$$

$$
\mathbf{Y} = \begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1n_{1}} \\
y_{21} & y_{22} & \cdots & y_{2n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m1} & y_{m2} & \cdots & y_{mn_{1}}
\end{bmatrix}
$$
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)
(10100, 6)
(10100, 3)

split dataset

X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape
(torch.Size([100, 6]), torch.Size([100, 3]))
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape
(torch.Size([10000, 6]), torch.Size([10000, 3]))

delete raw dataset

del X
del Y

Scratch model

weights and bias

trainable parameters

$$
\begin{align*}
\mathbf{W} &\in \mathbb{R}^{n \times n_{1}} \\
\mathbf{b} &\in \mathbb{R}^{n_{1}}
\end{align*}
$$

$$
\mathbf{W} = \begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1n_{1}} \\
w_{21} & w_{22} & \cdots & w_{2n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n1} & w_{n2} & \cdots & w_{nn_{1}}
\end{bmatrix}
$$

$$
\mathbf{b} = \begin{bmatrix}
b_{1} \\ b_{2} \\ \vdots \\ b_{n_{1}}
\end{bmatrix}
$$
class LinearRegression:
    def __init__(self, n_features: int, out_features: int):
        self.w = torch.randn(n_features, out_features, device=device)
        self.b = torch.randn(out_features, device=device)

    def copy_params(self, torch_layer: torch.nn.modules.linear.Linear):
        """
        Copy the parameters from a module.linear to this model.

        Args:
            torch_layer: Pytorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight.T.detach().clone())

weighted sum

$$
\begin{align*}
\mathbf{\hat{Y}}(\mathbf{X}) &= \mathbf{X}\mathbf{W} + \mathbf{b} \\
\mathbf{\hat{Y}} &: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n_{1}}
\end{align*}
$$

where

$$
\hat{y}_{ij} = \mathbf{x}_{i}^\top \mathbf{w}_{:,j} + b_{j}
$$

for all $i = 1, \ldots, m$ and $j = 1, \ldots, n_{1}$.

@add_to_class(LinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x

    Args:
        x: Input tensor of shape (n_samples, n_features).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples, out_features).
    """
    return torch.matmul(x, self.w) + self.b

MSE

Mean Squared Error

$$
\begin{align*}
L(\mathbf{\hat{Y}}) &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij})^{2} \\
L &: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}
\end{align*}
$$

Vectorized form

$$
L(\mathbf{\hat{Y}}) = \frac{1}{mn_{1}} \operatorname{sum} \left( \left( \mathbf{\hat{Y}} - \mathbf{Y} \right)^2 \right)
$$

where $\mathbf{A}^2$ denotes the element-wise power, that is, $\mathbf{A}^2 = \mathbf{A} \odot \mathbf{A}$.
Note: $\odot$ is the element-wise product, also called the Hadamard product.
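
As a small sanity check (illustrative only; `A` and `B` are arbitrary tensors), the vectorized form matches the element-wise definition:

A = torch.randn(4, 3)
B = torch.randn(4, 3)
# (1 / (m * n1)) * sum((A - B) ⊙ (A - B)) equals the mean of the squared differences
torch.allclose(((A - B) * (A - B)).sum() / (A - B).numel(), ((A - B) ** 2).mean())  # expected: True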

@add_to_class(LinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples, out_features).
        y_pred: Predicted tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(LinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples, n_features).
        y_true: Target tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

compute gradients

There are two ways to compute gradients

  1. Computing each derivative individually and then joining them using the Einstein summation.

  2. Computing an initial derivative and passing it backwards as an argument.

Method 2 is the most common because it is easier to reason about and computationally cheaper, whereas method 1 requires materializing much larger intermediate tensors. We prefer method 2, but we will also use method 1 at the end for comparison.

MSE derivative

$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial \hat{y}_{pq}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}}
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} = \begin{cases} 1 & \text{if } i=p, j=q \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} \\
&= \frac{2}{mn_{1}} (\hat{y}_{pq} - y_{pq})
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{\hat{Y}}} = \frac{2}{mn_{1}} \left( \mathbf{\hat{Y}} - \mathbf{Y} \right)
$$
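
As a quick sketch (illustrative names, small random tensors), we can check this formula against PyTorch autograd:

Y_hat_check = torch.randn(5, 3, requires_grad=True)
Y_check = torch.randn(5, 3)
loss = ((Y_hat_check - Y_check) ** 2).mean()  # same MSE as above
loss.backward()
analytic = 2 * (Y_hat_check.detach() - Y_check) / Y_check.numel()  # 2/(m*n1) * (Y_hat - Y)
torch.allclose(Y_hat_check.grad, analytic)  # expected: True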

weighted sum derivative

with respect to the bias
$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial b_{p}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial b_{p}}
\end{align*}
$$

for all $p = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} = \begin{cases} 1 & \text{if } j=p \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{y}_{ip}}
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{b}} = \mathbf{1}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$

where $\mathbf{1} \in \mathbb{R}^{m}$ is a vector of ones.

with respect to the weights
$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial w_{pq}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial w_{pq}}
\end{align*}
$$

for all $p = 1, \ldots, n$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} = \begin{cases} x_{ip} & \text{if } j=q \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} x_{ip} \frac{\partial L}{\partial \hat{y}_{iq}} \\
&= (\mathbf{x}_{:,p})^\top \frac{\partial L}{\partial \hat{\mathbf{y}}_{:,q}} \\
&= \mathbf{x}^\top_{p,:} \frac{\partial L}{\partial \hat{\mathbf{y}}_{:,q}}
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$

gradients

$$
\begin{align*}
\nabla_{\mathbf{b}}L = \frac{\partial L}{\partial \mathbf{b}} &= {\color{Cyan}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Orange}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&= {\color{Cyan}{\frac{2}{mn_{1}}}} {\color{Orange}{\mathbf{1}^\top}} {\color{Cyan}{\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$

and

$$
\begin{align*}
\nabla_{\mathbf{W}}L = \frac{\partial L}{\partial \mathbf{W}} &= {\color{Cyan}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Magenta}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&= {\color{Cyan}{\frac{2}{mn_{1}}}} {\color{Magenta}{\mathbf{X}^\top}} {\color{Cyan}{\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$
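
Both closed-form gradients can be checked against autograd; here is a minimal sketch on small random tensors (all names below are illustrative):

m_c, n_c, n1_c = 8, 4, 3
X_c = torch.randn(m_c, n_c)
Y_c = torch.randn(m_c, n1_c)
W_c = torch.randn(n_c, n1_c, requires_grad=True)
b_c = torch.randn(n1_c, requires_grad=True)
loss = ((X_c @ W_c + b_c - Y_c) ** 2).mean()
loss.backward()
delta = 2 * (X_c @ W_c.detach() + b_c.detach() - Y_c) / Y_c.numel()  # dL/dY_hat
print(torch.allclose(b_c.grad, delta.sum(dim=0)))  # 1^T dL/dY_hat -> expected: True
print(torch.allclose(W_c.grad, X_c.T @ delta))     # X^T dL/dY_hat -> expected: True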

Parameter update

@add_to_class(LinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples, out_features).
       y_pred: Predicted output tensor of shape (n_samples, out_features).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / y_true.numel()
    self.b -= lr * delta.sum(axis=0)
    self.w -= lr * torch.matmul(x.T, delta)

fit (train)

@add_to_class(LinearRegression)
def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x_train: Input tensor of shape (n_samples, n_features).
        y_train: Target tensor of shape (n_samples, out_features).
        epochs: Number of epochs to train for.
        lr: Learning rate.
        batch_size: Number of samples per mini-batch.
        x_valid: Input tensor of shape (n_valid_samples, n_features).
        y_valid: Target tensor of shape (n_valid_samples, out_features).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y_train), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x_train[batch:end_batch])

            loss.append(self.mse_loss(
                y_train[batch:end_batch], 
                y_pred
            ))

            self.update(
                x_train[batch:end_batch], 
                y_train[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

Scratch vs Torch.nn

Torch.nn model

class TorchLinearRegression(nn.Module):
    def __init__(self, n_features, n_out_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, n_out_features, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
torch_model = TorchLinearRegression(N, NO)

scratch model

model = LinearRegression(N, NO)

evals

import modified MAPE

# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.

predict

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
2012.6907730318287

copy parameters

model.copy_params(torch_model.layer)
parameters = (model.b.clone(), model.w.clone())

predict after copy parameters

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
0.0

loss

mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid, Y_valid)
)
0.0

train

LR = 0.01 # learning rate
EPOCHS = 16 # number of epochs
BATCH = len(X_train) // 3 # batch size
torch_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086
model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086

predict after training

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
6.41623922729972e-14

weight

mape(
    model.w.clone(),
    torch_model.layer.weight.detach().T
)
1.1964026011221981e-14

bias

mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)
9.992714588903911e-14

Compute gradients with einsum

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &= \frac{\partial L}{\partial \mathbf{\hat{Y}}} \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} \\
\frac{\partial L}{\partial \mathbf{b}} &= \frac{\partial L}{\partial \mathbf{\hat{Y}}} \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}
\end{align*}
$$

where their shapes are

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &\in \mathbb{R}^{n \times n_{1}} \\
\frac{\partial L}{\partial \mathbf{b}} &\in \mathbb{R}^{n_{1}} \\
\frac{\partial L}{\partial \mathbf{\hat{Y}}} &\in \mathbb{R}^{m \times n_{1}} \\
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} &\in \mathbb{R}^{(m \times n_{1}) \times (n \times n_{1})} \\
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}} &\in \mathbb{R}^{(m \times n_{1}) \times (n_{1})}
\end{align*}
$$

Note that $\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}$ has four axes; this is one reason why this method requires more computation.
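
To see those four axes concretely, here is a sketch (illustrative names, tiny problem) that builds the full Jacobian with torch.autograd.functional.jacobian:

from torch.autograd.functional import jacobian

X_j = torch.randn(5, 4)   # m=5, n=4
W_j = torch.randn(4, 3)   # n1=3
b_j = torch.randn(3)
# Jacobian of the weighted sum with respect to W
J = jacobian(lambda W: X_j @ W + b_j, W_j)
J.shape  # torch.Size([5, 3, 4, 3]) -> (m, n1, n, n1)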

weighted sum derivative with respect to the bias

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} = \begin{cases} 1 & \text{if } j=p \\ 0 & \text{if } j\neq p \end{cases}
$$

for all $i = 1, \ldots, m$ and $j, p = 1, \ldots, n_{1}$.

weighted sum derivative with respect to the weights

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} = \begin{cases} x_{ip} & \text{if } j=q \\ 0 & \text{if } j\neq q \end{cases}
$$

for all $i = 1, \ldots, m$, $j, q = 1, \ldots, n_{1}$ and $p = 1, \ldots, n$.

Vectorized form

$$
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} = \mathbb{I} \otimes \mathbf{X}
$$

where $\otimes$ is the Kronecker product.
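
As a quick reminder of how the Kronecker product behaves (illustrative only), torch.kron of an identity with a small matrix produces a block-diagonal result whose shape is the element-wise product of the input shapes:

X_k = torch.randn(3, 2)        # m=3, n=2
I_k = torch.eye(2)             # n1=2
torch.kron(I_k, X_k).shape     # torch.Size([6, 4]) -> (n1*m, n1*n)
# the 4-axis tensor built in the update below arranges these same blocks
# along separate axes instead of stacking them into one big matrix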

therefore using Einstein summation

$$
\begin{align*}
{\color{Magenta}{\frac{\partial L}{\partial \mathbf{b}}}} &= {\color{Orange}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Cyan}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&\in \mathbb{R}^{{\color{Orange}{(m \times n_{1})}} \times {\color{Cyan}{(m \times n_{1} \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta}{n_{1}}}
\end{align*}
$$

and

$$
\begin{align*}
{\color{Magenta}{\frac{\partial L}{\partial \mathbf{W}}}} &= {\color{Orange}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Cyan}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&\in \mathbb{R}^{{\color{Orange}{(m \times n_{1})}} \times {\color{Cyan}{(m \times n_{1} \times n \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta}{n \times n_{1}}}
\end{align*}
$$
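
Before implementing the model, here is a quick sketch (illustrative names) showing that contracting $\frac{\partial L}{\partial \mathbf{\hat{Y}}}$ with the 4-axis Jacobian via einsum reproduces the compact formula $\mathbf{X}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}$ from before:

m_e, n_e, n1_e = 6, 4, 3
X_e = torch.randn(m_e, n_e)
delta_e = torch.randn(m_e, n1_e)  # stands in for dL/dY_hat
J_e = torch.kron(
    X_e.unsqueeze(1).unsqueeze(3),              # shape (m, 1, n, 1)
    torch.eye(n1_e).unsqueeze(0).unsqueeze(2)   # shape (1, n1, 1, n1)
)                                               # result: (m, n1, n, n1)
torch.allclose(torch.einsum('pq,pqij->ij', delta_e, J_e), X_e.T @ delta_e)  # expected: True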

Model

class EinsumLinearRegression(LinearRegression):
    def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
        """
        Update the model parameters.

        Args:
            x: Input tensor of shape (n_samples, n_features).
            y_true: Target tensor of shape (n_samples, out_features).
            y_pred: Predicted output tensor of shape (n_samples, out_features).
            lr: Learning rate. 
        """
        delta = 2 * (y_pred - y_true) / y_true.numel()
        # d L / d b
        self.b -= lr * delta.sum(axis=0)
        # d L / d W
        identity = torch.eye(y_true.shape[-1], device=device)
        w_der = torch.kron(
            x.unsqueeze(1).unsqueeze(3),
            identity.unsqueeze(0).unsqueeze(2)
        )
        self.w -= lr * torch.einsum('pq,pqij->ij', delta, w_der)
einsum_model = EinsumLinearRegression(N, NO)
einsum_model.b.copy_(parameters[0])
einsum_model.w.copy_(parameters[1])
tensor([[ 0.3878, -0.0049,  0.2929],
        [ 0.1414, -0.0438,  0.0712],
        [-0.3527,  0.3928, -0.0829],
        [-0.3182, -0.0721,  0.0125],
        [-0.3025,  0.3572, -0.1896],
        [ 0.2709, -0.2265,  0.2992]])
einsum_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086
mape(
    einsum_model.w.clone(),
    torch_model.layer.weight.detach().T
)
1.133194750373222e-14
mape(
    einsum_model.b.clone(),
    torch_model.layer.bias.detach()
)
8.576651852262388e-14