
1.2 - Multivariate Linear Regression

Now we are going to increase the complexity: instead of the perceptron having a single output, it will now have multiple outputs. The word “multivariate” usually means that the model receives multiple inputs, but here we use it to mean that the perceptron has multiple outputs.

We can think of the multivariate perceptron as a layer of simple perceptrons, where each perceptron's output corresponds to one output feature.

Purpose of this Notebook:

In this notebook we will:

  1. Create a dataset for a multivariate linear regression task

  2. Create our own Multivariate Perceptron class from scratch

  3. Compute the gradients for gradient descent from scratch

  4. Train our Multivariate Perceptron

  5. Compare our Perceptron to the one prebuilt by PyTorch

  6. [Extra] Compute the gradients another way (using Einstein summation)

import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__
('3.12.12', '2.9.0+cu128')
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device
'cpu'
torch.set_default_dtype(torch.float64)
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper
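
As a quick illustration (not part of the pipeline below), here is how the decorator is used. `Demo` and `greet` are hypothetical names chosen just for this sketch.

class Demo:
    pass

@add_to_class(Demo)
def greet(self):
    # after decoration, greet is registered as a method of Demo
    return 'hello'

Demo().greet()  # -> 'hello'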

Dataset

create dataset

$$
\begin{align*}
\mathbf{X} &\in \mathbb{R}^{m \times n} \\
\mathbf{Y} &\in \mathbb{R}^{m \times n_{1}}
\end{align*}
$$

where $n_{1}$ is the number of output features.

$$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
$$

$$
\mathbf{Y} = \begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1n_{1}} \\
y_{21} & y_{22} & \cdots & y_{2n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m1} & y_{m2} & \cdots & y_{mn_{1}}
\end{bmatrix}
$$
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)
(10100, 6)
(10100, 3)

split dataset

X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape
(torch.Size([100, 6]), torch.Size([100, 3]))
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape
(torch.Size([10000, 6]), torch.Size([10000, 3]))

delete raw dataset

del X
del Y

Scratch model

weights and bias

trainable parameters

$$
\begin{align*}
\mathbf{W} &\in \mathbb{R}^{n \times n_{1}} \\
\mathbf{b} &\in \mathbb{R}^{n_{1}}
\end{align*}
$$

$$
\mathbf{W} = \begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1n_{1}} \\
w_{21} & w_{22} & \cdots & w_{2n_{1}} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n1} & w_{n2} & \cdots & w_{nn_{1}}
\end{bmatrix}
$$

$$
\mathbf{b} = \begin{bmatrix}
b_{1} \\ b_{2} \\ \vdots \\ b_{n_{1}}
\end{bmatrix}
$$
class LinearRegression:
    def __init__(self, n_features: int, out_features: int):
        self.w = torch.randn(n_features, out_features, device=device)
        self.b = torch.randn(out_features, device=device)

    def copy_params(self, torch_layer: torch.nn.modules.linear.Linear):
        """
        Copy the parameters from a module.linear to this model.

        Args:
            torch_layer: Pytorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight.T.detach().clone())

weighted sum

$$
\begin{align*}
\mathbf{\hat{Y}}(\mathbf{X}) &= \mathbf{X}\mathbf{W} + \mathbf{b} \\
\mathbf{\hat{Y}} &: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n_{1}}
\end{align*}
$$

where

$$
\hat{y}_{ij} = \mathbf{x}_{i}^\top \mathbf{w}_{:,j} + b_{j}
$$

for all $i = 1, \ldots, m$ and $j = 1, \ldots, n_{1}$.

@add_to_class(LinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x

    Args:
        x: Input tensor of shape (n_samples, n_features).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples, out_features).
    """
    return torch.matmul(x, self.w) + self.b

MSE

Mean Squared Error

$$
\begin{align*}
L(\mathbf{\hat{Y}}) &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij})^{2} \\
L &: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}
\end{align*}
$$

Vectorized form

$$
L(\mathbf{\hat{Y}}) = \frac{1}{mn_{1}} \operatorname{sum} \left( \left( \mathbf{\hat{Y}} - \mathbf{Y} \right)^2 \right)
$$

where $\mathbf{A}^2$ denotes the element-wise power, that is, $\mathbf{A}^2 = \mathbf{A} \odot \mathbf{A}$.
Note: $\odot$ is the element-wise product, also called the Hadamard product.
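
As a small sanity check (illustrative only; `A` and `B` are arbitrary tensors), the vectorized form matches the element-wise definition:

A = torch.randn(4, 3)
B = torch.randn(4, 3)
# (1 / (m * n1)) * sum((A - B) ⊙ (A - B)) equals the mean of the squared differences
torch.allclose(((A - B) * (A - B)).sum() / (A - B).numel(), ((A - B) ** 2).mean())  # expected: True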

@add_to_class(LinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples, out_features).
        y_pred: Predicted tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(LinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples, n_features).
        y_true: Target tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

compute gradients

There are two ways to compute gradients

  1. Computing each derivative individually and then joining them using the Einstein summation.

  2. Computing an initial derivative and passing it backwards as an argument.

Method 2 is the most common because it is easier to reason about and computationally cheaper, whereas method 1 requires materializing much larger intermediate tensors. We prefer method 2, but we will also use method 1 at the end for comparison.

MSE derivative

$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial \hat{y}_{pq}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}}
\end{align*}
$$

for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} = \begin{cases} 1 & \text{if } i=p, j=q \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} \\
&= \frac{2}{mn_{1}} (\hat{y}_{pq} - y_{pq})
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{\hat{Y}}} = \frac{2}{mn_{1}} \left( \mathbf{\hat{Y}} - \mathbf{Y} \right)
$$
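
As a quick sketch (illustrative names, small random tensors), we can check this formula against PyTorch autograd:

Y_hat_check = torch.randn(5, 3, requires_grad=True)
Y_check = torch.randn(5, 3)
loss = ((Y_hat_check - Y_check) ** 2).mean()  # same MSE as above
loss.backward()
analytic = 2 * (Y_hat_check.detach() - Y_check) / Y_check.numel()  # 2/(m*n1) * (Y_hat - Y)
torch.allclose(Y_hat_check.grad, analytic)  # expected: True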

weighted sum derivative

with respect to the bias
$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial b_{p}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial b_{p}}
\end{align*}
$$

for all $p = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} = \begin{cases} 1 & \text{if } j=p \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{y}_{ip}}
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{b}} = \mathbf{1}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$

where $\mathbf{1} \in \mathbb{R}^{m}$ is a vector of ones.

with respect to the weights
$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &= \frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial}{\partial w_{pq}} \left( (\hat{y}_{ij} - y_{ij})^2 \right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} (\hat{y}_{ij} - y_{ij}) \frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial w_{pq}}
\end{align*}
$$

for all $p = 1, \ldots, n$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} = \begin{cases} x_{ip} & \text{if } j=q \\ 0 & \text{otherwise} \end{cases}
$$

then

$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}} \frac{\partial L}{\partial \hat{y}_{ij}} \frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} x_{ip} \frac{\partial L}{\partial \hat{y}_{iq}} \\
&= (\mathbf{x}_{:,p})^\top \frac{\partial L}{\partial \hat{\mathbf{y}}_{:,q}} \\
&= \mathbf{x}^\top_{p,:} \frac{\partial L}{\partial \hat{\mathbf{y}}_{:,q}}
\end{align*}
$$

therefore

$$
\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$

gradients

$$
\begin{align*}
\nabla_{\mathbf{b}}L = \frac{\partial L}{\partial \mathbf{b}} &= {\color{Cyan}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Orange}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&= {\color{Cyan}{\frac{2}{mn_{1}}}} {\color{Orange}{\mathbf{1}^\top}} {\color{Cyan}{\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$

and

$$
\begin{align*}
\nabla_{\mathbf{W}}L = \frac{\partial L}{\partial \mathbf{W}} &= {\color{Cyan}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Magenta}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&= {\color{Cyan}{\frac{2}{mn_{1}}}} {\color{Magenta}{\mathbf{X}^\top}} {\color{Cyan}{\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$
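
Both closed-form gradients can be checked against autograd; here is a minimal sketch on small random tensors (all names below are illustrative):

m_c, n_c, n1_c = 8, 4, 3
X_c = torch.randn(m_c, n_c)
Y_c = torch.randn(m_c, n1_c)
W_c = torch.randn(n_c, n1_c, requires_grad=True)
b_c = torch.randn(n1_c, requires_grad=True)
loss = ((X_c @ W_c + b_c - Y_c) ** 2).mean()
loss.backward()
delta = 2 * (X_c @ W_c.detach() + b_c.detach() - Y_c) / Y_c.numel()  # dL/dY_hat
print(torch.allclose(b_c.grad, delta.sum(dim=0)))  # 1^T dL/dY_hat -> expected: True
print(torch.allclose(W_c.grad, X_c.T @ delta))     # X^T dL/dY_hat -> expected: True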

Parameter update

@add_to_class(LinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples, out_features).
       y_pred: Predicted output tensor of shape (n_samples, out_features).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / y_true.numel()
    self.b -= lr * delta.sum(axis=0)
    self.w -= lr * torch.matmul(x.T, delta)

fit (train)

@add_to_class(LinearRegression)
def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x_train: Input tensor of shape (n_samples, n_features).
        y_train: Target tensor of shape (n_samples, out_features).
        epochs: Number of epochs to train for.
        lr: Learning rate.
        batch_size: Number of samples per mini-batch.
        x_valid: Input tensor of shape (n_valid_samples, n_features).
        y_valid: Target tensor of shape (n_valid_samples, out_features).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y_train), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x_train[batch:end_batch])

            loss.append(self.mse_loss(
                y_train[batch:end_batch], 
                y_pred
            ))

            self.update(
                x_train[batch:end_batch], 
                y_train[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

Scratch vs Torch.nn

Torch.nn model

class TorchLinearRegression(nn.Module):
    def __init__(self, n_features, n_out_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, n_out_features, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
torch_model = TorchLinearRegression(N, NO)

scratch model

model = LinearRegression(N, NO)

evals

import modified MAPE

# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.

predict

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
2012.6907730318287

copy parameters

model.copy_params(torch_model.layer)
parameters = (model.b.clone(), model.w.clone())

predict after copy parameters

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
0.0

loss

mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid, Y_valid)
)
0.0

train

LR = 0.01 # learning rate
EPOCHS = 16 # number of epochs
BATCH = len(X_train) // 3 # batch size
torch_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086
model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086

predict after training

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)
6.41623922729972e-14

weight

mape(
    model.w.clone(),
    torch_model.layer.weight.detach().T
)
1.1964026011221981e-14

bias

mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)
9.992714588903911e-14

Compute gradients with einsum

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &= \frac{\partial L}{\partial \mathbf{\hat{Y}}} \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} \\
\frac{\partial L}{\partial \mathbf{b}} &= \frac{\partial L}{\partial \mathbf{\hat{Y}}} \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}
\end{align*}
$$

where their shapes are

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &\in \mathbb{R}^{n \times n_{1}} \\
\frac{\partial L}{\partial \mathbf{b}} &\in \mathbb{R}^{n_{1}} \\
\frac{\partial L}{\partial \mathbf{\hat{Y}}} &\in \mathbb{R}^{m \times n_{1}} \\
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} &\in \mathbb{R}^{(m \times n_{1}) \times (n \times n_{1})} \\
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}} &\in \mathbb{R}^{(m \times n_{1}) \times (n_{1})}
\end{align*}
$$

Note that $\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}$ has four axes; this is one reason why this method requires more computation.
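
To see those four axes concretely, here is a sketch (illustrative names, tiny problem) that builds the full Jacobian with torch.autograd.functional.jacobian:

from torch.autograd.functional import jacobian

X_j = torch.randn(5, 4)   # m=5, n=4
W_j = torch.randn(4, 3)   # n1=3
b_j = torch.randn(3)
# Jacobian of the weighted sum with respect to W
J = jacobian(lambda W: X_j @ W + b_j, W_j)
J.shape  # torch.Size([5, 3, 4, 3]) -> (m, n1, n, n1)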

weighted sum derivative with respect to the bias

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} = \begin{cases} 1 & \text{if } j=p \\ 0 & \text{if } j\neq p \end{cases}
$$

for all $i = 1, \ldots, m$ and $j, p = 1, \ldots, n_{1}$.

weighted sum derivative with respect to the weights

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} = \begin{cases} x_{ip} & \text{if } j=q \\ 0 & \text{if } j\neq q \end{cases}
$$

for all $i = 1, \ldots, m$, $j, q = 1, \ldots, n_{1}$ and $p = 1, \ldots, n$.

Vectorized form

$$
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} = \mathbb{I} \otimes \mathbf{X}
$$

where $\otimes$ is the Kronecker product.
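
As a quick reminder of how the Kronecker product behaves (illustrative only), torch.kron of an identity with a small matrix produces a block-diagonal result whose shape is the element-wise product of the input shapes:

X_k = torch.randn(3, 2)        # m=3, n=2
I_k = torch.eye(2)             # n1=2
torch.kron(I_k, X_k).shape     # torch.Size([6, 4]) -> (n1*m, n1*n)
# the 4-axis tensor built in the update below arranges these same blocks
# along separate axes instead of stacking them into one big matrix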

therefore using Einstein summation

$$
\begin{align*}
{\color{Magenta}{\frac{\partial L}{\partial \mathbf{b}}}} &= {\color{Orange}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Cyan}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&\in \mathbb{R}^{{\color{Orange}{(m \times n_{1})}} \times {\color{Cyan}{(m \times n_{1} \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta}{n_{1}}}
\end{align*}
$$

and

$$
\begin{align*}
{\color{Magenta}{\frac{\partial L}{\partial \mathbf{W}}}} &= {\color{Orange}{\frac{\partial L}{\partial \mathbf{\hat{Y}}}}} {\color{Cyan}{\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&\in \mathbb{R}^{{\color{Orange}{(m \times n_{1})}} \times {\color{Cyan}{(m \times n_{1} \times n \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta}{n \times n_{1}}}
\end{align*}
$$
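
Before implementing the model, here is a quick sketch (illustrative names) showing that contracting $\frac{\partial L}{\partial \mathbf{\hat{Y}}}$ with the 4-axis Jacobian via einsum reproduces the compact formula $\mathbf{X}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}$ from before:

m_e, n_e, n1_e = 6, 4, 3
X_e = torch.randn(m_e, n_e)
delta_e = torch.randn(m_e, n1_e)  # stands in for dL/dY_hat
J_e = torch.kron(
    X_e.unsqueeze(1).unsqueeze(3),              # shape (m, 1, n, 1)
    torch.eye(n1_e).unsqueeze(0).unsqueeze(2)   # shape (1, n1, 1, n1)
)                                               # result: (m, n1, n, n1)
torch.allclose(torch.einsum('pq,pqij->ij', delta_e, J_e), X_e.T @ delta_e)  # expected: True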

Model

class EinsumLinearRegression(LinearRegression):
    def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
        """
        Update the model parameters.

        Args:
            x: Input tensor of shape (n_samples, n_features).
            y_true: Target tensor of shape (n_samples, out_features).
            y_pred: Predicted output tensor of shape (n_samples, out_features).
            lr: Learning rate. 
        """
        delta = 2 * (y_pred - y_true) / y_true.numel()
        # d L / d b
        self.b -= lr * delta.sum(axis=0)
        # d L / d W
        identity = torch.eye(y_true.shape[-1], device=device)
        w_der = torch.kron(
            x.unsqueeze(1).unsqueeze(3),
            identity.unsqueeze(0).unsqueeze(2)
        )
        self.w -= lr * torch.einsum('pq,pqij->ij', delta, w_der)
einsum_model = EinsumLinearRegression(N, NO)
einsum_model.b.copy_(parameters[0])
einsum_model.w.copy_(parameters[1])
tensor([[ 0.3878, -0.0049,  0.2929],
        [ 0.1414, -0.0438,  0.0712],
        [-0.3527,  0.3928, -0.0829],
        [-0.3182, -0.0721,  0.0125],
        [-0.3025,  0.3572, -0.1896],
        [ 0.2709, -0.2265,  0.2992]])
einsum_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)
epoch: 0 - MSE: 17391.688 - MSE_v: 20044.5903
epoch: 1 - MSE: 16412.7443 - MSE_v: 19176.1657
epoch: 2 - MSE: 15515.7891 - MSE_v: 18354.4923
epoch: 3 - MSE: 14691.1138 - MSE_v: 17575.8712
epoch: 4 - MSE: 13930.3666 - MSE_v: 16837.0262
epoch: 5 - MSE: 13226.3524 - MSE_v: 16135.0464
epoch: 6 - MSE: 12572.8618 - MSE_v: 15467.3357
epoch: 7 - MSE: 11964.5268 - MSE_v: 14831.5704
epoch: 8 - MSE: 11396.6974 - MSE_v: 14225.6629
epoch: 9 - MSE: 10865.3368 - MSE_v: 13647.7305
epoch: 10 - MSE: 10366.9328 - MSE_v: 13096.0693
epoch: 11 - MSE: 9898.4219 - MSE_v: 12569.1311
epoch: 12 - MSE: 9457.1245 - MSE_v: 12065.5042
epoch: 13 - MSE: 9040.6908 - MSE_v: 11583.8965
epoch: 14 - MSE: 8647.0537 - MSE_v: 11123.1215
epoch: 15 - MSE: 8274.389 - MSE_v: 10682.086
mape(
    einsum_model.w.clone(),
    torch_model.layer.weight.detach().T
)
1.133194750373222e-14
mape(
    einsum_model.b.clone(),
    torch_model.layer.bias.detach()
)
8.576651852262388e-14