
1.1 - Simple Linear Regression

Before getting into deep learning, the perceptron is a good place to start, as it is the basic unit with which artificial neural networks (ANNs) are built. We can then use multiple perceptrons in parallel to form a dense layer, and by stacking multiple dense layers we can build a deep neural network (DNN).

The objective of simple linear regression is to predict the target data $\mathbf{y}$ based on the input data $\mathbf{x}$

$$\mathbf{y} = f(\mathbf{x}) + \epsilon$$

where $f(\cdot)$ is the true function, which is unknown but fixed, and $\epsilon$ is intrinsic noise independent of $\mathbf{x}$.

So, let's estimate $\hat{f}(\cdot)$ assuming that $f$ is linear

$$\hat{\mathbf{y}} = \hat{f}(\mathbf{x})$$

where $\hat{\mathbf{y}}$ is our estimated prediction.

Purpose of this Notebook:

  1. Create a dataset for a simple linear regression task

  2. Create our own Perceptron class from scratch

  3. Calculate the gradient descent from scratch

  4. Train our Perceptron

  5. Compare our Perceptron to the one prebuilt by PyTorch

Setup

print('Start package installation...')
Start package installation...
%%capture
%pip install torch
%pip install scikit-learn
print('Packages installed successfully!')
Packages installed successfully!
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__
('3.12.12', '2.9.0+cu128')
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device
'cpu'
torch.set_default_dtype(torch.float64)
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper
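As a quick sanity check, the decorator can be exercised on a toy class. `Counter` and `increment` are hypothetical names chosen for this sketch, and `add_to_class` is repeated so the snippet is self-contained:

```python
# Toy demonstration of the decorator above. `Counter` and `increment`
# are hypothetical names; `add_to_class` is repeated here so the
# snippet stands on its own.
class Counter:
    def __init__(self) -> None:
        self.n = 0

def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper

@add_to_class(Counter)
def increment(self) -> None:
    # becomes a bound method of Counter after class creation
    self.n += 1

c = Counter()
c.increment()
assert c.n == 1
```

This is the same pattern the notebook uses below to add `predict`, `mse_loss`, and the other methods to `SimpleLinearRegression` one cell at a time.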

Dataset

create dataset

For our supervised task, we have a dataset denoted

$$\mathcal{D} = \left\{ (x_{1}, y_{1}), \ldots, (x_{m}, y_{m}) \right\}$$

where $m$ is the number of samples in our dataset.

We assume that $x_{i}$ predicts $y_{i}$, and that $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$ are independent and identically distributed (the i.i.d. assumption). Independent means that two samples $(x_{i}, y_{i}), (x_{j}, y_{j}), i \neq j$ do not statistically depend on each other, and identically distributed means that every $(x_{i}, y_{i})$ is drawn from the same unknown distribution.

The input data $\mathbf{x}$ can be represented as a vector

$$\mathbf{x} = \begin{bmatrix} x_{1} \\ \vdots \\ x_{m} \end{bmatrix} \in \mathbb{R}^{m}$$

and the target data $\mathbf{y}$ can also be represented as a vector

$$\mathbf{y} = \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} \in \mathbb{R}^{m}$$
from sklearn.datasets import make_regression
import random


M: int = 10_100 # number of samples

X, Y = make_regression(
    n_samples=M,
    n_features=1,
    n_targets=1,
    bias=random.random(), # random true bias
    noise=1
)

X = X.squeeze() # remove the axis of length 1

print(X.shape)
print(Y.shape)
(10100,)
(10100,)

split dataset

We are going to split the dataset $\mathcal{D}$ into three sets: the training dataset $\mathcal{D}_{\text{train}}$, the validation dataset $\mathcal{D}_{\text{valid}}$ and the test dataset $\mathcal{D}_{\text{test}}$.

  • The training dataset is used to fit model parameters

  • The validation dataset is used to adjust hyperparameters, select models or evaluate training

  • The test dataset is used to evaluate our pre-trained models

Remark: $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{valid}}$ and $\mathcal{D}_{\text{test}}$ are pairwise disjoint, e.g. $\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{valid}} = \varnothing$.

Let's refer to $\mathbf{x}_{\text{train}}, \mathbf{y}_{\text{train}}$ as the training input and target data, $\mathbf{x}_{\text{valid}}, \mathbf{y}_{\text{valid}}$ as the validation input and target data, and $\mathbf{x}_{\text{test}}, \mathbf{y}_{\text{test}}$ as the test input and target data.

X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape
(torch.Size([100]), torch.Size([100]))
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape
(torch.Size([10000]), torch.Size([10000]))

⭐️ We are going to leave out the test dataset for now.

delete raw dataset

del X
del Y

Scratch simple perceptron

weight and bias

Our model $\hat{\mathbf{y}}(\cdot)$ has two trainable parameters $b, w \in \mathbb{R}$, which are called the bias and the weight respectively.

class SimpleLinearRegression:
    def __init__(self) -> None:
        self.b = torch.randn(1, device=device)
        self.w = torch.randn(1, device=device)

    def copy_params(self, torch_layer: nn.modules.linear.Linear) -> None:
        """
        Copy the parameters from a module.linear to this model.

        Args:
            torch_layer: Pytorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight[0,:].detach().clone())

weighted sum

We selected the weighted sum for $\hat{\mathbf{y}}(\cdot)$ as a linear approximation of the true function $f(\cdot)$

$$\begin{align} \hat{\mathbf{y}}: \mathbb{R}^{m} &\to \mathbb{R}^{m} \\ \mathbf{x} &\mapsto \hat{\mathbf{y}}(\mathbf{x}) = b + w \mathbf{x} \end{align}$$

where $\mathbf{x} \in \mathbb{R}^{m}$ is some input (not necessarily the training data).

Note: we write $\hat{\mathbf{y}}$ in bold because, given a vector $\mathbf{x}$, $\hat{\mathbf{y}}$ is a vector too.

Given an input $\mathbf{x}$

$$\begin{align} \hat{\mathbf{y}} &= b + w \mathbf{x} \\ &= b + w \begin{bmatrix} x_{1} \\ \vdots \\ x_{m} \end{bmatrix} \\ &= \begin{bmatrix} b + wx_{1} \\ \vdots \\ b + wx_{m} \end{bmatrix} \end{align}$$

Note: we are going to call $\hat{\mathbf{y}}$ the predicted data.

Remark: we can add a scalar $b \in \mathbb{R}$ to a vector thanks to the broadcasting mechanism.
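The broadcasting remark can be seen on a tiny example (toy values, not the notebook's dataset): the scalar-shaped bias expands to every component of the vector.

```python
import torch

# Broadcasting: the scalar-shaped bias b expands to every component
# of x, so b + w * x computes [b + w*x_1, ..., b + w*x_m] at once.
b = torch.tensor([0.5])
w = torch.tensor([2.0])
x = torch.tensor([1.0, 2.0, 3.0])

y_hat = b + w * x
assert torch.allclose(y_hat, torch.tensor([2.5, 4.5, 6.5]))
```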

@add_to_class(SimpleLinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x.

    Args:
        x: Input tensor of shape (n_samples,).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples,).
    """
    return self.b + self.w * x

MSE

We need a loss function $L$ to help guide the adjustment of our parameters during training. We will use the Mean Squared Error (MSE) as loss function

$$\begin{align} L: \mathbb{R}^{m} &\to \mathbb{R}^{+} \\ \hat{\mathbf{y}} &\mapsto L(\hat{\mathbf{y}}), \; \hat{\mathbf{y}} \in \mathbb{R}^{m} \end{align}$$

MSE is defined as

$$L(\hat{\mathbf{y}}) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right)^{2}$$

or, in vectorized form,

$$L(\hat{\mathbf{y}}) = \frac{1}{m} \left\| \hat{\mathbf{y}} - \mathbf{y} \right\|^{2}_{2}$$

where $\left\| \mathbf{x} \right\|_{2}$ is the Euclidean norm, also called the $\ell_{2}$ norm (L2 norm).
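The summation form and the norm form are the same number, which we can confirm on random toy data (not the notebook's dataset):

```python
import torch

# The summation form and the vectorized l2-norm form of MSE agree.
torch.manual_seed(0)
m = 5
y = torch.randn(m)
y_hat = torch.randn(m)

mse_sum = sum((y_hat[i] - y[i]) ** 2 for i in range(m)) / m
mse_vec = torch.linalg.vector_norm(y_hat - y, ord=2) ** 2 / m

assert torch.isclose(mse_sum, mse_vec)
```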

@add_to_class(SimpleLinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples,).
        y_pred: Predicted tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(SimpleLinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples,).
        y_true: Target tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

gradients

To make adjustments to our model, it is necessary to compute derivatives.

  • First, determine the derivatives to be computed

  • Then, ascertain the size of each derivative

  • Finally, compute the derivatives

⭐️ We are using Einstein notation, which implies summation over repeated indices. For example

$$a_{i} b_{i} \equiv \sum_{i} a_{i} b_{i}$$

We will use Einstein notation for the chain-rule summation, for example

$$\frac{\partial f}{\partial g_{i}} \frac{\partial g_{i}}{\partial x} \equiv \sum_{i} \frac{\partial f}{\partial g_{i}} \frac{\partial g_{i}}{\partial x}$$
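PyTorch exposes the same convention through `torch.einsum`: a repeated index in the subscript string is summed over. A toy example (values chosen for illustration):

```python
import torch

# torch.einsum follows the Einstein convention: the repeated index i
# in 'i,i->' is summed over, giving the dot product a_i b_i.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

implied = torch.einsum('i,i->', a, b)   # sum_i a_i b_i
explicit = (a * b).sum()

assert torch.isclose(implied, explicit)  # both equal 32.0
```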

Using the chain rule, we can determine the derivatives we need. The gradient of the MSE with respect to the bias is

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}_{p}} \frac{\partial \hat{y}_{p}}{\partial b}$$

The gradient of the MSE with respect to the weight is

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}_{p}} \frac{\partial \hat{y}_{p}}{\partial w}$$

Okay, we have determined the derivatives we need. Now let's calculate their shapes

$$\frac{\partial L}{\partial b} \in \mathbb{R}, \quad \frac{\partial L}{\partial w} \in \mathbb{R}, \quad \frac{\partial L}{\partial \hat{\mathbf{y}}} \in \mathbb{R}^{m}, \quad \frac{\partial \hat{\mathbf{y}}}{\partial b} \in \mathbb{R}^{m}, \quad \frac{\partial \hat{\mathbf{y}}}{\partial w} \in \mathbb{R}^{m}$$

All right, now we can compute each derivative.

MSE derivative

The derivative of the MSE with respect to the predicted data is

$$\begin{align} \frac{\partial L}{\partial \hat{y}_{p}} &= \frac{\partial}{\partial \hat{y}_{p}} \left( \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_{i} - y_{i} \right)^{2} \right) \\ &= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \hat{y}_{p}} \left(\hat{y}_{i} - y_{i} \right)^{2} \\ &= \frac{2}{m} \sum_{i=1}^{m} \left(\hat{y}_{i} - y_{i} \right) \frac{\partial \hat{y}_{i}}{\partial \hat{y}_{p}} \\ &= \frac{2}{m} \sum_{i=1}^{m} \left(\hat{y}_{i} - y_{i} \right) \delta_{ip} \\ &= \frac{2}{m} \sum_{i=1}^{m} \left[\hat{\mathbf{y}} - \mathbf{y} \right]_{i} \delta_{ip} \\ &= \frac{2}{m} \left[\hat{\mathbf{y}} - \mathbf{y} \right]_{p} \\ &= \frac{2}{m} \left(\hat{y}_{p} - y_{p} \right) \end{align}$$

for $p = 1, \ldots, m$.

Remark: the Kronecker delta $\delta_{ij}$ is defined as

$$\delta_{ij} = \begin{cases} 1 & \text{if } i=j \\ 0 & \text{if } i \neq j \end{cases}$$

and for any tensor $\mathbf{a}$, $a_{i} \delta_{ij} = a_{j}$ (summing over $i$).

The vectorized form is

$$\frac{\partial L}{\partial \hat{\mathbf{y}}} = \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)$$
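As a sketch, this closed form can be checked against PyTorch's autograd on random toy data (not part of the notebook's pipeline):

```python
import torch

# Sanity check of dL/dy_hat = (2/m)(y_hat - y) against autograd.
torch.manual_seed(0)
m = 4
y = torch.randn(m)
y_hat = torch.randn(m, requires_grad=True)

loss = ((y_hat - y) ** 2).mean()  # MSE
loss.backward()

analytic = 2 / m * (y_hat.detach() - y)
assert torch.allclose(y_hat.grad, analytic)
```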

weighted sum derivative

respect to bias

The derivative of the weighted sum with respect to the bias is

$$\begin{align} \frac{\partial \hat{y}_{p}}{\partial b} &= \frac{\partial}{\partial b} \left(b + w x_{p} \right) \\ &= 1 \end{align}$$

for all $p = 1, \ldots, m$.

Then, the vectorized form is

$$\frac{\partial \hat{\mathbf{y}}}{\partial b} = \mathbf{1}$$

where $\mathbf{1} \in \mathbb{R}^{m}$ is the vector of ones.

respect to weight

The derivative of the weighted sum with respect to the weight is

$$\begin{align} \frac{\partial \hat{y}_{p}}{\partial w} &= \frac{\partial}{\partial w} \left( b + w x_{p} \right) \\ &= \frac{\partial}{\partial w} \left( w x_{p} \right) \\ &= x_{p} \end{align}$$

for all $p = 1, \ldots, m$.

Then, the vectorized form is

$$\frac{\partial \hat{\mathbf{y}}}{\partial w} = \mathbf{x}$$

full chain rule

The derivative of the MSE with respect to the bias is

$$\begin{align} \frac{\partial L}{\partial b} &= \frac{\partial L}{\partial \hat{y}_{p}} \frac{\partial \hat{y}_{p}}{\partial b} \\ &= \frac{2}{m} \left(\hat{y}_{p} - y_{p} \right) 1_{p} \\ &= \frac{2}{m} \left<\hat{\mathbf{y}} - \mathbf{y}, \mathbf{1} \right> \\ &= \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{1} \end{align}$$

The derivative of the MSE with respect to the weight is

$$\begin{align} \frac{\partial L}{\partial w} &= \frac{\partial L}{\partial \hat{y}_{p}} \frac{\partial \hat{y}_{p}}{\partial w} \\ &= \frac{2}{m} \left(\hat{y}_{p} - y_{p} \right) x_{p} \\ &= \frac{2}{m} \left<\hat{\mathbf{y}} - \mathbf{y}, \mathbf{x} \right> \\ &= \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{x} \end{align}$$

Note: we write the dot product $\mathbf{a}^{\top}\mathbf{b}$ as the inner product $\left< \mathbf{a}, \mathbf{b} \right>$.

Remark: Einstein notation implies summation

$$(\hat{y}_{p} - y_{p}) a_{p} = [\hat{\mathbf{y}} - \mathbf{y}]_{p} a_{p} \equiv \sum_{p} (\hat{y}_{p} - y_{p}) a_{p}$$

final gradients

$$\nabla_{b} L = \frac{\partial L}{\partial b} = \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{1}$$

$$\nabla_{w} L = \frac{\partial L}{\partial w} = \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{x}$$
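Both closed-form gradients can be verified against autograd on random toy data (a sketch, separate from the notebook's training pipeline):

```python
import torch

# Check the closed-form gradients of MSE w.r.t. b and w against autograd.
torch.manual_seed(0)
m = 8
x = torch.randn(m)
y = torch.randn(m)
b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

loss = ((b + w * x - y) ** 2).mean()  # MSE of y_hat = b + w x
loss.backward()

residual = (b + w * x - y).detach()
grad_b = 2 / m * residual.sum()           # (2/m)(y_hat - y)^T 1
grad_w = 2 / m * torch.dot(residual, x)   # (2/m)(y_hat - y)^T x

assert torch.allclose(b.grad, grad_b)
assert torch.allclose(w.grad, grad_w)
```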

parameters update

Now, let's update the trainable parameters using gradient descent (GD) as follows

$$b \leftarrow b - \eta \nabla_{b}L = b - \eta \left( \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y})^{\top} \mathbf{1} \right)$$

$$w \leftarrow w - \eta \nabla_{w}L = w - \eta \left( \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y})^{\top} \mathbf{x} \right)$$

where $\eta \in \mathbb{R}^{+}$ is called the learning rate.

@add_to_class(SimpleLinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, 
           y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples,).
       y_true: Target tensor of shape (n_samples,).
       y_pred: Predicted output tensor of shape (n_samples,).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / len(y_true)
    self.b -= lr * delta.sum()
    self.w -= lr * torch.dot(delta, x)

gradient descent

We will use mini-batch gradient descent (mini-batch GD) to adjust the parameters of our model

$$\begin{array}{l} \textbf{Algorithm: mini-batch Gradient Descent} \\ \textbf{for } t = 1 \text{ to } T \textbf{ do} \\ \quad i \leftarrow 1 \\ \quad j \leftarrow \mathcal{B} \\ \quad \textbf{while } i < m \textbf{ do} \\ \quad \quad \theta \leftarrow \text{update}(\mathbf{x}_{\text{train}, i:j}, \mathbf{y}_{\text{train}, i:j}; \theta) \\ \quad \quad i \leftarrow i + \mathcal{B} \\ \quad \quad j \leftarrow j + \mathcal{B} \\ \quad \textbf{end while} \\ \textbf{end for} \end{array}$$

where:

  • $T$ is the number of epochs

  • $\theta$ is an arbitrary model parameter, in our case $b$ and $w$

  • $\mathcal{B}$ is the number of samples per minibatch

  • $\mathbf{x}_{\text{train}, i:j}$ and $\mathbf{y}_{\text{train}, i:j}$ are the $i$-th to $j$-th training samples

Note: $\eta, T, \mathcal{B}$ are called hyperparameters, because they are set by the developer rather than learned by the model.

To learn more about types of gradient descents, please see gradient descents.

@add_to_class(SimpleLinearRegression)
def fit(self, x: torch.Tensor, y: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x: Input tensor of shape (n_samples,).
        y: Target tensor of shape (n_samples,).
        epochs: Number of epochs to fit.
        lr: learning rate.
        batch_size: Int number of batch.
        x_valid: Input tensor of shape (n_valid_samples,).
        y_valid: Target tensor of shape (n_valid_samples,).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x[batch:end_batch])

            loss.append(self.mse_loss(
                y[batch:end_batch],
                y_pred
            ))

            self.update(
                x[batch:end_batch], 
                y[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

Scratch vs Torch.nn

We will be implementing a model created with PyTorch’s pre-built classes for linear regression. This will allow us to compare our model from scratch with the PyTorch model.

Torch.nn model

class TorchLinearRegression(nn.Module):
    def __init__(self, n_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, 1, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = [] # train loss
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
        optimizer.zero_grad()
torch_model = TorchLinearRegression(1)

scratch model

model = SimpleLinearRegression()

evals

We will use a metric to compare our model with the PyTorch model.

import modified MAPE

We will use a modification of MAPE as a metric

$$\text{MAPE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{m} \sum^{m}_{i=1} \mathcal{L}(y_{i}, \hat{y}_{i})$$

where

$$\mathcal{L}(y_{i}, \hat{y}_{i}) = \begin{cases} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right| & \text{if } y_{i} \neq 0 \\ \left| \hat{y}_{i} \right| & \text{if } y_{i} = 0 \end{cases}$$
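The notebook imports `torch_mape` from the repository below; as a minimal sketch of the metric itself, the definition above could be implemented like this (`modified_mape` is a hypothetical name, an assumption about the imported function's behavior rather than its actual code):

```python
import torch

# Minimal sketch of the modified MAPE: relative error where the
# target is nonzero, absolute predicted value where it is zero.
# This is an illustration, not the repository's implementation.
def modified_mape(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    nonzero = y_true != 0
    per_sample = torch.empty_like(y_true)
    per_sample[nonzero] = (
        (y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero]
    ).abs()
    per_sample[~nonzero] = y_pred[~nonzero].abs()
    return per_sample.mean().item()

y = torch.tensor([2.0, 0.0, -4.0])
y_hat = torch.tensor([1.0, 3.0, -4.0])
assert abs(modified_mape(y, y_hat) - 3.5 / 3) < 1e-6
```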
# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)
mape imported locally.

predictions

Let’s compare the predictions of our model and PyTorch’s using modified MAPE.

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)
9.959147478579956

They differ considerably because each model has its own parameters initialized randomly and independently of the other model.

copy parameters

We copy the values of the PyTorch model parameters to our model.

model.copy_params(torch_model.layer)

predictions after copy parameters

We measure the difference between the predictions of both models again.

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)
0.0

We can see that their predictions no longer differ.

loss

mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid.unsqueeze(-1), Y_valid.unsqueeze(-1))
)
0.0

training

We are going to train both models using the same hyperparameter values. If our model is well designed, then starting from the same parameters it should arrive at the same parameter values as the PyTorch model after training.

LR: float = 0.01 # learning rate
EPOCHS: int = 16 # number of epochs
BATCH: int = len(X_train) // 3 # number of samples per minibatch
torch_model.fit(
    X_train.unsqueeze(-1), 
    Y_train.unsqueeze(-1),
    EPOCHS, LR, BATCH,
    X_valid.unsqueeze(-1),
    Y_valid.unsqueeze(-1)
)
epoch: 0 - MSE: 3858.4284 - MSE_v: 3678.3555
epoch: 1 - MSE: 3279.5901 - MSE_v: 3158.1161
epoch: 2 - MSE: 2791.5492 - MSE_v: 2715.3176
epoch: 3 - MSE: 2379.5309 - MSE_v: 2337.8917
epoch: 4 - MSE: 2031.2332 - MSE_v: 2015.7238
epoch: 5 - MSE: 1736.4046 - MSE_v: 1740.3273
epoch: 6 - MSE: 1486.4949 - MSE_v: 1504.5734
epoch: 7 - MSE: 1274.3664 - MSE_v: 1302.4666
epoch: 8 - MSE: 1094.0545 - MSE_v: 1128.9585
epoch: 9 - MSE: 940.5701 - MSE_v: 979.7926
epoch: 10 - MSE: 809.736 - MSE_v: 851.3757
epoch: 11 - MSE: 698.0505 - MSE_v: 740.6702
epoch: 12 - MSE: 602.575 - MSE_v: 645.105
epoch: 13 - MSE: 520.8409 - MSE_v: 562.5008
epoch: 14 - MSE: 450.7716 - MSE_v: 491.0079
epoch: 15 - MSE: 390.6182 - MSE_v: 429.0538
model.fit(
    X_train, Y_train,
    EPOCHS, LR, BATCH,
    X_valid, Y_valid
)
epoch: 0 - MSE: 3858.4284 - MSE_v: 3678.3555
epoch: 1 - MSE: 3279.5901 - MSE_v: 3158.1161
epoch: 2 - MSE: 2791.5492 - MSE_v: 2715.3176
epoch: 3 - MSE: 2379.5309 - MSE_v: 2337.8917
epoch: 4 - MSE: 2031.2332 - MSE_v: 2015.7238
epoch: 5 - MSE: 1736.4046 - MSE_v: 1740.3273
epoch: 6 - MSE: 1486.4949 - MSE_v: 1504.5734
epoch: 7 - MSE: 1274.3664 - MSE_v: 1302.4666
epoch: 8 - MSE: 1094.0545 - MSE_v: 1128.9585
epoch: 9 - MSE: 940.5701 - MSE_v: 979.7926
epoch: 10 - MSE: 809.736 - MSE_v: 851.3757
epoch: 11 - MSE: 698.0505 - MSE_v: 740.6702
epoch: 12 - MSE: 602.575 - MSE_v: 645.105
epoch: 13 - MSE: 520.8409 - MSE_v: 562.5008
epoch: 14 - MSE: 450.7716 - MSE_v: 491.0079
epoch: 15 - MSE: 390.6182 - MSE_v: 429.0538

predictions after training

mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)
0.0

bias

We directly measure the difference between the bias values of both models.

mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)
0.0

weight

And measure the difference between the weight values of both models.

mape(
    model.w.clone(),
    torch_model.layer.weight.detach().squeeze(0)
)
0.0

All right, our scratch simple linear regression matches PyTorch's implementation.