# Build a Linear Regression Model

Jishnu Prasad Samal

Linear Regression is one of the oldest and most widely used Machine Learning algorithms. It trains a model on the relationship between two kinds of variables: an independent variable and a dependent (target) variable. If you wish to learn more about AI and Machine Learning, you may see my blog on Artificial Intelligence.

In this project, I will be training a model to predict Sports Sustainability. Sustainability in sports means conducting sporting events with environmentally friendly methods that reduce their negative impact on the environment. Just like every industry, sports has a supply chain issue. When enjoying sports like Cricket or Football, we tend to forget about the environment.

But we need to think beyond the tournament: who is making players’ kits and boots? Where is the water feeding the pitch coming from? How was the stadium built, and how is it maintained? What’s the impact of major tournaments like the Champions’ League or the Olympics, where hastily erected stadiums and hundreds of thousands of fans take over local areas? Moreover, are certain actions, like disposable cups in stadiums or the use of recycled fibres in kits, simply a sticking plaster over much wider issues across the entire supply chain?

For this project, we have a dataset containing the number of suppliers of sports goods and the corresponding carbon emissions from them (in metric tons). I am going to use Scikit-learn for training and evaluating the model. Before building the model, we need to check the data for any errors or ambiguities and identify the trends in it. I will be using Pandas and NumPy for data analysis and Matplotlib as the plotting library for graphs and charts. So, without wasting any further time, let's jump right in.

## Bootstrapping

First, let's install and import the required packages as shown in the code block below.

```python
%pip install pandas numpy matplotlib scikit-learn joblib
```

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

## Data Analysis with Pandas

With the project dependencies installed, let's import our dataset using pandas. I first create the dataframe by importing the dataset with the `pd.read_csv()` function and then drop the year column, as it does not provide any relevant information for our model. This is an example dataset and has only 16 data points, but production models are trained on much larger datasets containing hundreds or thousands of data points.
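As a rough sketch of those two steps: the real CSV file is not shown here, so the snippet below reads a tiny made-up CSV from memory instead. The values and the column name `Year` are assumptions for illustration; only the other two column names come from this post.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; the numbers are made up.
csv_text = """Year,Number of Suppliers,Carbon Emissions from Suppliers (metric tons)
2019,10,65
2020,20,92
2021,30,113
2022,40,141
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Drop the year column (assumed here to be named "Year").
df = df.drop(columns=["Year"])

print(df.head())
```

With a real file, you would pass its path to `pd.read_csv()` instead of the in-memory buffer.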

Then, we will get the columns using `df.columns` and the shape of the dataset using `df.shape`.
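A minimal sketch of both attributes, using a small made-up dataframe in place of the real dataset (the values are assumptions):

```python
import pandas as pd

# Made-up values; only the column names come from this post.
df = pd.DataFrame({
    "Number of Suppliers": [10, 20, 30, 40],
    "Carbon Emissions from Suppliers (metric tons)": [65, 92, 113, 141],
})

print(df.columns.tolist())  # column labels as a plain list
print(df.shape)             # tuple of (number of rows, number of columns)
```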

Then, we will get information about the dataset using `df.info()`. It gives us the columns of the dataset, the non-null count (the number of values in each column that are not null) and the data type of the values in each column.

Then, I am going to analyze the count, mean, standard deviation, min, max and several percentiles of the data (including the median, shown as `50%`) using `df.describe()`.
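Here is a small sketch of `df.info()` and `df.describe()` on made-up data (the values are assumptions, not the real dataset):

```python
import pandas as pd

# Made-up values; only the column names come from this post.
df = pd.DataFrame({
    "Number of Suppliers": [10, 20, 30, 40],
    "Carbon Emissions from Suppliers (metric tons)": [65, 92, 113, 141],
})

# Prints column names, non-null counts and dtypes.
df.info()

# Returns a dataframe with count, mean, std, min, 25%, 50%, 75% and max.
summary = df.describe()
print(summary)
```

The `50%` row of the summary is the median of each column.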

Now, let's draw a scatter plot of the data points using pandas.
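One way to do this is with `DataFrame.plot.scatter()`, which uses Matplotlib under the hood. The data and the plot title below are made up for illustration:

```python
import pandas as pd

# Made-up values; only the column names come from this post.
df = pd.DataFrame({
    "Number of Suppliers": [10, 20, 30, 40],
    "Carbon Emissions from Suppliers (metric tons)": [65, 92, 113, 141],
})

# Returns a Matplotlib Axes; in a notebook the figure renders inline.
ax = df.plot.scatter(
    x="Number of Suppliers",
    y="Carbon Emissions from Suppliers (metric tons)",
    title="Suppliers vs Carbon Emissions (made-up data)",
)
```

If the points roughly follow a straight line, a linear model is a reasonable choice for the data.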

## Model Training

Now that we have completed analyzing the data, we will begin with model training. Before starting, we are going to import the required modules.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import joblib
```

Now, we need to initialize two variables: `x`, which is our independent variable, and `Y`, which is our dependent or target variable. `x` contains the `Number of Suppliers` feature and `Y` contains the `Carbon Emissions from Suppliers (metric tons)` values, which we need to predict.
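A minimal sketch of this selection, again on a made-up dataframe (the values are assumptions):

```python
import pandas as pd

# Made-up values; only the column names come from this post.
df = pd.DataFrame({
    "Number of Suppliers": [10, 20, 30, 40],
    "Carbon Emissions from Suppliers (metric tons)": [65, 92, 113, 141],
})

x = df["Number of Suppliers"]                            # independent variable
Y = df["Carbon Emissions from Suppliers (metric tons)"]  # target variable
```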

Now, we will split our dataset into training data and testing data. Here, we are converting `x` to a `numpy` array and then assigning one part to `x_train`, which is our training data from the `x` variable, and the corresponding part of `Y` to `y_train`, which is our training data from the `Y` variable. Similarly, we are creating `x_test` and `y_test`, which hold our testing data. The size of the testing set is `0.2`, or approximately `20%` of the original dataset.
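The split can be sketched as follows. The 16 data points are made up, and `random_state=42` is my own assumption to make the split repeatable. The `reshape(-1, 1)` call is there because scikit-learn estimators expect the features as a two-dimensional array with one row per sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 16 supplier counts and emissions on a straight line.
x = np.arange(10, 170, 10)
Y = 2.5 * x + 40

# Reshape x into a column: one row per sample, one feature per column.
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), Y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

With 16 rows and `test_size=0.2`, scikit-learn rounds the test set up to 4 rows, which leaves 12 rows for training.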

Now, finally, it's time to fit the data into the model. To do so, first, we need to create an instance of the `LinearRegression()` class imported from `scikit-learn` and assign it to a `model` variable. Then we use `model.fit()` to fit the data, and our model is ready to use.
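As a small sketch, the snippet below fits the model on four made-up points that lie exactly on the line `emissions = 2.5 * suppliers + 40`, so the learned coefficient and intercept should come out very close to 2.5 and 40:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on the line y = 2.5 * x + 40.
x_train = np.array([[10], [20], [30], [40]])
y_train = np.array([65.0, 90.0, 115.0, 140.0])

model = LinearRegression()
model.fit(x_train, y_train)

print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)
```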

Now, let's make predictions using the model built in the previous step. Here, we need to pass the `Number of Suppliers` as a two-dimensional `numpy` array into the `model.predict()` function, and we get the `Carbon Emission in metric tons` as our output.
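A minimal sketch of a prediction, using a model fitted on made-up points from the line `y = 2.5 * x + 40` (not the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on the line y = 2.5 * x + 40.
x_train = np.array([[10], [20], [30], [40]])
y_train = np.array([65.0, 90.0, 115.0, 140.0])
model = LinearRegression().fit(x_train, y_train)

# predict() expects a 2D array: one row per sample, one column per feature.
suppliers = np.array([[50]])
emissions = model.predict(suppliers)

print(emissions)  # close to 165 for this toy data
```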

## Model Evaluation

We have successfully built our model from scratch in the previous step. But now we need to evaluate our model's accuracy and performance. This is a crucial step in the Machine Learning lifecycle. In the model training stage, we created `x_test` and `y_test`, and now we will be using those two to test the model.

I will be creating an array named `y_pred`, which will contain the predicted values for the data points of `x_test`.

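We will be using the model's score, intercept, coefficient and R² score. The score of the model is 99.671% at the time of publishing this blog, which is quite good. For a `LinearRegression` model, `score()` returns the R² (coefficient of determination) of the data passed to it, so a value this close to 1 means the fitted line explains almost all of the variance in that data.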
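The evaluation steps can be sketched as below. The data is synthetic (a straight line plus random noise), and the seed and `random_state` are my own assumptions, so the numbers printed will not match the score quoted above. The sketch also prints the mean squared error and mean absolute error, since those metrics were imported earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the blog's dataset): a line plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(10, 170, 10).reshape(-1, 1)        # 16 supplier counts
Y = 2.5 * x.ravel() + 40 + rng.normal(0, 5, 16)  # made-up emissions

x_train, x_test, y_train, y_test = train_test_split(
    x, Y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)

# Predicted emissions for the held-out test points.
y_pred = model.predict(x_test)

print("Score (R² on test data):", model.score(x_test, y_test))
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
print("R² score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
```

Note that `model.score(x_test, y_test)` and `r2_score(y_test, y_pred)` compute the same quantity when `y_pred` comes from `x_test`.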

Now, I am going to compare the values of `y_test`, which are the actual values of the data points, with the values of `y_pred`, which are the predicted values. I will do this by plotting a graph of Actual vs Predicted values.
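One possible way to draw such a graph with Matplotlib is sketched below. The actual and predicted values are made up, and the dashed diagonal is my own addition to mark where a perfect prediction would fall.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual and predicted values, used only to show the plot.
y_test = np.array([90.0, 140.0, 215.0, 390.0])
y_pred = np.array([92.1, 137.5, 218.0, 388.2])

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, label="Predictions")

# Points on this diagonal are predicted perfectly.
limits = [y_test.min(), y_test.max()]
ax.plot(limits, limits, linestyle="--", color="gray",
        label="Ideal (predicted = actual)")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.set_title("Actual vs Predicted")
ax.legend()
```

The closer the points sit to the diagonal, the better the predictions match the actual values.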

## Saving the Model

Now, we need to save our trained model for future use. We will pickle the model using the `joblib` package.

```python
def save_model(model):
    # Passing a path lets joblib open and close the file itself.
    joblib.dump(model, 'model.jlib')

save_model(model)
```

Now, we have saved our model. Let's try out the saved model by loading it as `saved_model` and then making predictions with it.

```python
saved_model = joblib.load('model.jlib')
```

As we can see above, the saved model gives the same output as the original model.
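For a self-contained check of this round trip, here is a sketch that trains a model on made-up data, saves it to a temporary directory (my own choice, to avoid overwriting any real file), loads it back and compares the predictions:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on the line y = 2.5 * x + 40.
model = LinearRegression().fit(
    np.array([[10], [20], [30], [40]]),
    np.array([65.0, 90.0, 115.0, 140.0]),
)

# Save to a temporary directory, then load the model back.
path = os.path.join(tempfile.mkdtemp(), "model.jlib")
joblib.dump(model, path)
saved_model = joblib.load(path)

sample = np.array([[50]])
print(model.predict(sample), saved_model.predict(sample))
```

Both predictions should be identical, since the loaded object carries the same fitted coefficient and intercept.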

## Final Thoughts

In this blog, I demonstrated how to build a Linear Regression model to predict sustainability in sports. We used **Pandas** and **Matplotlib** to analyze the data and then used **Scikit-learn** to train the model. After Model Evaluation, the last step is Model Deployment. I will demonstrate deploying ML models on **Hugging Face Spaces** using the **Gradio** framework in a future blog. I have already deployed this model, so if you want to try it out, you may check out the link below.

Deployed Model - https://jishnupsamal-sports-sustainability.hf.space