Hey reader, hope you are doing well.
In the last post we read about how to minimize our cost function using the gradient descent algorithm.
In this post we are going to discuss the implementation of Linear Regression in Python using scikit-learn.
So let’s get started.
Assumptions for Linear Regression
Linear Regression does not work for every dataset. We therefore make a few assumptions about a dataset so that linear regression performs well on it.
- The core premise of multiple linear regression is the existence of a linear relationship between the dependent (outcome) variable and the independent variables.
- The analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed.
- The independent variables should not be too highly correlated with each other. High correlation among them is known as multicollinearity.
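To make the idea of residuals concrete, here is a minimal sketch that fits a straight line with NumPy and computes observed minus predicted values. The numbers are made up for illustration and are not taken from the Abalone data.

```python
import numpy as np

# Made-up data that is roughly linear (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, 1)

# Residuals are the observed values minus the predicted values.
y_hat = slope * x + intercept
residuals = y - y_hat

print(residuals)
```

A histogram or Q-Q plot of `residuals` is one common way to judge whether they look roughly normal.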
Linear Regression Implementation
Linear regression is a statistical model which estimates the linear relationship between the dependent (output) variable and the independent (feature) variables.
The dataset I have used for linear regression is the Abalone dataset.
The dataset consists of 10 columns in total, of which the last one is the dependent variable.
The basic steps for implementing any algorithm are as follows:
- Visualize the data and handle missing values, duplicates and outliers.
- Select features.
- Train the model.
- Check the accuracy of the model using different metrics.
Load the Dataset
We first import the libraries that we need to load the dataset and visualize it.
Visualize the data
train_df.head()
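A minimal sketch of the loading step, assuming pandas is used. The inline CSV text and its column names are stand-ins so the example runs on its own; in the notebook you would pass a file path (for example a hypothetical "train.csv") to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for the real training file (assumed columns).
csv_text = """id,Sex,Length,Rings
0,F,0.55,11
1,M,0.63,11
2,I,0.16,6
"""

# pd.read_csv accepts a path or any file-like object.
train_df = pd.read_csv(io.StringIO(csv_text))

print(train_df.columns.tolist())
```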
train_df.shape
train_df.info()
The first line returns the first 5 rows of the dataset. The second line gives the dimensions of the dataset, and the third line gives information about its features.
To get summary statistics of the dataset, we use the following code:
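A minimal sketch of this step, assuming pandas' `describe()` is the call in question. The small frame and its column names are stand-ins, not the real data. It also shows one way to count missing values and duplicate rows.

```python
import pandas as pd

# Small stand-in frame (assumed column names, made-up values).
df = pd.DataFrame({
    "Length": [0.2, 0.4, 0.6, 0.8],
    "Rings": [6, 10, 11, 9],
})

# Count, mean, std, min, quartiles and max for each numeric column.
stats = df.describe()
print(stats)

# Number of missing values per column and number of duplicated rows.
missing = df.isnull().sum()
n_duplicates = df.duplicated().sum()
print(missing)
print(n_duplicates)
```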
So now we have enough basic information about our dataset (no missing values, the data type of each feature, duplicates and statistics).
You may have noticed that our dataset contains a categorical feature, and machine learning algorithms generally expect numerical data. So let’s convert the categorical data into numerical data.
(We will study handling categorical data in later posts.)
In this dataset only the gender column has the object type. We will use LabelEncoder to convert this categorical feature into a numerical one.
(We have a train as well as a test dataframe here because this notebook was made for a Kaggle competition.)
Here we import LabelEncoder from sklearn’s preprocessing module and create an instance of it. Finally, we fit the encoder on the data and transform the data into a form that is more suitable for the model, all in a single step.
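The encoding step can be sketched as follows. The column name "Sex" and the category values are assumptions for illustration. Reusing the fitted encoder on the test frame with `transform` is a choice made here so that both frames share the same mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frames; "Sex" is an assumed name for the gender column.
train_df = pd.DataFrame({"Sex": ["F", "M", "I", "M"]})
test_df = pd.DataFrame({"Sex": ["I", "F"]})

le = LabelEncoder()

# fit_transform learns the categories and encodes them in one step.
train_df["Sex"] = le.fit_transform(train_df["Sex"])

# transform reuses the mapping learned on the training data.
test_df["Sex"] = le.transform(test_df["Sex"])

print(list(le.classes_))
print(train_df["Sex"].tolist())
```

LabelEncoder assigns integers in sorted order of the categories, so the mapping here is F to 0, I to 1 and M to 2.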
Now we will check for outliers in the dataset. For this I am using boxplots and IQR detection.
Boxplots are a very good way to detect outliers. In this dataset we have a significant number of outliers, so removing them would shrink the data considerably. We will therefore retain the outliers here.
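A minimal sketch of IQR-based outlier detection on a single column, assuming the common 1.5 × IQR rule. The column name and values are made up for illustration.

```python
import pandas as pd

# Stand-in column with one obviously extreme value (made-up data).
df = pd.DataFrame({"Height": [0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 1.13]})

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = df["Height"].quantile(0.25)
q3 = df["Height"].quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df["Height"] < lower) | (df["Height"] > upper)]

print(outliers)
```

These are the same fences a boxplot uses for its whiskers by default, so the flagged rows correspond to the points drawn beyond the whiskers.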
Finally, we will check the dataset for linearity and collinearity. I will use correlation and a heatmap for this purpose.
Here the data for the heatmap is the matrix returned by .corr(), which gives the correlation between every pair of variables.
The correlation values show a moderate correlation between the independent and dependent variables, which suggests that the dependent variable is roughly linearly related to the independent variables. We could use scatter plots and statistical tests to check linearity and the other assumptions, but in this post we will stick to the implementation only.
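A minimal sketch of computing the correlation matrix with pandas. The frame and its column names are stand-ins. Only the matrix is computed here; the plotting call is mentioned in a comment so the example does not depend on a plotting library.

```python
import pandas as pd

# Stand-in numeric frame (assumed column names, made-up values).
df = pd.DataFrame({
    "Length": [0.2, 0.4, 0.6, 0.8],
    "Diameter": [0.15, 0.31, 0.44, 0.60],
    "Rings": [6, 10, 11, 9],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)

# With seaborn installed, sns.heatmap(corr, annot=True) would draw this matrix.
```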
Splitting the data into dependent and independent set
Here we put all the important independent features in X_train, the dependent feature in y_train, and the test data in X_test.
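The split can be sketched as follows. The column names, the target name "Rings" and the choice of feature columns are assumptions for illustration; the notebook may select a different set.

```python
import pandas as pd

# Stand-in frames; the test frame has no target column.
train_df = pd.DataFrame({
    "id": [0, 1, 2],
    "Sex": [0, 2, 1],
    "Length": [0.55, 0.63, 0.16],
    "Rings": [11, 11, 6],
})
test_df = pd.DataFrame({
    "id": [3, 4],
    "Sex": [1, 0],
    "Length": [0.40, 0.58],
})

# Assumed feature list: everything except the id and the target.
feature_cols = ["Sex", "Length"]

X_train = train_df[feature_cols]   # independent features
y_train = train_df["Rings"]        # dependent feature
X_test = test_df[feature_cols]     # same columns, no target

print(X_train.shape, y_train.shape, X_test.shape)
```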
Generate Model
Here we import LinearRegression from sklearn’s linear_model module, create an instance, fit it on the training data and then use it to predict the output for the test data.
The last step is to check the accuracy of the model using metrics such as the R2 score, MSE and RMSE. We will look at these in later posts; here I have provided just the implementation.
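A minimal sketch of these steps on made-up data that follows y = 2x + 1 exactly, so the fitted line is easy to verify. The arrays are stand-ins for X_train, y_train and X_test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
X_test = np.array([[5.0], [6.0]])

# Create the model, fit it on the training data, then predict.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(model.coef_, model.intercept_)
print(y_pred)
```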
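A minimal sketch of computing these metrics with sklearn.metrics. The two arrays are made-up stand-ins for true and predicted values; in practice they would come from data where the true target is known, such as a held-out part of the training set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values (illustration only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# R2: share of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)

# MSE: mean of squared errors; RMSE: its square root, in the target's units.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(r2, mse, rmse)
```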
I hope you have understood this post.
You can see the notebook here:
[https://www.kaggle.com/code/nehagupta09/regression-with-abalone-dataset]
For more, please follow me.
Thank you