Introduction:
In the vast landscape of machine learning, understanding the basics is crucial, and linear regression is an excellent starting point. In this blog post, we'll break linear regression down step by step. But we won't stop at theory: we'll also code linear regression from scratch, so you can understand it in depth.
Step 1: Understanding the Basics
At its core, linear regression involves predicting an outcome based on one or more input variables. Imagine trying to predict the score a student might achieve based on the number of hours they study – that's where linear regression comes in.
Step 2: The Equation
Let's start with the equation of a straight line:
y = mx + c
Here:
- m is the slope (gradient) of the line
- x is the input value (the data point's coordinate)
- c is the y-intercept (where the line crosses the y-axis)
Here's how it translates into our student example:
Score = Study Hours * Study Efficiency + Baseline Score
Here, the student's score (Score) depends on the number of hours they study (Study Hours) and how efficiently they study (Study Efficiency). Study Hours plays the role of x, the slope m is the Study Efficiency, which tells us how much the score changes for each additional hour of study, and the y-intercept c is the Baseline Score achieved with zero study hours.
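For example, with a hypothetical study efficiency of 5 points per hour and a baseline score of 40, a student who studies 6 hours would be predicted to score 5 × 6 + 40 = 70.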
So, in essence, the equation is a tool that helps us predict an outcome based on the relationship between variables. It's the foundation of our journey into understanding and utilizing linear regression in the world of machine learning.
Step 3: Training the Model
Training the model involves finding the optimal values for m and c. The key method we employ is the "least squares" approach, which works towards minimizing the difference between our predicted and actual values.
Least Squares: Getting to the Core
Least squares is pretty straightforward. It's a method that aims to minimize the sum of the squared differences between our predicted values and the actual values. Imagine adjusting our parameters m and c so that our predicted line fits snugly through our data points, minimizing the gaps between them.
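To make that concrete, one common way to write it down: if our data points are (x1, y1), ..., (xn, yn), least squares looks for the m and c that minimize the sum of squared errors
SSE = Σᵢ (yᵢ − (m·xᵢ + c))²
Each term measures how far the line's prediction m·xᵢ + c lands from the actual value yᵢ, and squaring keeps every gap positive so errors can't cancel each other out.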
Introduction to Gradient Descent
In our quest for optimal values, we introduce another concept called "Gradient Descent." This is a technique that helps us iteratively adjust our parameters to reduce the difference between our predictions and the real values. Think of it as a step-by-step process, gradually refining our predictions.
But what exactly is Gradient Descent?
In simple terms, it's like finding the best path down a hill. We're trying to adjust m and c in the direction that minimizes the difference between our predictions and the actual outcomes. It's a practical approach to fine-tuning our model.
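As a small preview of what the code below will do (lr is the learning rate, i.e. the size of each downhill step):
m ← m − lr · (gradient of the error with respect to m)
c ← c − lr · (gradient of the error with respect to c)
Each iteration nudges m and c a little in the direction that reduces the error.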
The world of Gradient Descent is vast, and we'll explore its details in a future blog post. This method plays a crucial role in optimizing our model, and we'll dive deeper into its mechanics.
For those curious about Gradient Descent right away, you can check out this amazing blog [link]. Otherwise, stay tuned as we continue our journey through the basics of linear regression.
Step 4: Evaluation
Having trained our model, it's time to assess its performance. The metric we'll employ for this task is the Mean Squared Error (MSE), a reliable measure that quantifies the average squared difference between our predicted values and the actual outcomes.
Why MSE? While there are various methods to measure error, MSE is particularly favored for its ability to penalize larger errors more significantly. This makes it a suitable choice when we want to prioritize minimizing the impact of substantial prediction deviations.
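For reference, with n test points, actual values yᵢ, and predictions ŷᵢ:
MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²
Because every error is squared before averaging, an error of 10 contributes 100 while an error of 1 contributes only 1, which is exactly the penalizing behaviour described above.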
Step 5: Visualizing the Model
To gain insights, we'll create a scatter plot with our regression line to visualize the relationship between the variables.
Step 6: Real-world Applications
Linear regression finds applications in predicting housing prices, stock values, and much more. Its simplicity makes it a powerful tool for understanding and predicting real-world phenomena.
Step 7: Coding Linear Regression from Scratch
Now, let's transition from theory to practice. We'll code a simple linear regression model in Python, and then evaluate its performance on unseen data.
Step 7.1 Importing Libraries:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
- numpy for matrix operations.
- datasets from sklearn for generating a regression dataset.
- train_test_split from sklearn for splitting the dataset into training and testing sets.
- matplotlib.pyplot for data visualization.
Step 7.2 Define Linear Regression Class:
class Linear_Regression:
    def __init__(self, lr=0.1, n_iters=100):
        self.weights = None
        self.bias = None
        self.lr = lr
        self.n_iters = n_iters
Here we define a class Linear_Regression to encapsulate the linear regression functionality. Its initialization method sets a default learning rate (lr) and number of iterations (n_iters). weights is initialized to None because each feature gets its own weight, i.e. the number of weights equals the number of features, whereas there is only a single bias.
So, if we have n features, the equation of our line looks like this:
y = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn
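To make the shapes concrete, here is a minimal sketch (with made-up numbers, not part of the class above) of how a weight vector and a single bias combine with a feature matrix:

import numpy as np

X = np.array([[1.0, 2.0],        # 3 samples, 2 features
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -1.0])  # one weight per feature
bias = 2.0                       # one bias shared by all samples

# Vectorized form of y = θ0 + θ1·x1 + θ2·x2: one prediction per sample
y_pred = np.dot(X, weights) + bias
print(y_pred.shape)              # (3,)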
Step 7.3 Fit Method:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.n_iters):
            '''
            Here, fs are features:
                  f1   f2   f3   f4   f5
            X = [x11, x12, x13, x14, x15]    weights = [w1]    bias = bias
                [x21, x22, x23, x24, x25]              [w2]
                [x31, x32, x33, x34, x35]              [w3]
                [x41, x42, x43, x44, x45]              [w4]
                [x51, x52, x53, x54, x55]              [w5]
            '''
            # Predictions with the current parameters
            y_pred = np.dot(X, self.weights) + self.bias
            # Gradients of the error with respect to the weights and the bias
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)
            # Gradient descent update
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db
The fit method trains the linear regression model using gradient descent. It initializes the weights and bias to zero, then iteratively updates them to minimize the error.
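For completeness, the gradients computed as dw and db above are:
dw = (1/n) · Xᵀ(ŷ − y)
db = (1/n) · Σᵢ (ŷᵢ − yᵢ)
i.e. the derivatives of the mean squared error with respect to the weights and the bias (the conventional factor of 2 from differentiating the square is simply absorbed into the learning rate).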
Step 7.4 Predict Method
    def predict(self, X_test):
        predicted = np.dot(X_test, self.weights) + self.bias
        return predicted
The predict method takes test data and uses the trained weights and bias to make predictions.
Step 7.5 Main Execution Block
def mse(y1, y2):
    return np.mean((y2 - y1)**2)

if __name__ == "__main__":
    # Data Generation and Splitting
    X, y = datasets.make_regression(n_samples=300, n_features=1, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

    # Model Initialization and Training
    model = Linear_Regression()
    model.fit(X_train, y_train)

    # Model Prediction and Evaluation
    preds = model.predict(X_test)
    print("Mean Squared Error:", mse(preds, y_test))

    # Visualization
    fig = plt.figure(figsize=(8, 6))
    predictions = model.predict(X)
    cmap = plt.get_cmap('viridis')
    plt.scatter(X_train, y_train, color=cmap(0.9), s=10, label='Training Data')
    plt.scatter(X_test, y_test, color=cmap(0.5), s=10, label='Test Data')
    plt.plot(X, predictions, color="black", linewidth=2, label="Best Fit Line")
    plt.legend()
    plt.show()
Here we:
- Generate synthetic regression data and split it into training and testing sets.
- Initialize the linear regression model.
- Train the model on the training data.
- Make predictions on the test data and evaluate the model using Mean Squared Error.
- Finally, visualize the training and testing data along with the regression line.
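As a quick sanity check (an extra step, not part of the walkthrough above), you can compare against scikit-learn's built-in LinearRegression inside the same main block; the two MSE values should be in the same ballpark:

from sklearn.linear_model import LinearRegression

sk_model = LinearRegression()
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)
print("sklearn MSE:", mse(sk_preds, y_test))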
Step 7.6 Best Fit Line:
Running the script prints the Mean Squared Error and shows a scatter plot of the training and test data with the best fit line drawn through them.
Points to Note: Common Misconceptions
Misconception 1: Linear Regression Assumes Linearity in All Cases
While linear regression assumes a linear relationship between variables, it doesn't mean that the variables themselves must be linear. Transformations can be applied to make the relationship linear, even if the original variables are not.
Explanation: This misconception often arises from the name "linear regression." People may assume that the technique is only applicable when relationships are strictly linear. In reality, it's about the linearity in the coefficients, not necessarily the raw variables.
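As an illustration (a hypothetical example that reuses the Linear_Regression class and the numpy import from earlier, not data from this post), a relationship like y ≈ 4x² + 2x is not linear in x, but it is linear in the coefficients once we add x² as an extra feature:

# Hypothetical data with a quadratic trend
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y_quad = 4 * x[:, 0]**2 + 2 * x[:, 0] + rng.normal(0, 1, 200)

# Not linear in x, but linear in the coefficients once x^2 is added as a feature
X_poly = np.hstack([x, x**2])
quad_model = Linear_Regression(lr=0.05, n_iters=2000)
quad_model.fit(X_poly, y_quad)

The same gradient descent machinery fits the curve, because from the model's point of view x and x² are just two input columns.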
Misconception 2: Outliers Always Negatively Affect Linear Regression
While outliers can influence linear regression models, they don't always have a negative impact. Sometimes outliers contain valuable information or highlight specific patterns in the data.
Explanation: Outliers may disproportionately affect the model if they have a substantial impact on the overall pattern. However, not all outliers are detrimental; they can represent unique scenarios or anomalies that are important to capture.
When to Apply Linear Regression: Key Points to Consider
Linearity: Linear regression is most effective when there is a linear relationship between the input and output variables. Visualizing the data through scatter plots can help identify linearity.
Homoscedasticity: The variance of the errors should be consistent across all levels of the independent variable. If the spread of errors widens or narrows systematically, it indicates heteroscedasticity, which may violate linear regression assumptions.
Independence: Observations should be independent of each other. For example, in time-series data, consecutive observations may be correlated, violating the independence assumption.
Normality of Residuals: The residuals (the differences between actual and predicted values) should be approximately normally distributed. This assumption is important for statistical inference. A quick visual check for the last two points is sketched below.
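As a rough sketch (reusing model, X_test, and y_test from the example above), plotting the residuals against the predictions lets you eyeball both homoscedasticity (the vertical spread should look roughly constant) and normality (the histogram should look roughly bell-shaped):

residuals = y_test - model.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.predict(X_test), residuals, s=10)
ax1.axhline(0, color="black", linewidth=1)
ax1.set_xlabel("Predicted value")
ax1.set_ylabel("Residual")
ax2.hist(residuals, bins=20)
ax2.set_xlabel("Residual")
plt.show()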