A Step-by-Step Guide to Linear Regression in Machine Learning

Introduction:

In the vast landscape of machine learning, understanding the basics is crucial, and linear regression is an excellent starting point. In this blog post, we'll learn about linear regression by breaking the concepts down step by step. But we won't stop at theory; we'll also code linear regression from scratch, so you can understand it in depth.

Step 1: Understanding the Basics

At its core, linear regression involves predicting an outcome based on one or more input variables. Imagine trying to predict the score a student might achieve based on the number of hours they study – that's where linear regression comes in.

Step 2: The Equation

Let's start with the equation of a straight line.

y = mx + c

  • Here, m is the slope (gradient) of the line

  • x is the input variable (our data point's value on the x-axis)

  • c is the y-intercept (where the line crosses the y-axis)

Here's how it translates into our student example:

Score = Study Hours * Study Efficiency + Baseline Score

Here, the student's score (Score) depends on two things: the number of hours they study (Study Hours) and how efficiently they study (Study Efficiency). The slope m corresponds to Study Efficiency: it tells us how much the score changes for each additional hour of study. The y-intercept c is the Baseline Score, the score achieved with zero study hours.

So, in essence, the equation is a tool that helps us predict an outcome based on the relationship between variables. It's the foundation of our journey into understanding and utilizing linear regression in the world of machine learning.
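
To make this concrete, here's a quick worked example in Python. The efficiency of 8 points per hour and the baseline of 40 are made-up numbers, purely for illustration:

# Hypothetical numbers for the student example (not real data)
study_efficiency = 8   # m: extra points per hour of study
baseline_score = 40    # c: score with zero hours of study
study_hours = 5        # x: input variable

score = study_efficiency * study_hours + baseline_score  # y = mx + c
print(score)  # 80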

Step 3: Training the Model

Training the model involves finding the optimal values for m and c. The key method we employ is the "least squares" approach, which works towards minimizing the difference between our predicted and actual values.

Least Squares: Getting to the Core

Least squares is pretty straightforward. It's a method that aims to minimize the sum of the squared differences between our predicted values and the actual values. Imagine adjusting our parameters m and c so that our predicted line fits snugly through our data points, minimizing the gaps between them.
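
As a rough sketch of what that quantity looks like in code (the data and the candidate line below are invented for illustration):

import numpy as np

# Invented example: study hours (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5])
y = np.array([52, 55, 61, 70, 74])

# A candidate line y = mx + c; least squares searches for the m and c
# that make the sum below as small as possible
m, c = 6, 45
y_pred = m * x + c

sum_squared_error = np.sum((y - y_pred) ** 2)
print(sum_squared_error)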

Introduction to Gradient Descent

In our quest for optimal values, we introduce another concept called "Gradient Descent." This is a technique that helps us iteratively adjust our parameters to reduce the difference between our predictions and the real values. Think of it as a step-by-step process, gradually refining our predictions.

But what exactly is Gradient Descent?

In simple terms, it's like finding the best path down a hill. We're trying to adjust m and c in the direction that minimizes the difference between our predictions and the actual outcomes. It's a practical approach to fine-tuning our model.
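
As a minimal sketch of a single such step (the data, starting values, and learning rate here are placeholders; the full vectorized version appears later in Step 7.3):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

m, c = 0.0, 0.0   # start with an arbitrary line
lr = 0.01         # learning rate: how big each downhill step is

y_pred = m * x + c
# Gradients of the mean squared error with respect to m and c
dm = (2 / len(x)) * np.sum((y_pred - y) * x)
dc = (2 / len(x)) * np.sum(y_pred - y)

# Move m and c a small step against their gradients
m -= lr * dm
c -= lr * dc

Repeating this update many times gradually walks m and c toward the values that minimize the error.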

The world of Gradient Descent is vast, and we'll explore its details in a future blog post. This method plays a crucial role in optimizing our model, and we'll dive deeper into its mechanics.

For those curious about Gradient Descent right away, you can check out this amazing blog [link]. Otherwise, stay tuned as we continue our journey through the basics of linear regression.

Step 4: Evaluation

Having trained our model, it's time to assess its performance. The metric we'll employ for this task is the Mean Squared Error (MSE), a reliable measure that quantifies the average squared difference between our predicted values and the actual outcomes.

Why MSE? While there are various methods to measure error, MSE is particularly favored for its ability to penalize larger errors more significantly. This makes it a suitable choice when we want to prioritize minimizing the impact of substantial prediction deviations.
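
In code, MSE is just the mean of the squared differences; for example, with made-up numbers:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.5, 8.0])

mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)  # 0.5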

Step 5: Visualizing the Model

To gain insights, we'll create a scatter plot with our regression line to visualize the relationship between the variables.

Step 6: Real-world Applications

Linear regression finds applications in predicting housing prices, stock values, and much more. Its simplicity makes it a powerful tool for understanding and predicting real-world phenomena.

Step 7: Coding Linear Regression from Scratch

Now, let's transition from theory to practice. We'll code a simple linear regression model in Python, and then evaluate its performance on unseen data.

Step 7.1 Importing Libraries:

import numpy as np 
from sklearn import datasets 
from sklearn.model_selection import train_test_split 
import matplotlib.pyplot as plt
  • numpy for matrix operations.

  • datasets from sklearn for generating a regression dataset.

  • train_test_split from sklearn for splitting the dataset into training and testing sets.

  • matplotlib.pyplot for data visualization.

Step 7.2 Define Linear Regression Class:

class Linear_Regression:
    def __init__(self, lr=0.1, n_iters=100):
        self.weights = None
        self.bias = None
        self.lr = lr
        self.n_iters = n_iters

Here we define a class Linear_Regression to encapsulate the linear regression functionality. Its initialization method sets a default learning rate (lr) and number of iterations (n_iters). weights is initialized to None because each feature gets its own weight, i.e. the number of weights equals the number of features, which we only know once we see the data. There is only one bias.

So, if we have n features the equation of our line will be like this:

y = θ0 + θ1X1 + θ2X2 + ... + θnXn
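
In NumPy, this n-feature equation collapses into a single dot product between the data matrix and the weight vector. The shapes below (4 samples, 3 features) are just an illustrative example:

import numpy as np

X = np.random.rand(4, 3)   # 4 samples, 3 features
weights = np.zeros(3)      # one weight per feature (θ1 ... θn)
bias = 0.0                 # the single intercept term (θ0)

y_pred = np.dot(X, weights) + bias   # shape (4,): one prediction per sample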

Step 7.3 Fit Method:

def fit(self, X, y):
    n_samples, n_features = X.shape
    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iters):
        '''
            Here,
            fs are features
                 f1  f2   f3  f4  f5  
            X = [x11,x12,x13,x14,x15]  weights = [w1]  bias = bias
                [x21,x22,x23,x24,x25]            [w2]
                [x31,x32,x33,x34,x35]            [w3]
                [x41,x42,x43,x44,x45]            [w4]
                [x51,x52,x53,x54,x55]            [w5]
            '''
        y_pred = np.dot(X, self.weights) + self.bias

        dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
        db = (1/n_samples) * np.sum(y_pred - y)

        self.weights = self.weights - self.lr * dw
        self.bias = self.bias - self.lr * db

The fit method trains the linear regression model using gradient descent. It initializes the weights and bias to zero, then iteratively updates them to minimize the error between predictions and actual values.
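
For reference, the two update lines implement the gradient of the mean squared error with respect to the weights and the bias (up to a constant factor of 2 that is effectively absorbed into the learning rate):

dw = (1/n_samples) * X.T (y_pred - y)
db = (1/n_samples) * Σ(y_pred - y)

Each iteration nudges the parameters a small step (scaled by lr) against these gradients, which is exactly the gradient descent idea from Step 3.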

Step 7.4 Predict Method

def predict(self, X_test):
    predicted = np.dot(X_test, self.weights) + self.bias
    return predicted

The predict method takes test data and uses the trained weights and bias to make predictions.

Step 7.5 Main Execution Block

def mse(y1, y2):
    return np.mean((y2 - y1)**2)

if __name__ == "__main__":
    # Data Generation and Splitting
    X, y = datasets.make_regression(n_samples=300, n_features=1, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

    # Model Initialization and Training
    model = Linear_Regression()
    model.fit(X_train, y_train)

    # Model Prediction and Evaluation
    preds = model.predict(X_test)
    print("Mean Squared Error:", mse(preds, y_test))

    # Visualization
    fig = plt.figure(figsize=(8, 6))
    predictions = model.predict(X)
    cmap = plt.get_cmap('viridis')
    plt.scatter(X_train, y_train, color=cmap(0.9), s=10, label='Training Data')
    plt.scatter(X_test, y_test, color=cmap(0.5), s=10, label='Test Data')
    plt.plot(X, predictions, color="black", linewidth=2, label="Best Fit Line")
    plt.legend()
    plt.show()
  • Here we generate synthetic regression data and split it into training and testing sets.

  • Initialize the linear regression model.

  • Train the model on the training data.

  • Make predictions on the test data and evaluate the model using Mean Squared Error.

  • Finally, visualize the training and testing data along with the regression line.

Step 7.6 Best Fit Line:

Running the script produces a scatter plot of the training and test points, with the black line from our model running through them as the best fit line.

Points to Note: Common Misconceptions

Misconception 1: Linear Regression Assumes Linearity in All Cases

While linear regression assumes a linear relationship between variables, it doesn't mean that the variables themselves must be linear. Transformations can be applied to make the relationship linear, even if the original variables are not.

Explanation: This misconception often arises from the name "linear regression." People may assume that the technique is only applicable when relationships are strictly linear. In reality, it's about the linearity in the coefficients, not necessarily the raw variables.
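
For instance, if y grows roughly exponentially with x, fitting a straight line on log(y) keeps the model linear in its coefficients. A minimal sketch with invented data (using NumPy's polyfit just to illustrate the transformation):

import numpy as np

x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)        # clearly non-linear in x

# Taking the log of y gives log(y) = log(2.0) + 0.5 * x, which is linear in x
log_y = np.log(y)
slope, intercept = np.polyfit(x, log_y, 1)
print(slope, intercept)   # roughly 0.5 and log(2.0) ≈ 0.693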

Misconception 2: Outliers Always Negatively Affect Linear Regression

While outliers can influence linear regression models, they don't always have a negative impact. Sometimes outliers contain valuable information or highlight specific patterns in the data.

Explanation: Outliers may disproportionately affect the model if they have a substantial impact on the overall pattern. However, not all outliers are detrimental; they can represent unique scenarios or anomalies that are important to capture.

When to Apply Linear Regression: Key Points to Consider

  1. Linearity: Linear regression is most effective when there is a linear relationship between the input and output variables. Visualizing the data through scatter plots can help identify linearity.

  2. Homoscedasticity: The variance of the errors should be consistent across all levels of the independent variable. If the spread of errors widens or narrows systematically, it indicates heteroscedasticity, which may violate linear regression assumptions.

  3. Independence: Observations should be independent of each other. For example, in time-series data, consecutive observations may be correlated, violating the independence assumption.

  4. Normality of Residuals: The residuals (the differences between actual and predicted values) should be approximately normally distributed. This assumption is important for statistical inference; a quick way to eyeball this is sketched below.
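
As a rough way to eyeball the last two assumptions, one can plot the residuals against the predictions and look at their distribution. The sketch below assumes the preds and y_test variables from the main block in Step 7.5:

import matplotlib.pyplot as plt

residuals = y_test - preds   # actual minus predicted values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predictions: a roughly even band around zero suggests homoscedasticity
ax1.scatter(preds, residuals, s=10)
ax1.axhline(0, color="black", linewidth=1)
ax1.set_xlabel("Predicted values")
ax1.set_ylabel("Residuals")

# Histogram of residuals: a roughly bell-shaped histogram suggests normality
ax2.hist(residuals, bins=20)
ax2.set_xlabel("Residual")

plt.show()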