
Linear Regression: A Comprehensive Guide

Linear regression is based on the equation of a straight line:

(y = mx + c)

where:

  • (y) is the dependent variable (output)
  • (x) is the independent variable (input)
  • (m) is the slope of the line
  • (c) is the y-intercept
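For instance (values chosen purely for illustration), a line with slope (m = 2) and intercept (c = 1) maps the input (x = 3) to the output (y = 2 \cdot 3 + 1 = 7).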

1. Derivation of the Equation

A straight line represents a constant rate of change. Given two points ((x_1, y_1)) and ((x_2, y_2)), the slope (m) is calculated as:

(m = \frac{y_2 - y_1}{x_2 - x_1})

Using this slope, the equation of the line passing through ((x_1, y_1)) is:

(y - y_1 = m (x - x_1))

Rearranging gives the familiar form (y = mx + c), where the y-intercept is (c = y_1 - m x_1).

The slope (m) determines how much (y) changes for a unit increase in (x). If (m) is positive, the line slopes upward; if (m) is negative, it slopes downward.
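As a minimal sketch of the two-point construction above (the two points are purely illustrative):

# Build a line y = mx + c from two illustrative points
x1, y1 = 1.0, 2.0
x2, y2 = 3.0, 8.0

m = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
c = y1 - m * x1             # intercept, from rearranging y - y1 = m(x - x1)

print(f"y = {m}x + {c}")    # y = 3.0x + -1.0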

2. Calculation of the Slope

For a dataset with multiple points ((x_i, y_i)), the slope in linear regression is calculated as:

(m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2})

where (\bar{x}) and (\bar{y}) are the means of (x) and (y) respectively.
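As a worked example, applying this formula to the small dataset used in the Python implementation in Section 6 ((x = [1, 2, 3, 4, 5]), (y = [2, 4, 5, 4, 5])), where (\bar{x} = 3) and (\bar{y} = 4):

(m = \frac{(-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1)}{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2} = \frac{6}{10} = 0.6)

This matches the slope printed by the code in Section 6 for the same data.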

3. What is a Linear Equation?

A linear equation represents a straight-line relationship between variables. It satisfies the properties of additivity and homogeneity, meaning it models a relationship whose rate of change is constant.
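Concretely, for the proportional part of the relationship, (f(x) = mx), these two properties read (the intercept (c) only shifts the line and does not affect them):

  • Additivity: (f(a + b) = m(a + b) = ma + mb = f(a) + f(b))
  • Homogeneity: (f(ka) = m(ka) = k(ma) = k \, f(a))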

4. Formulating Linear Regression

Linear regression finds the best-fitting line through a dataset by minimizing the error between predicted and actual values. The equation is:

(\hat{y} = \beta_0 + \beta_1 x)

where:

  • (\hat{y}) is the predicted value
  • (\beta_0) is the intercept
  • (\beta_1) is the slope

Derivation Using Least Squares Method

We minimize the sum of squared errors:

(J(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2)

Taking partial derivatives with respect to (\beta_0) and (\beta_1), setting them to zero, and solving the resulting equations gives:

(\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2})

(\beta_0 = \bar{y} - \beta_1 \bar{x})
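These closed-form estimates are easy to sanity-check numerically. The sketch below (the data values are illustrative) computes (\beta_1) and (\beta_0) directly and compares them with NumPy's np.polyfit, which fits the same least-squares line:

import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form least-squares estimates
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# np.polyfit returns [slope, intercept] for a degree-1 fit
slope, intercept = np.polyfit(x, y, 1)

print(beta_1, beta_0)    # 0.6 2.2
print(slope, intercept)  # approximately the same values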

5. Error Metrics in Linear Regression

Error metrics help evaluate model performance:

Mean Squared Error (MSE)

MSE measures the average squared differences between actual and predicted values:

(MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2)

Example Calculation: If actual values are ([3, -0.5, 2, 7]) and predicted values are ([2.5, 0.0, 2, 8]), then:

(MSE = \frac{(3-2.5)^2 + (-0.5-0.0)^2 + (2-2)^2 + (7-8)^2}{4} = 0.375)
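A minimal NumPy check of this calculation (same example values):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # average of the squared errors
print(mse)  # 0.375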

Root Mean Squared Error (RMSE)

RMSE is the square root of MSE; it expresses the error in the same units as the target variable:

(RMSE = \sqrt{MSE})

Example:

(RMSE = \sqrt{0.375} \approx 0.612)
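Continuing the same example in code:

import numpy as np

mse = 0.375
rmse = np.sqrt(mse)    # square root restores the original units
print(round(rmse, 3))  # 0.612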

Mean Absolute Error (MAE)

MAE measures the average absolute differences:

(MAE = \frac{1}{n} \sum |y_i - \hat{y}_i|)

Example:

(MAE = \frac{|3-2.5| + |-0.5-0.0| + |2-2| + |7-8|}{4} = 0.5)
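And the corresponding check for MAE:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))  # average of the absolute errors
print(mae)  # 0.5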

R-Squared (R²)

R-squared measures how well the model explains the variance in (y):

(R^2 = 1 - \frac{SS_{res}}{SS_{tot}})

where:

  • (SS_{res} = \sum (y_i – \hat{y}_i)^2) (residual sum of squares)
  • (SS_{tot} = \sum (y_i – \bar{y})^2) (total sum of squares)

Example Calculation: If (SS_{res} = 10) and (SS_{tot} = 20):

(R^2 = 1 - \frac{10}{20} = 0.5)
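The same quantity can also be computed directly from residuals. A short sketch, reusing the example values from the error metrics above:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(round(r2, 3))  # about 0.949 for these values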

Weighted R-Squared

Weighted R-squared accounts for different importance levels in observations:

(R^2_w = 1 - \frac{\sum w_i (y_i - \hat{y}_i)^2}{\sum w_i (y_i - \bar{y})^2})

where (w_i) are weights. If higher weights are given to important points, they influence the score more.
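A minimal sketch of this weighted variant, using hypothetical weights that up-weight the last observation. It implements the formula above literally, with the unweighted mean (\bar{y}); library implementations of weighted scores may instead use a weighted mean:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
w = np.array([1.0, 1.0, 1.0, 3.0])  # hypothetical importance weights

ss_res_w = np.sum(w * (y_true - y_pred) ** 2)
ss_tot_w = np.sum(w * (y_true - y_true.mean()) ** 2)
r2_w = 1 - ss_res_w / ss_tot_w

print(round(r2_w, 3))  # about 0.945 with these weights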

6. Python Implementation

Here is a Python implementation of simple linear regression using NumPy and Matplotlib:

"""
Linear Regression Implementation
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def linear_regression(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """
    Compute the slope and intercept using the least squares method.

    Args:
        x (np.ndarray): Independent variable
        y (np.ndarray): Dependent variable
    
    Returns:
        tuple[float, float]: Slope and intercept
    """
    x_mean, y_mean = np.mean(x), np.mean(y)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

def plot_regression(x: np.ndarray, y: np.ndarray, slope: float, intercept: float) -> None:
    """
    Plot the linear regression line.

    Args:
        x (np.ndarray): Independent variable
        y (np.ndarray): Dependent variable
        slope (float): Slope of the line
        intercept (float): Intercept of the line
    """
    plt.scatter(x, y, color='blue', label='Data Points')
    plt.plot(x, slope * x + intercept, color='red', label='Regression Line')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.legend()
    plt.show()

if __name__ == "__main__":
    x_data = np.array([1, 2, 3, 4, 5])
    y_data = np.array([2, 4, 5, 4, 5])
    
    m, c = linear_regression(x_data, y_data)
    print(f"Slope: {m}, Intercept: {c}")

    # Evaluate the fit with the MSE metric from Section 5
    y_pred = m * x_data + c
    print(f"MSE: {mean_squared_error(y_data, y_pred)}")

    plot_regression(x_data, y_data, m, c)

7. Conclusion

Linear regression is a fundamental technique in machine learning and statistics. Understanding its derivations, error metrics, and implementation helps in building better predictive models.

Full Code Available Here (Add actual link)
