The X, Y, M, and C of Linear Regression

Understanding Linear Regression

Linear regression is a fundamental statistical method used in machine learning and data analysis to model relationships between variables. This post explains the concept in detail, with derivations, error metrics, and Python code.

1. Understanding the Equation: y = mx + c

A linear equation in its simplest form is:
y = mx + c
where:

  • y is the dependent variable (output or prediction).
  • x is the independent variable (input feature).
  • m is the slope (rate of change of y with respect to x).
  • c is the y-intercept (value of y when x = 0).

Derivation

A straight-line equation can be derived from the point-slope form of a line:
y - y_1 = m(x - x_1)
Solving for y:
y = m x + (y_1 - m x_1)
Here, c = y_1 - m x_1, making it the intercept.
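
As a quick numerical check of c = y_1 - m x_1, the snippet below plugs in a made-up point and slope (the values are illustrative, not taken from the post's data):

# Illustrative values: a known point on the line and an assumed slope
x1, y1 = 3.0, 11.0
m = 2.0

c = y1 - m * x1      # intercept from the point-slope rearrangement
print(f"Intercept: {c}")           # 5.0
print(f"Check: y = {m * x1 + c}")  # reproduces y1 = 11.0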

2. Understanding the Slope m

The slope m measures the steepness of the line and is calculated as:
m = \frac{y_2 - y_1}{x_2 - x_1}
This formula finds the rate of change between two points (x₁, y₁) and (x₂, y₂).

Python Code for Slope Calculation

import numpy as np

def calculate_slope(x1: float, y1: float, x2: float, y2: float) -> float:
    """Calculate the slope between two points."""
    return (y2 - y1) / (x2 - x1)

# Example usage
m = calculate_slope(1, 2, 3, 6)
print(f"Slope: {m}")

3. What is a Linear Equation?

A linear equation represents a straight line on a graph and has the general form:
ax + by + c = 0
For regression, we simplify this to y = mx + c.
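
As a small illustrative sketch (the helper below is hypothetical, not part of the post's original code), a general-form line can be converted to slope-intercept form using m = -a/b and intercept = -c/b, provided b is non-zero:

def general_to_slope_intercept(a: float, b: float, c: float) -> tuple[float, float]:
    """Convert ax + by + c = 0 to y = mx + c form (requires b != 0)."""
    if b == 0:
        raise ValueError("A vertical line (b = 0) has no slope-intercept form.")
    m = -a / b          # slope
    intercept = -c / b  # y-intercept
    return m, intercept

# Example usage: 2x - y + 3 = 0  ->  y = 2x + 3
print(general_to_slope_intercept(2, -1, 3))  # (2.0, 3.0)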

4. Formulating Linear Regression

Linear regression aims to find the best-fitting line for a dataset. We estimate m and c using the Least Squares Method:

  • The best-fit line minimizes the sum of squared residuals.
  • The slope and intercept formulas are:
    m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
    c = \bar{y} - m \bar{x}
    where \bar{x} and \bar{y} are the mean values of x and y.

Python Code for Linear Regression Calculation

def linear_regression(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Compute the slope and intercept for linear regression."""
    m = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x))**2)
    c = np.mean(y) - m * np.mean(x)
    return m, c

# Example usage
x_vals = np.array([1, 2, 3, 4, 5])
y_vals = np.array([2, 3, 5, 7, 11])
m, c = linear_regression(x_vals, y_vals)
print(f"Slope: {m}, Intercept: {c}")

5. Error Metrics in Linear Regression

1. Mean Squared Error (MSE)

Formula:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Variables:

  • y_i = Actual value
  • \hat{y}_i = Predicted value
  • n = Number of observations

Interpretation:

  • A lower MSE value indicates a better fit.
  • A high MSE suggests significant deviations between actual and predicted values.

Pros:

  • Penalizes large errors more than small ones due to squaring.
  • Differentiable, making it suitable for optimization algorithms.

Cons:

  • Sensitive to outliers.
  • Hard to interpret since it’s in squared units of the dependent variable.
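
To see how squaring penalizes large errors, the short sketch below (with made-up numbers) compares two prediction sets that have the same total absolute error but very different MSEs:

# Two prediction sets with the same total absolute error (4) against the same targets
y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_small_errors = y_true + np.array([1.0, 1.0, 1.0, 1.0])   # four errors of 1
pred_one_big_error = y_true + np.array([4.0, 0.0, 0.0, 0.0])  # one error of 4

print(np.mean((y_true - pred_small_errors) ** 2))   # 1.0
print(np.mean((y_true - pred_one_big_error) ** 2))  # 4.0 -> the single large error dominates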

2. Mean Absolute Error (MAE)

Formula:

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Interpretation:

  • MAE represents the average absolute difference between actual and predicted values.
  • A lower MAE means better performance.
  • Unlike MSE, it treats all errors equally.

Pros:

  • Less sensitive to outliers compared to MSE.
  • Easier to interpret since it is in the same unit as the dependent variable.

Cons:

  • Not differentiable at zero, which can be problematic for some optimization methods.
  • Does not emphasize large errors.

3. Mean Absolute Percentage Error (MAPE)

Formula:

MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|

Interpretation:

  • Represents the percentage error relative to the actual values.
  • A lower MAPE means better model performance.
  • Example: A MAPE of 10% means predictions are, on average, 10% off from actual values.

Pros:

  • Provides an easy-to-understand percentage error.

Cons:

  • Undefined when actual values are zero.
  • Overly penalizes small actual values.
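
MAPE is not part of the combined error function at the end of this post, so here is a minimal NumPy sketch (it assumes no actual value is zero, per the caveat above):

def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """MAPE in percent; undefined if any actual value is zero."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Example usage with the same values as the error example later in the post
print(mean_absolute_percentage_error(np.array([2, 3, 5, 7, 11]),
                                     np.array([2.1, 3.1, 4.8, 6.9, 10.8])))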

4. Root Mean Squared Error (RMSE)

Formula:

RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Interpretation:

  • RMSE is in the same units as the dependent variable.
  • A lower RMSE indicates better performance.

Pros:

  • Easy to interpret in the original unit of the data.
  • Penalizes large errors more due to squaring.

Cons:

  • Sensitive to outliers.

5. R-Squared ( R^2 )

Formula:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

Interpretation:

  • Measures how well the independent variable(s) explain the variation in the dependent variable.
  • Typically ranges from 0 to 1 (it can be negative when the model fits worse than simply predicting the mean); for example:
  • ( R^2 = 0.8 ) means 80% of the variance in the dependent variable is explained by the independent variables.
  • ( R^2 = 0.3 ) means only 30% is explained.
  • Higher ( R^2 ) is better, but not always an indicator of good predictive power.

Pros:

  • Provides an intuitive measure of model performance.

Cons:

  • Can be artificially high with too many predictors (adjusted ( R^2 ) should be used instead).
  • Does not indicate whether the model is a good fit for forecasting.

6. Adjusted R-Squared

Formula:

Adjusted R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}

where n is the number of observations and k is the number of predictors.

Interpretation:

  • Adjusted ( R^2) accounts for the number of predictors.
  • Penalizes excessive predictors that do not improve model performance.

Pros:

  • More reliable than ( R^2) when multiple predictors are used.

Cons:

  • Still does not indicate predictive accuracy.
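
Adjusted ( R^2 ) is not included in the combined error function below, so here is a minimal sketch of the standard formula, where n is the number of observations and k the number of predictors (k = 1 for simple linear regression):

def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example usage: R^2 = 0.8 with 5 observations and 1 predictor
print(adjusted_r_squared(0.8, n=5, k=1))  # ~0.733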

7. Table: Summary of Error Metrics


Metric | Interpretation | Pros | Cons
------ | -------------- | ---- | ----
MSE | Measures the average squared difference between actual and predicted values, penalizing larger errors more heavily. | Useful for optimization due to its differentiability. | Highly sensitive to outliers.
MAE | Measures the average absolute difference between actual and predicted values, treating all deviations equally. | Easy to interpret and less sensitive to outliers than MSE. | Does not penalize larger errors more than smaller ones, which may be a drawback when large deviations are important.
RMSE | Provides an error metric in the same unit as the original data, making it more interpretable. | Penalizes larger errors more than MAE, making it useful when large deviations are critical. | Still sensitive to outliers due to squaring errors before averaging.
MAPE | Expresses the average error as a percentage of actual values, making comparisons across datasets easier. | Scale-free and intuitive for business applications. | Problematic when actual values are close to zero, leading to inflated errors.
( R^2 ) | Measures the proportion of variance in the dependent variable explained by the model. | Helps assess model fit and comparison with other models. | Can be misleading if used on non-linear data or for models with many predictors.
Adjusted ( R^2 ) | Adjusts ( R^2 ) for the number of predictors, preventing overestimation of model performance. | More reliable than ( R^2 ) in multiple regression scenarios. | Still not a direct measure of predictive accuracy.

Python Code for Error Calculation

def calculate_errors(y_true: np.ndarray, y_pred: np.ndarray) -> dict[str, float]:
    """Compute error metrics for linear regression."""
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    r2 = 1 - (np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Example usage
y_actual = np.array([2, 3, 5, 7, 11])
y_predicted = np.array([2.1, 3.1, 4.8, 6.9, 10.8])
errors = calculate_errors(y_actual, y_predicted)
print(errors)
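
As an optional cross-check (this assumes scikit-learn is installed, which the post does not otherwise require), the same metrics can be computed with sklearn.metrics:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Cross-check the hand-rolled metrics against scikit-learn
mae = mean_absolute_error(y_actual, y_predicted)
mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)
r2 = r2_score(y_actual, y_predicted)
print({"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2})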

Conclusion

Linear regression is a powerful yet simple method for predicting continuous values. Understanding its formulation, errors, and calculations is crucial for real-world applications.

Full Code Repository

Find the complete code.
