Linear Regression: A Comprehensive Guide
Linear regression is based on the equation of a straight line:
$$y = mx + c$$
where:
- $y$ is the dependent variable (output)
- $x$ is the independent variable (input)
- $m$ is the slope of the line
- $c$ is the y-intercept
Derivation of the Equation
A straight line represents a constant rate of change. Given two points $(x_1, y_1)$ and $(x_2, y_2)$, the slope $m$ is calculated as:
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
Using this slope, the equation of the line passing through $(x_1, y_1)$ is:
$$y - y_1 = m(x - x_1)$$
Rearranging, we get the familiar form $y = mx + c$, where $c$ is the y-intercept.
The slope $m$ determines how much $y$ changes for a unit increase in $x$. If $m$ is positive, the line slopes upward; if negative, it slopes downward.
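As a quick sanity check of the two-point form, take the made-up points $(1, 3)$ and $(4, 9)$:
$$m = \frac{9 - 3}{4 - 1} = 2, \qquad c = y_1 - m x_1 = 3 - 2 \cdot 1 = 1, \qquad y = 2x + 1$$
Substituting $x = 4$ back in gives $2 \cdot 4 + 1 = 9$, recovering the second point as expected.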
Calculation of the Slope
For a dataset with multiple points $(x_i, y_i)$, the slope in linear regression is calculated as:
$$m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively.
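As an illustration of this formula, here is a minimal NumPy sketch (the data points are made up; the full implementation in Section 6 performs the same computation):

import numpy as np

# Made-up sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()
# Sum of cross-deviations divided by sum of squared x-deviations
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
print(m)  # 0.6 for this data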
3. What is a Linear Equation?
A linear equation represents a straight-line relationship between variables. The underlying mapping $x \mapsto mx$ satisfies additivity and homogeneity, which is why the relationship changes at a constant rate; adding the intercept $c$ shifts the line without changing that rate.
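As a small illustration, with the intercept set to zero the mapping $f(x) = mx$ satisfies both properties:
$$f(x_1 + x_2) = m(x_1 + x_2) = f(x_1) + f(x_2), \qquad f(ax) = m(ax) = a\,f(x)$$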
4. Formulating Linear Regression
Linear regression finds the best-fitting line through a dataset by minimizing the error between predicted and actual values. The equation is:
$$\hat{y} = \beta_0 + \beta_1 x$$
where:
- $\hat{y}$ is the predicted value
- $\beta_0$ is the intercept
- $\beta_1$ is the slope
Derivation Using Least Squares Method
We minimize the sum of squared errors:
$$J(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero and solving gives:
$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
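For completeness, the two first-order conditions (the normal equations) that are set to zero to obtain these expressions are:
$$\frac{\partial J}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \frac{\partial J}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$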
5. Error Metrics in Linear Regression
Error metrics help evaluate model performance:
Mean Squared Error (MSE)
MSE measures the average squared differences between actual and predicted values:
$$MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2$$
Example Calculation: If actual values are $[3, -0.5, 2, 7]$ and predicted values are $[2.5, 0.0, 2, 8]$, then:
$$MSE = \frac{(3-2.5)^2 + (-0.5-0.0)^2 + (2-2)^2 + (7-8)^2}{4} = 0.375$$
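This calculation can be verified with a short NumPy snippet (the arrays below are simply the example values):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # actual values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375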
Root Mean Squared Error (RMSE)
RMSE is the square root of MSE, which keeps the error in the same units as the target variable:
$$RMSE = \sqrt{MSE}$$
Example:
$$RMSE = \sqrt{0.375} \approx 0.612$$
Mean Absolute Error (MAE)
MAE measures the average absolute differences:
$$MAE = \frac{1}{n} \sum |y_i - \hat{y}_i|$$
Example:
$$MAE = \frac{|3-2.5| + |-0.5-0.0| + |2-2| + |7-8|}{4} = 0.5$$
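Both values can be checked the same way as the MSE above (the example arrays are repeated here so the snippet runs on its own):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square root of the MSE
mae = np.mean(np.abs(y_true - y_pred))           # average absolute deviation
print(rmse, mae)  # roughly 0.612 and exactly 0.5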
R-Squared (R²)
R-squared measures how well the model explains the variance in $y$:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
where:
- $SS_{res} = \sum (y_i - \hat{y}_i)^2$ (residual sum of squares)
- $SS_{tot} = \sum (y_i - \bar{y})^2$ (total sum of squares)
Example Calculation: If $SS_{res} = 10$ and $SS_{tot} = 20$:
$$R^2 = 1 - \frac{10}{20} = 0.5$$
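Applied to the same example arrays used for the MSE above (a sketch, not part of the original example), the computation looks like this:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # about 0.949 for this data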
Weighted R-Squared
Weighted R-squared accounts for different importance levels in observations:
$$R^2_w = 1 - \frac{\sum w_i (y_i - \hat{y}_i)^2}{\sum w_i (y_i - \bar{y})^2}$$
where $w_i$ are weights. If higher weights are given to important points, they influence the score more.
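A minimal sketch of the weighted variant, using hypothetical weights (note that $\bar{y}$ is taken as the plain unweighted mean to match the formula above; some definitions use a weighted mean instead):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
w = np.array([1.0, 1.0, 2.0, 2.0])  # hypothetical weights: the last two points count double

ss_res_w = np.sum(w * (y_true - y_pred) ** 2)
ss_tot_w = np.sum(w * (y_true - y_true.mean()) ** 2)
r2_w = 1 - ss_res_w / ss_tot_w
print(r2_w)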
6. Python Implementation
Here is a Python implementation of simple linear regression using NumPy and Matplotlib, with scikit-learn used to report the MSE of the fit:
"""
Linear Regression Implementation
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
def linear_regression(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """
    Compute the slope and intercept using the least squares method.

    Args:
        x (np.ndarray): Independent variable
        y (np.ndarray): Dependent variable

    Returns:
        tuple[float, float]: Slope and intercept
    """
    x_mean, y_mean = np.mean(x), np.mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    intercept = y_mean - slope * x_mean
    return slope, intercept
def plot_regression(x: np.ndarray, y: np.ndarray, slope: float, intercept: float) -> None:
    """
    Plot the linear regression line.

    Args:
        x (np.ndarray): Independent variable
        y (np.ndarray): Dependent variable
        slope (float): Slope of the line
        intercept (float): Intercept of the line
    """
    plt.scatter(x, y, color='blue', label='Data Points')
    plt.plot(x, slope * x + intercept, color='red', label='Regression Line')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.legend()
    plt.show()
if __name__ == "__main__":
    x_data = np.array([1, 2, 3, 4, 5])
    y_data = np.array([2, 4, 5, 4, 5])
    m, c = linear_regression(x_data, y_data)
    print(f"Slope: {m}, Intercept: {c}")
    # Report the MSE of the fitted line (see Section 5)
    y_pred = m * x_data + c
    print(f"MSE: {mean_squared_error(y_data, y_pred):.3f}")
    plot_regression(x_data, y_data, m, c)
7. Conclusion
Linear regression is a fundamental technique in machine learning and statistics. Understanding its derivations, error metrics, and implementation helps in building better predictive models.
Full Code Available Here (Add actual link)