Least Squares Criterion: The most common method for finding the line of best fit is the 'Least Squares' method. It works by minimizing the sum of the squares of the vertical distances (residuals) between each data point and the line.
Residuals: A residual is the difference between the observed value (y) and the predicted value (ŷ) given by the line: e = y − ŷ. Squaring these values ensures that positive and negative deviations do not cancel each other out and penalizes larger outliers more heavily.
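As a minimal sketch (the data points and the candidate line below are made up for illustration), residuals and the sum of squared errors can be computed like this:

```python
# Made-up data points and a hypothetical candidate line y = a + b*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
a, b = 0.1, 2.0

# Residual = observed y minus predicted y-hat.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Squaring stops positive and negative residuals cancelling out and
# penalizes large deviations more heavily.
sse = sum(e * e for e in residuals)
print(round(sse, 4))  # → 0.1
```

The least-squares line is the particular choice of a and b that makes this sum as small as possible.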
Centroid Property: Mathematically, the line of best fit for any dataset always passes through the point (x̄, ȳ), the mean of all x-values paired with the mean of all y-values. This point acts as the 'balance point' of the entire distribution.
The Linear Equation: The line is expressed in the form ŷ = a + bx, where b is the slope and a is the y-intercept. The slope indicates the average change in the dependent variable (y) for every one-unit increase in the independent variable (x).
Calculating the Slope (b): The slope is determined by the correlation between the variables and their respective standard deviations. It can be calculated using the formula b = r(s_y / s_x), where r is the correlation coefficient and s_x and s_y are the standard deviations of x and y.
Determining the Intercept (a): Once the slope is known, the intercept is found by substituting the means of the data into the equation: a = ȳ − b·x̄. This ensures the line is anchored correctly relative to the average values of the dataset.
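Putting the slope and intercept formulas together, here is a minimal sketch (with made-up data) that also verifies the centroid property:

```python
from math import sqrt

# Made-up dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n  # the means x-bar and y-bar

# Centred sums of squares; the (n - 1) factors inside r and the
# standard deviations cancel, so raw sums suffice.
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)   # correlation coefficient
b = r * sqrt(syy / sxx)     # slope: b = r * (s_y / s_x)
a = my - b * mx             # intercept: a = y-bar - b * x-bar

# Centroid property: the fitted line passes through (x-bar, y-bar).
assert abs((a + b * mx) - my) < 1e-9
print(round(b, 2), round(a, 2))  # → 1.97 0.09
```

Note that b = r·(s_y/s_x) simplifies algebraically to sxy/sxx, which is the form most textbooks give for the raw least-squares slope.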
| Feature | Interpolation | Extrapolation |
|---|---|---|
| Definition | Predicting values within the range of observed data. | Predicting values outside the range of observed data. |
| Reliability | Generally high, as the model is supported by surrounding data. | Lower, as it assumes the trend continues indefinitely. |
| Risk | Low risk of error if the linear trend is consistent. | High risk; the relationship may change or become non-linear. |
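The distinction in the table can be illustrated with a hypothetical fitted line and a small helper that classifies each prediction (the coefficients and data range are made up):

```python
# Hypothetical fitted line y-hat = 1.0 + 2.0*x, observed x-range [2, 8].
a, b = 1.0, 2.0
xs_observed = [2.0, 4.0, 6.0, 8.0]
x_min, x_max = min(xs_observed), max(xs_observed)

def predict(x):
    return a + b * x

def kind(x):
    # Inside the observed x-range → interpolation; outside → extrapolation.
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

print(predict(5.0), kind(5.0))    # → 11.0 interpolation
print(predict(20.0), kind(20.0))  # → 41.0 extrapolation
```

Both calls return a number, but only the first is backed by nearby data; the second assumes the linear trend holds far beyond anything observed.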
Correlation vs. Causation: A line of best fit shows that two variables are related, but it does not prove that one causes the other. External 'lurking variables' might be influencing both, or the relationship might be purely coincidental.
Positive vs. Negative Correlation: A positive slope indicates that as x increases, y tends to increase. Conversely, a negative slope indicates that as x increases, y tends to decrease.
Verify the Correlation Coefficient (r): Always check that the sign of r matches the slope of your line. Because b = r(s_y / s_x) and standard deviations are always positive, the slope must share the sign of r: if r is positive, your line must slope upward; if r is negative, it must slope downward.
Assess the Scatter: Before calculating, look at the scatter plot to see if a linear model is even appropriate. If the data points form a curve (like a parabola), a straight line of best fit will provide misleading predictions.
Identify Outliers: Be aware that a single point far from the rest of the data can 'pull' the line toward it, significantly altering the slope and intercept. In exams, you may be asked to describe how removing such a point would change the model's accuracy.
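A quick numeric sketch (made-up data) of how a single extreme point drags the fitted line:

```python
def fit(xs, ys):
    """Least-squares intercept and slope for y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]             # a perfect y = x trend: slope 1
a0, b0 = fit(xs, ys)

# Adding one extreme outlier (6, 20) triples the slope.
a1, b1 = fit(xs + [6], ys + [20])
print(round(b0, 6), round(b1, 6))  # → 1.0 3.0
```

Removing such a point would restore the slope to the trend shown by the remaining data, which is exactly the kind of change exam questions ask you to describe.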
Forcing Through the Origin: A common mistake is assuming the line must pass through the origin (0, 0). Unless the data specifically dictates it, the y-intercept (a) should be calculated freely based on the data's distribution.
Over-reliance on r: A high correlation coefficient (r close to ±1) does not guarantee that a linear model is the best fit. Always check a residual plot; if the residuals show a clear pattern (like a U-shape), the original data is likely non-linear despite a high r value.
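To see this concretely, here is a sketch (made-up, perfectly parabolic data) where r is high yet the residuals form an obvious U-shape:

```python
from math import sqrt

# Perfectly quadratic data: y = x^2.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)   # ≈ 0.977: a very strong linear correlation
b = sxy / sxx               # least-squares slope
a = my - b * mx             # least-squares intercept

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
# U-shaped pattern: positive at the ends, negative in the middle.
print([round(e, 2) for e in residuals])  # → [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

Despite r ≈ 0.98, the patterned residuals reveal that a straight line systematically under- and over-predicts, so a curved model is needed.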
Confusing y on x and x on y: The line of best fit for y on x is different from the line of best fit for x on y. Always ensure you are minimizing the vertical distances (errors in y) unless instructed otherwise.
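A small sketch (made-up data) showing that the two regressions give different lines unless the correlation is perfect:

```python
# Made-up dataset with imperfect correlation.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

b_y_on_x = sxy / sxx   # minimizes vertical errors (in y)
b_x_on_y = sxy / syy   # minimizes horizontal errors (in x)

# Plotted in the same y-vs-x axes, the x-on-y line has slope 1/b_x_on_y.
print(b_y_on_x, 1 / b_x_on_y)  # → 0.8 1.25

# The product of the two slopes equals r² (here 0.64), so the two
# lines coincide only when |r| = 1.
assert abs(b_y_on_x * b_x_on_y - 0.64) < 1e-9
```

Both lines pass through the centroid (x̄, ȳ), but they pivot about it at different angles, which is why specifying the direction of the regression matters.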