Pearson Product-Moment Correlation: This is the most common numerical measure of correlation, represented by the letter r. It is calculated by dividing the covariance of the two variables by the product of their standard deviations, effectively standardizing the relationship.
Mathematical Range: The value of r is strictly bounded between −1 and +1. A value of +1 indicates a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 indicates that no linear relationship exists.
Dimensionless Property: Because r is a standardized value, it has no units of measurement. This allows for the comparison of relationships between variables that have entirely different scales, such as height in centimeters and weight in kilograms.
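As a sketch of the definition above, r can be computed directly as covariance divided by the product of the standard deviations. The height and weight figures below are made-up illustrative values, not data from the text:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean product of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Population standard deviations of each variable.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Different units (cm vs kg), yet r is a single dimensionless number.
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [52, 60, 63, 68, 77]
print(round(pearson_r(heights_cm, weights_kg), 3))
```

Because every term is divided through by the standard deviations, the centimeters and kilograms cancel, which is exactly why r is unit-free.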
Visual Inspection via Scatter Plots: Before calculating r, data should be plotted on a Cartesian plane to identify the general trend. This step is crucial for detecting outliers or non-linear patterns that might make the correlation coefficient misleading.
Calculating the Coefficient: The formula for r involves summing the products of the deviations of each variable from their respective means. The formula is expressed as:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
Interpreting Direction: A positive value means that as the independent variable increases, the dependent variable also tends to increase. Conversely, a negative value indicates an inverse relationship where one variable increases as the other decreases.
| Feature | Correlation | Causation |
|---|---|---|
| Definition | A statistical association between two variables. | A relationship where one variable directly influences the other. |
| Requirement | Only requires data to move in a predictable pattern. | Requires experimental evidence and control of variables. |
| Conclusion | "Variable A is related to Variable B." | "Variable A causes a change in Variable B." |
| Example | Ice cream sales and drowning rates are correlated. | Heat causes both ice cream sales and increased swimming. |
Linear vs. Non-linear Relationships: Pearson's r only detects linear patterns. A dataset could have a perfect U-shaped relationship (quadratic) where r would be near zero, even though the variables are clearly related.
Strength vs. Slope: The correlation coefficient measures how tightly points cluster around a line, not the steepness of that line. A very shallow line and a very steep line can both have an r value of 1 if the points fall exactly on the line.
Check the Bounds: Always verify that your calculated value falls within the range of −1 to +1. If you calculate a value greater than 1 or less than −1, there is a mathematical error in your summation or square root steps.
Identify Outliers: In exam questions, look for a single data point that sits far away from the general cluster. Outliers can significantly pull the line of best fit and artificially inflate or deflate the correlation coefficient.
Contextual Interpretation: When asked to describe a correlation, always include both strength (e.g., strong, moderate, weak) and direction (positive or negative). For example, "There is a strong negative correlation between the variables."
The Causation Trap: The most frequent error is assuming that because two variables are correlated, one must cause the other. Often, a third "lurking variable" is responsible for the movement in both, or the relationship is purely coincidental.
Ignoring the Scatter Plot: Relying solely on the numerical value of r can be dangerous. Different data distributions (like Anscombe's Quartet) can produce the exact same r value while representing completely different types of relationships.
Sample Size Sensitivity: While r measures the strength of a relationship, it does not tell you if that relationship is statistically significant. In very small samples, a high correlation might occur by random chance alone.