Introduction: Understanding Correlation and Its Importance in Data Analysis
In statistics, understanding relationships between variables is crucial for data interpretation, decision-making, and predictive modeling. One of the most common and powerful tools for this is the correlation coefficient (r), which measures the strength and direction of a linear relationship between two continuous variables.
This tutorial will guide you through calculating the Pearson correlation coefficient in RStudio, using a practical example of height and weight data, and visualizing the relationship with a scatter plot and a regression line.
By the end of this post, you’ll be able to:
- Calculate correlation in R
- Interpret correlation results
- Create a scatter plot in R
- Add a regression line and annotate your plot
- Understand how to use visualizations in exploratory data analysis (EDA)
What Is the Correlation Coefficient (r)?
Definition of Pearson’s Correlation Coefficient
The Pearson correlation coefficient (r) is a measure of the linear association between two continuous variables. It ranges from -1 to +1:
- r = +1: Perfect positive linear relationship
- r = -1: Perfect negative linear relationship
- r = 0: No linear relationship
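A quick sketch of these three cases, using made-up vectors (x, y_pos, y_neg, and y_quad are illustrative names, not part of the tutorial's dataset). The last case is worth noting: y is completely determined by x, yet r is zero, because r only captures linear association.

```r
# Illustrative vectors (hypothetical, not the tutorial's data)
x      <- -5:5
y_pos  <- 2 * x + 3    # perfectly linear, increasing
y_neg  <- -2 * x + 3   # perfectly linear, decreasing
y_quad <- x^2          # fully determined by x, but not linearly

r_pos  <- cor(x, y_pos)
r_neg  <- cor(x, y_neg)
r_quad <- cor(x, y_quad)

print(c(positive = r_pos, negative = r_neg, quadratic = r_quad))
```

An r near zero therefore rules out a straight-line trend only; a scatter plot is still needed to spot curved patterns.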
Formula for Pearson’s r
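In its standard form, for n paired observations (x_i, y_i) with sample means x-bar and y-bar (the notation is chosen here for illustration), the coefficient can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales that quantity so the result always falls between -1 and +1.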
Dataset Used for Correlation Analysis
Here, we’ll use a sample dataset representing the height (cm) and weight (kg) of 15 individuals.
Step-by-Step R Code to Calculate Correlation and Create Scatter Plot
Step 1 – Define the Data in R
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)
Here we are creating two numeric vectors: height and weight.
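Before computing anything, it may help to confirm that the two vectors are paired correctly. The sketch below repeats the data so that it runs on its own; the data frame name hw and the variable n_obs are assumptions introduced here, not part of the original code.

```r
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)

# Pearson's r needs paired observations, so the lengths must match
stopifnot(length(height) == length(weight))

# Optional: hold the pairs in a data frame (the name `hw` is illustrative)
hw <- data.frame(height = height, weight = weight)

n_obs <- nrow(hw)      # number of individuals
print(n_obs)
print(colMeans(hw))    # mean height (cm) and mean weight (kg)
```

Keeping the pairs in one data frame is a design choice rather than a requirement; cor.test() works equally well on the two plain vectors.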
Step 2 – Calculate Pearson’s Correlation Coefficient
cor.test(height, weight, method = "pearson")
This line performs a Pearson correlation test, which not only calculates the value of r but also reports its statistical significance (p-value and confidence interval).
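You can extract the r-value and round it for display. Three decimals are used here because rounding a value this close to 1 to two decimals would display it as 1:
cor_result <- cor.test(height, weight, method = "pearson")
r_value <- round(cor_result$estimate, 3)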
Sample Output (values rounded)
Pearson's product-moment correlation
data: height and weight
t = 57.11, df = 13, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9939 0.9994
sample estimates:
cor
0.998
This means the correlation is very strong and positive (r ≈ 0.998).
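To see what cor.test() is estimating, the same value can be reproduced directly from the definition of Pearson's r. This is a sketch for checking the result by hand; the names dx, dy, r_manual, and r_builtin are introduced here for illustration.

```r
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)

# Deviation of each observation from its sample mean
dx <- height - mean(height)
dy <- weight - mean(weight)

# Pearson's r: sum of cross-products divided by the
# square root of the product of the two sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in result for comparison
r_builtin <- cor(height, weight, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

The two numbers should agree up to floating-point error, which is a useful check when adapting the code to a new dataset.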
Step 3 – Create a Basic Scatter Plot
plot(height, weight,
main = "Height vs Weight",
xlab = "Height (cm)",
ylab = "Weight (kg)",
pch = 19,
col = "blue")
This code generates a simple scatter plot with blue circular points.
Step 4 – Add Regression Line
abline(lm(weight ~ height), col = "red")
This line adds a linear regression line in red, showing the trend of the data.
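The fitted line can also be inspected numerically. A minimal sketch, assuming the same vectors as in Step 1 (the object names fit, coefs, slope, and r_squared are introduced here): coef() returns the intercept and slope, and for a simple linear regression with one predictor the R-squared equals the square of Pearson's r.

```r
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)

# Fit weight as a linear function of height (same model passed to abline())
fit <- lm(weight ~ height)

coefs <- coef(fit)                    # named vector: "(Intercept)" and "height"
slope <- coefs[["height"]]            # change in weight (kg) per 1 cm of height
r_squared <- summary(fit)$r.squared   # proportion of variance explained

print(coefs)
print(r_squared)
```

The slope gives the trend line a concrete reading (kilograms per centimetre), which r alone does not provide.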
Step 5 – Annotate the Plot with the r Value
text(x = min(height) + 5, y = max(weight) - 2,
labels = paste("r =", r_value),
col = "darkgreen", cex = 1.2, font = 2)
This call places the correlation coefficient (r) near the top-left corner of the plot.
Figure: Correlation coefficient scatter plot in RStudio
Complete Code Block in RStudio
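Putting Steps 1 through 5 together gives a script along the following lines. It is assembled from the snippets in this post; r is rounded to three decimals here so that a value very close to 1 is not displayed as exactly 1 on the plot, and unname() is an addition that drops the "cor" label from the estimate.

```r
# Step 1: define the data
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)

# Step 2: run the Pearson correlation test and keep r for the plot label
cor_result <- cor.test(height, weight, method = "pearson")
print(cor_result)
r_value <- round(unname(cor_result$estimate), 3)

# Step 3: basic scatter plot with solid blue points
plot(height, weight,
     main = "Height vs Weight",
     xlab = "Height (cm)",
     ylab = "Weight (kg)",
     pch = 19,
     col = "blue")

# Step 4: add the least-squares regression line in red
abline(lm(weight ~ height), col = "red")

# Step 5: annotate the plot with the r value near the top-left corner
text(x = min(height) + 5, y = max(weight) - 2,
     labels = paste("r =", r_value),
     col = "darkgreen", cex = 1.2, font = 2)
```

Running the script in RStudio prints the test result to the console and draws the figure in the Plots pane.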
Interpretation of Results
Strength and Direction
For the data defined in Step 1, r ≈ 0.998, which shows a very strong, positive linear relationship between height and weight: taller individuals in this sample are consistently heavier.
Statistical Significance
The p-value is far below the conventional 0.05 threshold, which indicates that the correlation is statistically significant.
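The significance figures can be pulled out of the test object rather than read off the printed output. A minimal sketch, assuming the Step 1 vectors: p.value and conf.int are components of the object returned by cor.test(), while the variable names p_val, ci, and is_significant are chosen here for illustration.

```r
height <- c(150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178, 180, 183, 185)
weight <- c(48, 50, 52, 54, 56, 58, 60, 62, 64, 65, 68, 70, 72, 75, 78)

cor_result <- cor.test(height, weight, method = "pearson")

p_val <- cor_result$p.value     # two-sided p-value for H0: true correlation = 0
ci    <- cor_result$conf.int    # lower and upper bounds of the 95% interval

# Compare against the conventional 0.05 threshold
is_significant <- p_val < 0.05

print(p_val)
print(ci)
print(is_significant)
```

Working with the extracted values makes it easier to report them in text or reuse them in later calculations.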
Applications of Correlation in Biological Sciences
Domain | Use Case
Epidemiology | Height vs BMI, blood pressure vs cholesterol
Psychology | Stress level vs sleep quality
Environmental Science | Temperature vs species diversity
Agriculture | Rainfall vs crop yield
Conclusion
The correlation coefficient (r) is a fundamental statistical tool that quantifies linear relationships between continuous variables. Using RStudio, you can quickly compute this value, test its significance, and display it visually with a scatter plot and regression line.
In our example, height and weight showed a strong positive correlation (r ≈ 0.998), illustrating how R can be used to explore real-world relationships with just a few lines of code.