Point-Biserial Correlation in R: Full Guide with Code

Introduction to Point-Biserial Correlation

In biological and environmental sciences, we often need to measure the relationship between a binary categorical variable (e.g., presence/absence of contamination) and a continuous variable (e.g., fish growth rate).

Point-biserial correlation is a special case of the Pearson correlation used when one variable is dichotomous (binary: 0 or 1) and the other is continuous.

When to Use Point-Biserial Correlation?

  • Measuring the effect of pollution (High/Low) on plant growth.
  • Analyzing whether disease status (Present/Absent) affects blood pressure levels.
  • Checking if gene mutation (Yes/No) correlates with enzyme activity.

In this tutorial, we will simulate a dataset in R, perform point-biserial correlation, visualize the results with graphs, and interpret the findings.

Step 1: Understanding the Point-Biserial Correlation Formula

The point-biserial correlation coefficient ($r_{pb}$) is calculated as:

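$$r_{pb} = \frac{M_1 - M_2}{s} \sqrt{\frac{n_1 n_2}{n (n - 1)}}$$

Where:

  • $M_1$, $M_2$ = Mean of the continuous variable in each group (the group coded 1 and the group coded 0, respectively).
  • $s$ = Sample standard deviation of the continuous variable across all observations (the $n - 1$ denominator version, as returned by sd() in R).
  • $n_1$, $n_2$ = Sample size of each group.
  • $n$ = Total sample size ($n = n_1 + n_2$).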
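
As a quick check on the formula, the sketch below computes the coefficient by hand and compares it with R's built-in cor(). The toy vectors `group` and `y` and the helper name `point_biserial` are illustrative assumptions, not part of the tutorial dataset, and the standard deviation is taken to be the sample SD that sd() returns.

```r
# Hypothetical helper: point-biserial correlation from the formula above.
# y: continuous vector; group: 0/1 vector of the same length.
point_biserial <- function(y, group) {
  m1 <- mean(y[group == 1])   # M_1: mean of y in the group coded 1
  m2 <- mean(y[group == 0])   # M_2: mean of y in the group coded 0
  n1 <- sum(group == 1)       # size of the group coded 1
  n2 <- sum(group == 0)       # size of the group coded 0
  n  <- length(y)             # total sample size
  s  <- sd(y)                 # sample SD (n - 1 denominator), all observations
  (m1 - m2) / s * sqrt(n1 * n2 / (n * (n - 1)))
}

# Toy data, made up for illustration only
group <- c(0, 0, 0, 0, 1, 1, 1, 1, 1)
y     <- c(5.1, 4.8, 5.6, 4.9, 3.2, 3.9, 3.5, 4.1, 3.0)

r_manual  <- point_biserial(y, group)
r_pearson <- cor(y, group)

# The two values agree up to floating-point tolerance
print(all.equal(r_manual, r_pearson))
```

The agreement follows from the fact that the sample standard deviation of a 0/1 variable is $\sqrt{n_1 n_2 / (n (n - 1))}$, which is exactly the square-root term in the formula.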

Now, let’s generate a dataset and compute point-biserial correlation in R.

Step 2: Generate a Simulated Dataset in R

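We'll create a dataset in which a binary variable determines the distribution of a continuous one:

  • Contamination (binary: Low = 0, High = 1)
  • Fish Growth Rate (continuous, cm/month), drawn from a normal distribution with a lower mean in the High group (3.5 vs. 5.0)
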
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Generate Data
n <- 50  # Sample size
contamination <- sample(0:1, n, replace = TRUE)  # Binary variable: Low (0) or High (1)
growth_rate <- ifelse(contamination == 1,
                      rnorm(n, mean = 3.5, sd = 0.8),  # High contamination group
                      rnorm(n, mean = 5.0, sd = 0.9))  # Low contamination group

# Create Data Frame
fish_data <- data.frame(Contamination = contamination, GrowthRate = growth_rate)


# View first few rows
head(fish_data)
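
The ifelse() call above draws 2n random values and keeps one per observation. As an alternative sketch (not the tutorial's method, so the simulated values will differ even with the same seed), the group-specific mean and SD can be looked up per observation and passed to a single rnorm() call. The names `mu`, `sigma`, and `growth_rate_alt` are assumptions introduced here for illustration.

```r
set.seed(123)
n <- 50
contamination <- sample(0:1, n, replace = TRUE)

# Per-observation parameters, using the same group means and SDs as above
mu    <- ifelse(contamination == 1, 3.5, 5.0)
sigma <- ifelse(contamination == 1, 0.8, 0.9)

# rnorm() is vectorised over mean and sd, so exactly n values are drawn
growth_rate_alt <- rnorm(n, mean = mu, sd = sigma)

# Sanity checks: group sizes and group means
print(table(contamination))
print(tapply(growth_rate_alt, contamination, mean))
```

Because sample() assigns groups at random, the two groups are generally not the same size, which is why the group-size check is worth running before interpreting the correlation.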

Step 3: Compute Point-Biserial Correlation in R

Because point-biserial correlation is simply Pearson correlation with a 0/1-coded variable, we can calculate it using cor.test() with method = "pearson".

# Compute point-biserial correlation (Pearson correlation with a 0/1 variable)
cor_test <- cor.test(fish_data$GrowthRate, as.numeric(fish_data$Contamination), method = "pearson")

# Print correlation results
print(cor_test)
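
The printed output bundles several quantities together. A minimal sketch of how the individual pieces might be pulled out of the object returned by cor.test() is shown below. It uses its own small toy data (`x`, `y`) rather than fish_data so that it runs on its own, and the variable names are assumptions for illustration.

```r
# Self-contained toy data (not the tutorial's fish_data)
set.seed(1)
x <- rep(c(0, 1), each = 10)                    # binary variable
y <- rnorm(20, mean = ifelse(x == 1, 3.5, 5))   # continuous variable

res <- cor.test(y, x, method = "pearson")

r_pb    <- unname(res$estimate)    # correlation coefficient
ci      <- res$conf.int            # 95% confidence interval for r
p_value <- res$p.value             # p-value for H0: true correlation is 0
df      <- unname(res$parameter)   # degrees of freedom (n - 2)

cat(sprintf("r_pb = %.3f, 95%% CI [%.3f, %.3f], p = %.4g, df = %d\n",
            r_pb, ci[1], ci[2], p_value, as.integer(df)))
```

With 1 coded as High, a negative coefficient means the continuous variable tends to be lower in the High group.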

Step 4: Visualizing the Results

Boxplot: Fish Growth vs. Contamination Level

# Convert Contamination to Factor for Visualization
fish_data$Contamination <- factor(fish_data$Contamination, labels = c("Low", "High"))

# Boxplot
ggplot(fish_data, aes(x = Contamination, y = GrowthRate, fill = Contamination)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "black") +
  labs(title = "Effect of Heavy Metal Contamination on Fish Growth",
       x = "Contamination Level",
       y = "Fish Growth Rate (cm/month)") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

[Figure: Fish Growth vs. Contamination Level]

Scatter Plot with Regression Line

# Scatter Plot
ggplot(fish_data, aes(x = as.numeric(Contamination) - 1, y = GrowthRate)) +
  geom_point(alpha = 0.7, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Scatter Plot of Fish Growth vs. Contamination Level",
       x = "Contamination Level (0 = Low, 1 = High)",
       y = "Fish Growth Rate (cm/month)") +
  theme_minimal()

[Figure: Scatter Plot with Regression Line]
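
The regression line drawn by geom_smooth(method = "lm") has a simple interpretation when the predictor is coded 0/1. The sketch below, on assumed toy data rather than fish_data, fits the same kind of linear model with lm() and checks that the intercept is the mean of the group coded 0, the slope is the difference between group means, and the squared correlation equals the model's R-squared.

```r
# Toy data, assumed for illustration
set.seed(42)
x <- rep(c(0, 1), times = c(12, 8))                        # 0 = Low, 1 = High
y <- rnorm(20, mean = ifelse(x == 1, 3.5, 5.0), sd = 0.8)

fit <- lm(y ~ x)

intercept <- unname(coef(fit)[1])   # fitted value at x = 0
slope     <- unname(coef(fit)[2])   # change in fitted value from x = 0 to x = 1

mean_low  <- mean(y[x == 0])
mean_diff <- mean(y[x == 1]) - mean_low

print(all.equal(intercept, mean_low))                      # intercept = mean of group 0
print(all.equal(slope, mean_diff))                         # slope = difference in means
print(all.equal(cor(y, x)^2, summary(fit)$r.squared))      # r_pb^2 = R-squared
```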

Step 5: Additional Statistical Analysis (T-Test)

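We can also perform an independent two-sample t-test to check whether mean growth rate differs significantly between contamination levels. Note that t.test() runs Welch's t-test by default, which does not assume equal variances. The significance test for the point-biserial correlation corresponds exactly to the pooled-variance version (var.equal = TRUE), so the two p-values will usually be close but not identical.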

# Perform independent t-test (Welch's t-test by default)
t_test <- t.test(GrowthRate ~ Contamination, data = fish_data)

# Print t-test results
print(t_test)
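
The link between the correlation test and the t-test can be checked numerically. The sketch below uses assumed toy data (not fish_data) and the pooled-variance t-test (var.equal = TRUE) rather than the Welch default, and compares its statistic with the one implied by the correlation, $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$.

```r
# Toy data, assumed for illustration
set.seed(7)
x <- rep(c(0, 1), times = c(15, 10))
y <- rnorm(25, mean = ifelse(x == 1, 3.5, 5.0), sd = 0.9)

ct <- cor.test(y, x)                      # point-biserial correlation test
tt <- t.test(y ~ x, var.equal = TRUE)     # pooled-variance (Student) t-test

r <- unname(ct$estimate)
n <- length(y)
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# t.test() reports mean(group 0) - mean(group 1), so only the sign differs
print(all.equal(abs(t_from_r), abs(unname(tt$statistic))))
print(all.equal(ct$p.value, tt$p.value))
```

With the Welch default the statistic and degrees of freedom are computed differently, so an exact match should not be expected there.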

📥 Download Sample Dataset

👉 Click here to download the fish growth dataset (Excel)

Full R Code

# Load necessary libraries
library(ggplot2)
library(dplyr)


# Simulated dataset (replace with your actual dataset if available)
set.seed(123)  # For reproducibility

n <- 50  # Number of observations

# Binary variable (0 = Low, 1 = High)
contamination <- sample(0:1, n, replace = TRUE)

# Growth rate depends on the contamination group
growth_rate <- ifelse(contamination == 1,
                      rnorm(n, mean = 3.5, sd = 0.8),  # High contamination group
                      rnorm(n, mean = 5.0, sd = 0.9))  # Low contamination group

# Create Data Frame
fish_data <- data.frame(Contamination = contamination, GrowthRate = growth_rate)

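# Optional: read the dataset from Excel instead of simulating it
# (requires the readxl package and fish_growth.xlsx in the working directory)
if (file.exists("fish_growth.xlsx")) {
  fish_growth <- readxl::read_excel("fish_growth.xlsx")
  View(fish_growth)
}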

# View first few rows
head(fish_data)

# Convert Contamination to a factor for visualization
fish_data$Contamination <- factor(fish_data$Contamination, labels = c("Low", "High"))

# Summary Statistics
summary_stats <- fish_data %>%
  group_by(Contamination) %>%
  summarise(Mean_Growth = mean(GrowthRate),
            SD_Growth = sd(GrowthRate),
            Min_Growth = min(GrowthRate),
            Max_Growth = max(GrowthRate),
            Count = n())

print(summary_stats)

# --------------------
# 🔹 Point-Biserial Correlation
# --------------------
cor_test <- cor.test(fish_data$GrowthRate, as.numeric(fish_data$Contamination) - 1, method = "pearson")

# Print correlation results
print(cor_test)

# --------------------
# 🔹 Independent T-Test
# --------------------
t_test <- t.test(GrowthRate ~ Contamination, data = fish_data)

# Print t-test results
print(t_test)

# --------------------
# 🔹 Boxplot: Fish Growth Rate by Contamination Level
# --------------------
ggplot(fish_data, aes(x = Contamination, y = GrowthRate, fill = Contamination)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "black") +
  labs(title = "Effect of Heavy Metal Contamination on Fish Growth",
       x = "Contamination Level",
       y = "Fish Growth Rate (cm/month)") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"))

# --------------------
# 🔹 Scatterplot: Fish Growth Rate vs. Contamination Level
# --------------------
ggplot(fish_data, aes(x = as.numeric(Contamination) - 1, y = GrowthRate)) +
  geom_point(alpha = 0.7, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Scatter Plot of Fish Growth vs. Contamination Level",
       x = "Contamination Level (0 = Low, 1 = High)",
       y = "Fish Growth Rate (cm/month)") +
  theme_minimal()

Conclusion

In this article, we explored point-biserial correlation in R, a method for measuring the relationship between a binary categorical variable (e.g., contamination level) and a continuous variable (e.g., fish growth rate).

Through data simulation, correlation analysis, visualization, and hypothesis testing, we saw how the method picks up the lower growth rate built into the high-contamination group of our simulated data. Key takeaways include:

✅ Point-biserial correlation quantifies the strength and direction of the association, on a scale from -1 to 1.
✅ Graphical representations (boxplots and scatter plots) help visualize the trend.
✅ T-tests assess whether the difference between group means is statistically significant.

This method is widely applicable in biostatistics, ecology, health sciences, and genetics, making it a valuable tool for researchers analyzing binary vs. continuous relationships.