Understanding Variability in Data Science: Key Metrics and Python

Hey Everyone! 👋

If you don’t know me yet, I’m Dhyuthidhar Saraswathula, and I love writing about Computer Science and Data Science topics. Today, we’re diving into a fundamental statistical concept that lies at the heart of data analysis: Variability.

If you’re ready, it’s time to fasten your seatbelts and join me on this thrilling adventure. Let’s dive in and explore what’s ahead together!

What is Variability?

Variability, also called dispersion, measures how spread out or tightly clustered data values are. It's a cornerstone of statistics and data science that answers critical questions like:

Are the data points closely packed together or widely spread out?
How do we measure this spread and use it for decision-making in machine learning?

In short, variability is all about the spread of your data. A reasonable spread means the dataset covers a range of cases, which gives a model more varied examples to learn from. Think of the heights of people in a class: a dataset that includes short, medium, and tall students tells you more than one where everyone is the same height.
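To make the idea of spread concrete, here is a small sketch. The two classes and their heights are made up for illustration, and the spread is measured with the standard deviation, which is covered later in this post.

```python
import numpy as np

# Hypothetical heights in centimetres; both classes share the same mean.
class_a = np.array([158, 160, 162, 164, 166])  # tightly clustered
class_b = np.array([140, 150, 162, 174, 184])  # widely spread

for name, heights in [("Class A", class_a), ("Class B", class_b)]:
    # Same centre, very different spread.
    print(f"{name}: mean={heights.mean():.1f}, std={heights.std(ddof=1):.1f}")
```

Both classes have the same average height, so the mean alone cannot tell them apart. Only a measure of variability does.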

Why Should You Care About Variability?

Variability isn’t just a statistical term; it’s something we encounter daily.

Real-life analogy:

Imagine training a machine learning model to predict house prices. If the variability in data is high, your model might struggle to generalize, resulting in poor predictions. Understanding and managing variability can improve your model's performance.

Key Terms for Variability Metrics

1. Deviations

The differences between the observed values and an estimate of location (mean, median, etc.). In machine learning, these are also called errors or residuals.
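As a rough sketch, deviations can be computed by subtracting the chosen estimate of location from every observation. The small dataset below is only an illustration.

```python
import numpy as np

data = np.array([2, 4, 6, 8, 10])

# Deviations from two different estimates of location.
deviations_from_mean = data - data.mean()
deviations_from_median = data - np.median(data)

print(f"Deviations from mean: {deviations_from_mean}")
print(f"Deviations from median: {deviations_from_median}")

# Deviations from the mean always cancel out, so their plain sum is not
# a useful measure of spread.
print(f"Sum of deviations from mean: {deviations_from_mean.sum()}")
```

Because positive and negative deviations from the mean cancel each other, the metrics in the rest of this post square them or take their absolute values before averaging.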

2. Variance

Variance quantifies how spread out the data points are from the mean.

Sample Variance

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

Population Variance

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Here:

N = number of data points
x_i = individual data point
μ (mu) = population mean; x̄ (x-bar) = sample mean

In Python, you can calculate variance using numpy or pandas:

import numpy as np

data = [2, 4, 6, 8, 10]
# ddof stands for "Delta Degrees of Freedom."
# Set ddof=1 for sample variance and ddof=0 for population variance.

sample_variance = np.var(data, ddof=1)  
population_variance = np.var(data, ddof=0)  
print(f"Sample Variance: {sample_variance}")
print(f"Population Variance: {population_variance}")
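The text above mentions pandas as well, so here is a minimal sketch of the pandas route. One thing to watch: `Series.var()` defaults to `ddof=1` (sample variance), while `np.var()` defaults to `ddof=0` (population variance), so the two libraries give different numbers unless you set `ddof` explicitly.

```python
import numpy as np
import pandas as pd

data = [2, 4, 6, 8, 10]
series = pd.Series(data)

pandas_sample_variance = series.var()            # pandas default: ddof=1
pandas_population_variance = series.var(ddof=0)  # population variance
numpy_default_variance = np.var(data)            # NumPy default: ddof=0

print(f"pandas default (sample): {pandas_sample_variance}")
print(f"pandas ddof=0 (population): {pandas_population_variance}")
print(f"NumPy default (population): {numpy_default_variance}")
```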

3. Standard Deviation

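The standard deviation is the square root of the variance. It expresses the typical distance of data points from the mean and, unlike the variance, it is in the same units as the original data.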

Sample Standard Deviation

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Population Standard Deviation

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

Python Implementation:

sample_std_dev = np.std(data, ddof=1)  # ddof=1 for the sample standard deviation
population_std_dev = np.std(data)      # default ddof=0 gives the population value
print(f"Sample Standard Deviation: {sample_std_dev}")
print(f"Population Standard Deviation: {population_std_dev}")
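If you want to double-check these numbers, one possible sanity check is to compare them against the square root of the variance and against Python's built-in `statistics` module, where `stdev` is the sample version and `pstdev` is the population version.

```python
import math
import statistics

import numpy as np

data = [2, 4, 6, 8, 10]

sample_std = np.std(data, ddof=1)
population_std = np.std(data, ddof=0)

# The standard deviation is the square root of the matching variance.
sqrt_of_sample_variance = math.sqrt(np.var(data, ddof=1))

# Standard-library equivalents: stdev (sample) and pstdev (population).
stdlib_sample_std = statistics.stdev(data)
stdlib_population_std = statistics.pstdev(data)

print(f"NumPy sample std: {sample_std:.4f}, stdlib: {stdlib_sample_std:.4f}")
print(f"NumPy population std: {population_std:.4f}, stdlib: {stdlib_population_std:.4f}")
```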

4. Mean Absolute Deviation (MAD)

The mean of the absolute deviations from the mean:

$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}|$$

Python Implementation:

# Convert the list to an array first; a plain list cannot be subtracted from a number.
mad = np.mean(np.abs(np.asarray(data) - np.mean(data)))
print(f"Mean Absolute Deviation: {mad}")

5. Median Absolute Deviation (Robust to Outliers)

The median of the absolute deviations from the median:

$$\text{MAD}_{\text{median}} = \text{median}(|x_i - \text{median}(x)|)$$

Python Implementation:

from scipy.stats import median_abs_deviation

mad_median = median_abs_deviation(data)
print(f"Median Absolute Deviation: {mad_median}")
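To see what "robust to outliers" means in practice, here is a small sketch that swaps one value for an extreme one. The outlier value of 100 is made up for illustration. The `scale="normal"` option of `median_abs_deviation` rescales the result so that it is comparable to a standard deviation for normally distributed data.

```python
import numpy as np
from scipy.stats import median_abs_deviation

clean = np.array([2, 4, 6, 8, 10])
with_outlier = np.array([2, 4, 6, 8, 100])  # hypothetical outlier replaces 10

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    std = np.std(values, ddof=1)
    mad_raw = median_abs_deviation(values)
    mad_scaled = median_abs_deviation(values, scale="normal")
    print(f"{label}: std={std:.2f}, MAD={mad_raw:.2f}, scaled MAD={mad_scaled:.2f}")
```

The standard deviation jumps by an order of magnitude, while the median absolute deviation does not move at all.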

6. Range and Interquartile Range (IQR)

Range

The difference between the maximum and minimum values:

$$\text{Range} = \text{max}(x) - \text{min}(x)$$

Interquartile Range

The difference between the 75th percentile (Q3) and 25th percentile (Q1):

$$\text{IQR} = Q_3 - Q_1$$

Python Implementation:

iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(f"Interquartile Range: {iqr}")
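The snippet above only covers the IQR, so here is a short sketch that also computes the range, plus an alternative way to get the IQR with `scipy.stats.iqr`. Both IQR versions use the default linear interpolation between data points.

```python
import numpy as np
from scipy.stats import iqr

data = np.array([2, 4, 6, 8, 10])

# Range: maximum minus minimum (np.ptp means "peak to peak").
data_range = data.max() - data.min()
data_range_ptp = np.ptp(data)

# IQR: computed manually from the quartiles, then with SciPy's helper.
q1, q3 = np.percentile(data, [25, 75])
iqr_manual = q3 - q1
iqr_scipy = iqr(data)

print(f"Range: {data_range}")
print(f"IQR (manual): {iqr_manual}, IQR (scipy): {iqr_scipy}")
```

The range depends only on the two most extreme values, so a single outlier can change it completely. The IQR ignores the top and bottom quarters of the data and is far less sensitive.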

Why Use N−1 Instead of N for Sample Variance?

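Degrees of Freedom: The number of independent values that can vary in a calculation. Because the deviations from the sample mean must sum to zero, only N−1 of them are free to vary.
Bessel's Correction: Dividing by N−1 instead of N makes the sample variance an unbiased estimate of the population variance.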

For large datasets, the difference between N and N-1 is negligible.
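One way to convince yourself that the N−1 version is unbiased is to enumerate every possible sample from a tiny population and average the two estimates. The sketch below assumes a five-value population and samples of size 2 drawn with replacement. Both choices are just for illustration.

```python
from itertools import product

import numpy as np

population = [2, 4, 6, 8, 10]
true_variance = np.var(population)  # population variance (ddof=0)

# Every possible ordered sample of size 2, drawn with replacement (25 in total).
samples = np.array(list(product(population, repeat=2)))

biased_estimates = samples.var(axis=1, ddof=0)    # divide by N
unbiased_estimates = samples.var(axis=1, ddof=1)  # divide by N-1

print(f"True population variance: {true_variance}")
print(f"Average of N estimates: {biased_estimates.mean()}")
print(f"Average of N-1 estimates: {unbiased_estimates.mean()}")
```

Averaged over all samples, dividing by N−1 recovers the population variance, while dividing by N underestimates it.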

Robust Measures of Variability

For datasets with outliers, robust measures such as the Median Absolute Deviation or a trimmed variance or standard deviation (computed after dropping the most extreme values) are preferred.

Python Implementation for Trimmed Variance:

# Sort first so that slicing drops the smallest and largest values.
sorted_data = np.sort(data)

trimmed_variance = np.var(sorted_data[1:-1])  # Exclude the two extremes
print(f"Trimmed Variance: {trimmed_variance}")
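Trimming can also be written as a small reusable function that drops a fixed fraction of values from each end. The helper name `trimmed_std`, its parameters, and the sample data below are hypothetical. This is a sketch, not a library function.

```python
import numpy as np

def trimmed_std(values, proportion=0.1, ddof=1):
    """Hypothetical helper: standard deviation after dropping a fraction
    of the smallest and largest values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)  # number of values to drop per tail
    trimmed = ordered[k:ordered.size - k]
    return trimmed.std(ddof=ddof)

# Made-up, unsorted data with one obvious outlier (100).
data = [8, 2, 100, 4, 6, 10, 5, 7, 3, 9]

plain_std = np.std(data, ddof=1)
robust_std = trimmed_std(data, proportion=0.1)

print(f"Plain standard deviation: {plain_std:.2f}")
print(f"Trimmed standard deviation: {robust_std:.2f}")
```

Sorting before slicing matters here: it guarantees that the values being dropped are the extremes, whatever order the data arrives in. The sketch does not guard against trimming away almost all of the values, so keep `proportion` well below 0.5.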

Conclusion

I hope you now have a clear picture of what variability is. To summarize: variability is a cornerstone of data analysis and machine learning. It provides insight into the structure of your data, helping you make informed decisions. As we dive deeper into data science, understanding these foundational concepts will sharpen our ability to build reliable machine-learning models. In a future post, we'll talk about Estimation Based on Percentiles.

🗝

Try calculating these metrics on your dataset and let me know your observations!

References:

If you're interested in learning more about statistics for ML, you can pick up this book, and if you want to learn NumPy, you can visit the documentation:

Practical Statistics for Data Scientists
Official NumPy Documentation

You can also learn statistics from Krish Naik's YouTube channel.
