Understanding Key Statistical Concepts: Percentages, Percentiles, Quartiles, and Moments

Introduction
Percentages: The Basics
Percentiles: Positioning Within Data
Quartiles: Dividing Data Sets
Moments: Mean, Variance, Skewness, and Kurtosis
Data Distributions: Normal vs. Exponential
Practical Implementation with Python
Conclusion

Introduction

Statistics forms the backbone of data analysis, providing tools and methodologies to interpret and make sense of data. Key statistical measures like percentages, percentiles, quartiles, and moments offer insights into data distribution, variability, and trends. This article explores these concepts in detail, illustrating their significance and application in real-world scenarios, especially in machine learning and data visualization.

Percentages: The Basics

Percentage is a straightforward concept representing a part out of 100. It’s a ubiquitous measure used to express proportions, comparisons, and changes in various contexts.

Calculating Percentage

To calculate the percentage, use the formula:

\[ \text{Percentage} = \left( \frac{\text{Part}}{\text{Whole}} \right) \times 100 \]

Example:

If you score 95 out of 100, your percentage is:

\[ \left( \frac{95}{100} \right) \times 100 = 95\% \]

For a score of 150 out of 200, the percentage is:

\[ \left( \frac{150}{200} \right) \times 100 = 75\% \]

Percentages are foundational in various analyses, from academic grading to market share assessments.
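
The formula above translates directly into code. The following is a minimal sketch; the function name `percentage` is hypothetical and chosen only for illustration.

```python
def percentage(part: float, whole: float) -> float:
    """Return part expressed as a percentage of whole."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return 100 * part / whole


print(percentage(95, 100))   # 95.0
print(percentage(150, 200))  # 75.0
```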

Percentiles: Positioning Within Data

Percentiles indicate the relative standing of a value within a data set. They divide a data set into 100 equal parts, each representing 1%.

Understanding Percentiles

25th Percentile (Q1): 25% of the data points fall below this value.
50th Percentile (Median or Q2): 50% of the data points fall below this value.
75th Percentile (Q3): 75% of the data points fall below this value.

Practical Example:

Consider the distribution of annual family income in a population:

If a family’s annual income is at the 25th percentile, it means 25% of families earn less, and 75% earn more.
At the 50th percentile (Median), half of the population earns less, and half earns more.
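
As a rough illustration of both directions (from a percentile to a value, and from a value to its percentile), the sketch below uses Python's standard `statistics.quantiles`. The income figures are invented, the helper `percentile_rank` is a hypothetical name, and `method="inclusive"` is one of several interpolation conventions rather than the only valid one.

```python
from statistics import quantiles

# Hypothetical annual family incomes (in thousands), for illustration only
incomes = [18, 22, 25, 31, 36, 42, 48, 55, 67, 120]


def percentile_rank(data, value):
    """Percentage of data points that fall strictly below value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


# n=100 returns the 99 cut points that split the data into 100 equal parts
cuts = quantiles(incomes, n=100, method="inclusive")
p25, p50, p75 = cuts[24], cuts[49], cuts[74]

print(p25, p50, p75)                 # 26.5 39.0 53.25
print(percentile_rank(incomes, 42))  # 50.0
```

With only ten data points the percentile values are interpolated between neighbours, so other conventions can give slightly different numbers.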

Visual Representation:

Imagine a graph where the x-axis represents percentiles (1 to 99) and the y-axis shows cumulative wealth. Such a graph helps visualize wealth inequality, showcasing how wealth accumulates disproportionately across different percentiles.
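
The quantity on the y-axis of such a graph can be sketched as the share of the total held by the bottom p% of households. The wealth values and the function name `cumulative_share` below are assumptions made up for illustration, not real data.

```python
# Hypothetical household wealth values (arbitrary units), for illustration only
wealth = [1, 2, 3, 4, 5, 6, 8, 11, 20, 40]


def cumulative_share(data, pct):
    """Share of the total (in %) held by the bottom pct% of data points."""
    ordered = sorted(data)
    k = round(len(ordered) * pct / 100)  # number of points in the bottom pct%
    return 100 * sum(ordered[:k]) / sum(ordered)


for pct in (50, 90, 100):
    print(pct, cumulative_share(wealth, pct))
# 50 15.0
# 90 60.0
# 100 100.0
```

In this made-up sample the bottom half holds 15% of the total while the top tenth holds 40%, which is the kind of imbalance the graph would make visible.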

Quartiles: Dividing Data Sets

Quartiles split a data set into four equal parts, each representing 25% of the data.

The Four Key Quartiles

First Quartile (Q1): 25% of data falls below this value.
Second Quartile (Q2): Also known as the Median, where 50% of data falls below.
Third Quartile (Q3): 75% of data falls below this value.
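Fourth Quartile (Q4): The highest 25% of data points, lying between Q3 and the maximum. Unlike Q1 to Q3, which are cut points, this describes a group of values.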
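
A small sketch of how the three cut points produce four equal groups, using the standard library's `statistics.quantiles` with `n=4`. The sample values are hypothetical, and the `"inclusive"` interpolation method is an assumption; other methods shift the cut points slightly.

```python
from statistics import quantiles

data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23, 27, 30]  # hypothetical sample

# n=4 returns the three cut points Q1, Q2 (median) and Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The cut points split the data into four groups
groups = [
    [x for x in data if x <= q1],
    [x for x in data if q1 < x <= q2],
    [x for x in data if q2 < x <= q3],
    [x for x in data if x > q3],  # the top 25% of values
]

print(q1, q2, q3)               # 7.75 13.5 21.5
print([len(g) for g in groups]) # [3, 3, 3, 3]
```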

Importance of Quartiles

Quartiles are instrumental in understanding data dispersion and central tendency. They are foundational in constructing box plots, which visualize the distribution, identify outliers, and compare different data sets.

Box Plot Components:

Box: Represents the interquartile range (IQR) between Q1 and Q3.
Median Line: Inside the box, indicating the median (Q2).
Whiskers: Extend to the smallest value no lower than Q1 - 1.5 * IQR and the largest value no higher than Q3 + 1.5 * IQR.
Outliers: Data points beyond the whiskers.
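
The numbers behind a box plot can be computed by hand. The sketch below assumes the common 1.5 * IQR rule described above and a hypothetical sample with one extreme value; plotting libraries may use a slightly different quartile convention.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]  # hypothetical sample with one extreme value

q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Fences: limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)         # 4.25 7.5 9.75 5.5
print(whisker_low, whisker_high)   # 2 12
print(outliers)                    # [30]
```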

Moments: Mean, Variance, Skewness, and Kurtosis

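Moments are quantitative measures of the shape of a data distribution. The mean is the first moment; variance, skewness, and kurtosis are built from the second, third, and fourth moments about the mean, with skewness and kurtosis additionally scaled by powers of the standard deviation. Together, the first four moments summarize the key characteristics of the data: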

First Moment (Mean): The average value.
Second Moment (Variance): Measures data dispersion around the mean.
Third Moment (Skewness): Indicates asymmetry in the distribution.
Fourth Moment (Kurtosis): Describes the “tailedness” of the distribution.

Detailed Explanation

1. Mean

The mean is the sum of all data points divided by the number of points. It represents the central value of the data.

\[ \text{Mean} (\mu) = \frac{\sum_{i=1}^{N} x_i}{N} \]

2. Variance

Variance measures how much data points differ from the mean.

\[ \text{Variance} (\sigma^2) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

A higher variance indicates greater dispersion.

3. Skewness

Skewness quantifies the asymmetry of the data distribution.

Positive Skew: Tail extends to the right; mean > median.
Negative Skew: Tail extends to the left; mean < median.

\[ \text{Skewness} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^3}{\sigma^3} \]

4. Kurtosis

Kurtosis measures the “tailedness” of the distribution.

High Kurtosis: More data in the tails; sharper peak.
Low Kurtosis: Less data in the tails; flatter peak.

\[ \text{Kurtosis} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^4}{\sigma^4} - 3 \]

*(Subtracting 3 gives the excess kurtosis, which is zero for any normal distribution.)*
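
To make the four formulas concrete, here is a minimal pure-Python sketch that applies them term by term. The function name `moments` and the sample data are hypothetical. The formulas divide by N (population form), which is also the default behaviour of `np.var`, `scipy.stats.skew`, and `scipy.stats.kurtosis`.

```python
import math


def moments(data):
    """Return (mean, variance, skewness, excess kurtosis) using the formulas above."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    sigma = math.sqrt(variance)
    skewness = (sum((x - mean) ** 3 for x in data) / n) / sigma ** 3
    kurtosis = (sum((x - mean) ** 4 for x in data) / n) / sigma ** 4 - 3
    return mean, variance, skewness, kurtosis


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(moments(sample))  # (5.0, 4.0, 0.65625, -0.21875)
```

The positive skewness reflects the longer right tail of this sample (the value 9 sits further from the mean than the value 2 does).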

Data Distributions: Normal vs. Exponential

Understanding data distributions is pivotal in statistics and machine learning, influencing how models interpret data.

Normal Distribution

Often referred to as the bell curve, the normal distribution is symmetric about the mean, and values near the mean occur more frequently than values far from it.

Characteristics:

Mean = Median = Mode
Defined by parameters: mean (μ) and standard deviation (σ)
Approximately 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ from the mean.
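
The 68-95-99.7 figures can be checked from the normal cumulative distribution function. For any normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / √2). The sketch below uses only the standard library; the helper name `prob_within` is hypothetical.

```python
import math


def prob_within(k: float) -> float:
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k}σ: {prob_within(k):.4f}")
# within ±1σ: 0.6827
# within ±2σ: 0.9545
# within ±3σ: 0.9973
```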

Exponential Distribution

The exponential distribution is primarily used to model the time between events in a Poisson process. It’s characterized by a single parameter, λ (rate).

Characteristics:

Asymmetric: Right-skewed with a long tail.
Memoryless property: The probability of waiting a further time t does not depend on how long you have already waited.
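
The memoryless property can be verified numerically from the survival function P(T > t) = exp(-λt). In the sketch below the rate and the waiting times are arbitrary values chosen for illustration, and `survival` is a hypothetical helper name.

```python
import math


def survival(t: float, rate: float) -> float:
    """P(T > t) for an exponential distribution with the given rate."""
    return math.exp(-rate * t)


rate = 0.5        # hypothetical rate: 0.5 events per unit of time
s, t = 3.0, 2.0   # already waited s units; asking about t more units

# P(T > s + t | T > s) = P(T > s + t) / P(T > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)

print(math.isclose(conditional, unconditional))  # True
```

Having already waited 3 units changes nothing: the chance of waiting at least 2 more units is the same as it was at the start.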

Comparison:

While the normal distribution is symmetric, the exponential distribution is skewed, making them suitable for different types of data analyses.

Practical Implementation with Python

To solidify the understanding of these concepts, let’s explore a practical example using Python’s numpy, matplotlib, and scipy libraries.

Generating and Visualizing Data

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp

# Generate 100,000 data points from a normal distribution
values = np.random.normal(0.0, 1.5, 100000)

# Plot histogram
plt.hist(values, bins=50, edgecolor='k')
plt.title('Histogram of Normally Distributed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Output:

[Figure: histogram of the normally distributed data]

Calculating Moments

First Moment: Mean

mean = np.mean(values)
print(f"Mean: {mean}")

Output:

Mean: 0.00617

Second Moment: Variance

variance = np.var(values)
print(f"Variance: {variance}")

Output:

Variance: 2.24267

Third Moment: Skewness

skewness = sp.skew(values)
print(f"Skewness: {skewness}")

Output:

Skewness: -0.00366

*Effectively zero. The tiny negative value is sampling noise, as expected for data drawn from a symmetric distribution.*

Fourth Moment: Kurtosis

kurtosis = sp.kurtosis(values)
print(f"Kurtosis: {kurtosis}")

Output:

Kurtosis: 0.01309

*Close to zero. scipy reports excess kurtosis by default, so this indicates tails similar to those of a normal distribution.*

Interpretation

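Mean (~0): Data is centered around zero, matching the mean of 0.0 used to generate it.
Variance (~2.24): Close to the theoretical value of 1.5² = 2.25 for the standard deviation of 1.5 used to generate the data.
Skewness (~-0.00366): Nearly symmetric; the deviation from zero is negligible.
Kurtosis (~0.01309): Excess kurtosis is near zero, so the tails are about as heavy as those of a normal distribution.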
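
Percentiles and quartiles can be computed on the same kind of data with `np.percentile`. This sketch regenerates the sample with a seeded generator so that it is reproducible; the seed is an assumption, so the values will not match the unseeded outputs shown earlier. For a normal distribution the interquartile range is about 1.349σ, roughly 2.02 here.

```python
import numpy as np

# Seeded generator for reproducibility (an assumption; the data above is unseeded)
rng = np.random.default_rng(42)
values = rng.normal(0.0, 1.5, 100_000)  # mean 0.0, standard deviation 1.5

# 25th, 50th and 75th percentiles, i.e. Q1, the median and Q3
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

print(f"Q1: {q1:.3f}, Median: {median:.3f}, Q3: {q3:.3f}, IQR: {iqr:.3f}")
```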

Conclusion

A solid understanding of percentages, percentiles, quartiles, and moments is essential for effective data analysis and machine learning. These measures describe the distribution and variability of data, and they underpin more advanced analytical techniques and model building. With Python’s numpy and scipy, practitioners can compute and interpret these statistics efficiently.

Whether you’re analyzing financial data, assessing population demographics, or fine-tuning machine learning models, these foundational statistics serve as the bedrock for robust and insightful analysis.
