S02L07 – Percentiles, Moments and Quantiles

Understanding Key Statistical Concepts: Percentages, Percentiles, Quartiles, and Moments

Table of Contents

  1. Introduction
  2. Percentages: The Basics
  3. Percentiles: Positioning Within Data
  4. Quartiles: Dividing Data Sets
  5. Moments: Mean, Variance, Skewness, and Kurtosis
  6. Data Distributions: Normal vs. Exponential
  7. Practical Implementation with Python
  8. Conclusion

Introduction

Statistics forms the backbone of data analysis, providing tools and methodologies to interpret and make sense of data. Key statistical measures like percentages, percentiles, quartiles, and moments offer insights into data distribution, variability, and trends. This article explores these concepts in detail, illustrating their significance and application in real-world scenarios, especially in machine learning and data visualization.

Percentages: The Basics

A percentage expresses a quantity as a number of parts out of 100. It is used everywhere to state proportions, comparisons, and changes.

Calculating Percentage

To calculate the percentage, use the formula:

\[ \text{Percentage} = \left( \frac{\text{Part}}{\text{Whole}} \right) \times 100 \]

Example:

  • If you score 95 out of 100, your percentage is:

\[ \left( \frac{95}{100} \right) \times 100 = 95\% \]

  • For a score of 150 out of 200, the percentage is:

\[ \left( \frac{150}{200} \right) \times 100 = 75\% \]

Percentages are foundational in various analyses, from academic grading to market share assessments.
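The formula above can be sketched as a small helper. The function name `percentage` is just an illustrative choice, and the multiplication is done before the division to keep the two worked examples exact in floating point.

```python
def percentage(part: float, whole: float) -> float:
    """Return `part` expressed as a percentage of `whole`."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    # Multiply first so that simple cases such as 95/100 stay exact.
    return 100 * part / whole


print(percentage(95, 100))   # 95.0
print(percentage(150, 200))  # 75.0
```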

Percentiles: Positioning Within Data

Percentiles indicate the relative standing of a value within a data set. The 99 percentile cut points divide the sorted data into 100 equal parts, each holding 1% of the observations.

Understanding Percentiles

  • 25th Percentile (Q1): 25% of the data points fall below this value.
  • 50th Percentile (Median or Q2): 50% of the data points fall below this value.
  • 75th Percentile (Q3): 75% of the data points fall below this value.

Practical Example:

Consider the distribution of annual income across families in a population:

  • If a family’s annual income is at the 25th percentile, it means 25% of families earn less, and 75% earn more.
  • At the 50th percentile (Median), half of the families earn less, and half earn more.
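As a rough sketch of how such percentiles can be computed, the snippet below applies numpy's `percentile` function to a small, made-up list of family incomes. The numbers are illustrative assumptions, not real data. By default numpy interpolates linearly between neighbouring values, so other tools may give slightly different answers on small samples. The percentile rank is defined here as the share of families earning strictly less, which is one of several conventions.

```python
import numpy as np

# Hypothetical annual family incomes (in thousands); illustrative values only.
incomes = np.array([18, 22, 25, 31, 36, 40, 47, 55, 68, 90, 250])

# 25th, 50th and 75th percentiles (numpy's default linear interpolation).
p25, p50, p75 = np.percentile(incomes, [25, 50, 75])
print(p25, p50, p75)  # 28.0 40.0 61.5

# Percentile rank of one family: share of families earning strictly less.
family_income = 47
rank = (incomes < family_income).mean() * 100
print(f"{rank:.1f}% of families earn less than {family_income}")
```

Note how the single very large income (250) barely moves the median, which is one reason percentiles are popular for income data.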

Visual Representation:

Imagine a graph where the x-axis represents percentiles (1 to 99) and the y-axis shows cumulative wealth. Such a graph helps visualize wealth inequality, showcasing how wealth accumulates disproportionately across different percentiles.

Quartiles: Dividing Data Sets

Quartiles split a data set into four equal parts, each representing 25% of the data.

The Four Key Quartiles

  1. First Quartile (Q1): 25% of data falls below this value.
  2. Second Quartile (Q2): Also known as the Median, where 50% of data falls below.
  3. Third Quartile (Q3): 75% of data falls below this value.
  4. Fourth Quartile (Q4): The top quarter of the data, that is, the values above Q3. Unlike Q1 to Q3, it is a range rather than a cut point.

Importance of Quartiles

Quartiles are instrumental in understanding data dispersion and central tendency. They are foundational in constructing box plots, which visualize the distribution, identify outliers, and compare different data sets.

Box Plot Components:

  • Box: Spans from Q1 to Q3; its length is the interquartile range (IQR = Q3 - Q1).
  • Median Line: Inside the box, indicating the median (Q2).
  • Whiskers: Extend to the smallest data value no lower than Q1 - 1.5 * IQR and the largest data value no higher than Q3 + 1.5 * IQR.
  • Outliers: Data points beyond the whiskers, usually drawn as individual markers.
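The box plot components listed above can be computed by hand. The following is a minimal sketch using numpy on a hypothetical sample with one unusually large value. It follows the common 1.5 * IQR rule; plotting libraries may use slightly different quartile conventions.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Quartile cut points and interquartile range.
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data values that lie inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is reported as an outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```

Here the value 30 lies above the upper fence, so it is flagged as an outlier and the upper whisker stops at 10.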

Moments: Mean, Variance, Skewness, and Kurtosis

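Moments are quantitative measures of the location and shape of a data distribution. The four most commonly reported are the mean (a raw moment), the variance (a central moment, taken about the mean), and skewness and kurtosis (standardized moments, scaled by the standard deviation):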

  1. First Moment (Mean): The average value.
  2. Second Moment (Variance): Measures data dispersion around the mean.
  3. Third Moment (Skewness): Indicates asymmetry in the distribution.
  4. Fourth Moment (Kurtosis): Describes the “tailedness” of the distribution.

Detailed Explanation

1. Mean

The mean is the sum of all data points divided by the number of points. It represents the central value of the data.

\[ \text{Mean} (\mu) = \frac{\sum_{i=1}^{N} x_i}{N} \]

2. Variance

Variance measures how much data points differ from the mean.

\[ \text{Variance} (\sigma^2) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

A higher variance indicates greater dispersion. Its square root, σ, is the standard deviation, which is expressed in the same units as the data.
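To make the two formulas concrete, here is a short sketch that computes the mean and variance directly from their definitions and compares them with numpy's built-in functions. The sample values are arbitrary. The formula above divides by N (the population variance), which matches numpy's default `ddof=0`; passing `ddof=1` gives the sample variance that divides by N - 1 instead.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Straight from the definitions.
mu = x.sum() / x.size                      # mean
var = ((x - mu) ** 2).sum() / x.size       # population variance (divide by N)
sigma = np.sqrt(var)                       # standard deviation

print(mu, var, sigma)  # 5.0 4.0 2.0

# numpy's built-ins agree; np.var uses ddof=0 (divide by N) by default.
builtin_mean = np.mean(x)
builtin_var = np.var(x)

# Sample variance divides by N - 1 instead.
sample_var = np.var(x, ddof=1)
```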

3. Skewness

Skewness quantifies the asymmetry of the data distribution.

  • Positive Skew: Tail extends to the right; typically mean > median.
  • Negative Skew: Tail extends to the left; typically mean < median.

\[ \text{Skewness} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^3}{\sigma^3} \]
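The skewness formula can be translated almost line by line into numpy. This is a minimal sketch; the helper name `skewness` and the two small samples are illustrative assumptions. It uses the population standard deviation (dividing by N), in line with the variance formula earlier.

```python
import numpy as np


def skewness(values):
    """Third standardized moment, using population (divide-by-N) statistics."""
    x = np.asarray(values, dtype=float)
    mu = x.mean()
    sigma = x.std()  # ddof=0, i.e. divides by N
    return ((x - mu) ** 3).mean() / sigma ** 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(skewness(symmetric))      # 0.0
print(skewness(right_tailed))   # positive
print(np.mean(right_tailed) > np.median(right_tailed))  # True
```

Negating every value in `right_tailed` mirrors the distribution and flips the sign of the skewness.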

4. Kurtosis

Kurtosis measures the “tailedness” of the distribution.

  • High Kurtosis: More data in the tails and more extreme outliers; often a sharper peak.
  • Low Kurtosis: Less data in the tails; often a flatter peak.

\[ \text{Kurtosis} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^4}{\sigma^4} - 3 \]

*(This form is the excess kurtosis: subtracting 3 sets the kurtosis of any normal distribution to zero.)*
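To see what the kurtosis formula reports in practice, the sketch below implements it with numpy and applies it to three simulated samples. The helper name `excess_kurtosis`, the seed, and the sample size are arbitrary choices for illustration. In theory the excess kurtosis is 0 for a normal distribution, 3 for a Laplace distribution (heavier tails), and -1.2 for a uniform distribution (no tails); simulated values will only approximate these.

```python
import numpy as np


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, using divide-by-N statistics."""
    x = np.asarray(values, dtype=float)
    mu = x.mean()
    sigma = x.std()  # ddof=0
    return ((x - mu) ** 4).mean() / sigma ** 4 - 3


rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 100_000

normal_sample = rng.normal(size=n)     # reference: excess kurtosis near 0
laplace_sample = rng.laplace(size=n)   # heavier tails: near 3
uniform_sample = rng.uniform(size=n)   # no tails: near -1.2

for name, sample in [("normal", normal_sample),
                     ("laplace", laplace_sample),
                     ("uniform", uniform_sample)]:
    print(f"{name:8s} excess kurtosis = {excess_kurtosis(sample):.3f}")
```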

Data Distributions: Normal vs. Exponential

Understanding data distributions is pivotal in statistics and machine learning, influencing how models interpret data.

Normal Distribution

Often referred to as the bell curve, the normal distribution is symmetric about the mean, with values near the mean occurring more frequently than values far from it.

Characteristics:

  • Mean = Median = Mode
  • Defined by parameters: mean (μ) and standard deviation (σ)
  • Approximately 68% of data falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ from the mean.
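The 68/95/99.7 figures can be checked without any simulation. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2), which the short sketch below evaluates with Python's standard `math` module. The function name `prob_within` is an illustrative choice.

```python
import math


def prob_within(k: float) -> float:
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k}σ: {prob_within(k):.4f}")
# within ±1σ: 0.6827
# within ±2σ: 0.9545
# within ±3σ: 0.9973
```

The result does not depend on μ or σ, which is why the rule holds for every normal distribution.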

Exponential Distribution

The exponential distribution is primarily used to model the time between events in a Poisson process. It’s characterized by a single parameter, λ (rate).

Characteristics:

  • Asymmetric: Right-skewed with a long tail.
  • Memoryless property: The probability of waiting a further time t does not depend on how long you have already waited, i.e. P(T > s + t | T > s) = P(T > t).
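Both characteristics can be illustrated in a few lines. The sketch below checks the memoryless property using the exponential survival function P(T > t) = exp(-λt), then draws a simulated sample to show the right skew. The rate, the times `s` and `t`, and the seed are arbitrary assumptions. Note that numpy parameterises the distribution by `scale = 1/λ` rather than by the rate itself.

```python
import math

import numpy as np


def survival(t: float, rate: float) -> float:
    """P(T > t) for an exponential distribution with the given rate."""
    return math.exp(-rate * t)


rate = 0.5          # lambda: on average one event every 2 time units
s, t = 3.0, 2.0

# Memorylessness: having already waited s does not change the outlook.
conditional = survival(s + t, rate) / survival(s, rate)   # P(T > s+t | T > s)
unconditional = survival(t, rate)                          # P(T > t)
print(conditional, unconditional)

# Simulated waiting times; numpy expects scale = 1 / rate.
rng = np.random.default_rng(1)
samples = rng.exponential(scale=1 / rate, size=100_000)
print(f"sample mean   = {samples.mean():.3f}  (theory: {1 / rate})")
print(f"sample median = {np.median(samples):.3f}  (theory: {math.log(2) / rate:.3f})")
```

The mean sits well above the median, which is the signature of a right-skewed distribution.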

Comparison:

While the normal distribution is symmetric, the exponential distribution is skewed, making them suitable for different types of data analyses.

Practical Implementation with Python

To solidify the understanding of these concepts, let’s explore a practical example using Python’s numpy, matplotlib, and scipy libraries.

Generating and Visualizing Data
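The article's original listing is not available, so the following is only a sketch of what this step might look like. It assumes normally distributed data with mean 0 and standard deviation 1.5, chosen because the Interpretation section reports a variance of about 2.24. The seed, sample size, bin count, and the helper name `plot_histogram` are all assumptions. matplotlib is imported inside the helper so that the data generation runs even where plotting is not set up.

```python
import numpy as np

# Assumed parameters (not taken from the original listing).
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.5, size=100_000)


def plot_histogram(values, bins=50):
    """Draw a histogram of `values` and return the matplotlib figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")
    ax.set_title("Histogram of generated data")
    return fig
```

Calling `fig = plot_histogram(data)` followed by `fig.savefig("histogram.png")` writes the figure to disk; in an interactive session the figure can be displayed instead.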

Output:

[Figure: histogram of the generated data]

Calculating Moments
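A possible way to compute all four moments with numpy and scipy is sketched below. The data generation repeats the same assumed parameters (mean 0, standard deviation 1.5, arbitrary seed), so the exact numbers will differ from the values quoted in the Interpretation section. With their default arguments, `scipy.stats.skew` and `scipy.stats.kurtosis` use the divide-by-N formulas given earlier, and `kurtosis` returns the excess kurtosis (Fisher definition).

```python
import numpy as np
from scipy import stats

# Assumed data, regenerated here so the snippet is self-contained.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.5, size=100_000)

mean = np.mean(data)              # first moment
variance = np.var(data)           # second central moment (ddof=0)
skewness = stats.skew(data)       # third standardized moment
kurt = stats.kurtosis(data)       # fourth standardized moment minus 3

print(f"mean     = {mean:.5f}")
print(f"variance = {variance:.5f}")
print(f"skewness = {skewness:.5f}")
print(f"kurtosis = {kurt:.5f}")
```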

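First Moment: Mean

Output: approximately 0.

Second Moment: Variance

Output: approximately 2.24.

Third Moment: Skewness

Output: approximately -0.00366.

*A value this close to zero indicates a nearly symmetric distribution with a negligible negative skew.*

Fourth Moment: Kurtosis

Output: approximately 0.01309.

*Close to zero, indicating a distribution similar to the normal distribution.*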

Interpretation

  • Mean (~0): Data is centered around zero.
  • Variance (~2.24): The spread of the data; this corresponds to a standard deviation of about 1.5.
  • Skewness (~-0.00366): Nearly symmetric; the negative skew is negligible.
  • Kurtosis (~0.01309): Excess kurtosis near zero, so the tails are about as heavy as those of a normal distribution.

Conclusion

A solid understanding of percentages, percentiles, quartiles, and moments is essential for effective data analysis and machine learning. These measures describe how data is distributed and how much it varies, and they underpin more advanced analytical techniques and model building. With Python’s numpy and scipy, practitioners can compute and interpret them efficiently.

Whether you’re analyzing financial data, assessing population demographics, or fine-tuning machine learning models, these foundational statistics serve as the bedrock for robust and insightful analysis.
