The Pareto Distribution: From Intuition to Implementation in Python
Updated on April 24, 2025 · 5 min read

1 Why care? (A quick visual)
“Roughly 80% of GitHub stars sit on 20% of projects.”
That claim feels right—open-source fame is lopsided. But how lopsided, exactly? Plot a histogram of star counts and the right tail stretches far beyond the median. Heavy-tailed data like this are commonplace in wealth, city sizes, neural-network gradients, and website traffic. The simplest continuous model for such behaviour is the Pareto (Type I) distribution.
2 Pareto principle ≠ Pareto distribution
- Pareto principle (80/20 rule) – an empirical heuristic: in many settings, 80% of outcomes stem from 20% of causes.
- Pareto distribution – a parametric power-law family with PDF:
$$ f(x; \alpha, x_m) = \begin{cases} \displaystyle \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}, & x \ge x_m, \\[6pt] 0, & x < x_m. \end{cases} $$
Only when the shape parameter
$$ \alpha = \log_{4} 5 \approx 1.16 $$
does the distribution reproduce the exact 80/20 split.
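To see where that number comes from: for $\alpha > 1$, a standard Lorenz-curve result says the top fraction $p$ of a Pareto population holds a share $p^{(\alpha-1)/\alpha}$ of the total. Setting the top-20% share equal to 80% and solving for $\alpha$:
$$ 0.2^{\frac{\alpha-1}{\alpha}} = 0.8 \;\Longrightarrow\; \frac{\alpha-1}{\alpha} = \frac{\ln 0.8}{\ln 0.2} = 1 - \log_{5} 4 \;\Longrightarrow\; \alpha = \frac{1}{\log_{5} 4} = \log_{4} 5 \approx 1.16. $$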
3 Mathematical essentials
| **Concept** | **Formula** | **Practical takeaway** |
| --- | --- | --- |
| CDF | $F(x) = 1 - \bigl(\frac{x_m}{x}\bigr)^\alpha$ | Closed form means fast simulation. |
| Survival | $S(x) = 1 - F(x) = \bigl(\frac{x_m}{x}\bigr)^\alpha$ | Straight line on a log–log plot → easy tail checking. |
| Mean | finite _iff_ $\alpha > 1$: $\displaystyle \mathbb{E}[X] = \frac{\alpha\,x_m}{\alpha - 1}$ | For $\alpha \le 1$ the mean literally **does not exist**. |
| Variance | finite _iff_ $\alpha > 2$ | Tail weight is so large that the variance can explode. |
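The survival row is the workhorse in practice: on log–log axes a Pareto survival curve is a straight line with slope $-\alpha$. Here is a minimal sketch with illustrative parameters ($\alpha = 2.5$, $x_m = 1$); the empirical survival curve should hug the theoretical line:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto
alpha, x_m = 2.5, 1.0                               # illustrative shape and scale
samples = pareto.rvs(alpha, scale=x_m, size=20_000, random_state=0)
xs = np.sort(samples)
emp_surv = 1.0 - np.arange(xs.size) / xs.size       # empirical survival, stays > 0
plt.loglog(xs, emp_surv, '.', ms=2, label='empirical S(x)')
plt.loglog(xs, (x_m / xs) ** alpha, lw=2, label=r'theoretical $(x_m/x)^\alpha$')
plt.xlabel('x')
plt.ylabel('S(x)')
plt.legend()
plt.show()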
4 Simulating a Pareto in Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto  # theoretical PDF for the overlay below
alpha, x_m = 2.5, 1  # shape (tail index) and scale (minimum value)
n = 50_000
# inverse-transform sampling: x = x_m * (1 - U)**(-1/alpha), with U ~ Uniform(0, 1)
samples = x_m * (1 - np.random.rand(n))**(-1/alpha)
# Plot histogram and theoretical PDF
plt.hist(samples, bins=200, density=True, alpha=0.4)
x = np.linspace(x_m, samples.max(), 400)
plt.plot(x, pareto.pdf(x, alpha, scale=x_m), lw=2)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('x')
plt.ylabel('PDF (log-log)')
plt.title('Simulated Pareto(α=2.5)')
plt.show()
`scipy.stats.pareto` offers the same functionality, but the one-liner above emphasizes the inverse-transform trick.
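For reference, here is a small self-contained sketch of the SciPy route next to the manual draw; the random streams are independent, so the quantiles agree only approximately:

import numpy as np
from scipy.stats import pareto
alpha, x_m, n = 2.5, 1, 50_000
manual = x_m * (1 - np.random.rand(n)) ** (-1 / alpha)            # inverse-transform draw
built_in = pareto.rvs(alpha, scale=x_m, size=n, random_state=0)   # SciPy's sampler
print(np.quantile(manual, [0.5, 0.9, 0.99]))
print(np.quantile(built_in, [0.5, 0.9, 0.99]))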
5 Fitting the distribution
5.1 Maximum-likelihood (SciPy)
alpha_hat, loc_hat, scale_hat = pareto.fit(data, floc=0)  # pin loc at 0; scale_hat then estimates x_m
`alpha_hat` is the MLE of the tail index; for a true Pareto sample it is asymptotically efficient.
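A quick sanity check on simulated data (a sketch with an arbitrary true α of 2.5) should land close to the truth:

from scipy.stats import pareto
true_alpha, x_m = 2.5, 1.0
sim = pareto.rvs(true_alpha, scale=x_m, size=20_000, random_state=0)
alpha_hat, loc_hat, scale_hat = pareto.fit(sim, floc=0)   # loc pinned at 0
print(f"true α = {true_alpha}, fitted α = {alpha_hat:.3f}, fitted x_m ≈ {scale_hat:.3f}")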
5.2 Hill estimator (tail-only)
def hill(x, k):
    x = np.sort(x)[::-1]  # order statistics, descending
    return k / np.sum(np.log(x[:k] / x[k]))
alpha_hat = hill(data, k=1000)
Use the largest `k` order statistics; choose `k` via a stability (Hill) plot, as sketched below.
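A minimal Hill-plot sketch, reusing the `hill` function above (it assumes `data` is a one-dimensional array of positive observations with at least a few thousand entries):

import numpy as np
import matplotlib.pyplot as plt
ks = np.arange(100, min(3001, len(data)), 100)   # candidate k values (needs len(data) > k)
plt.plot(ks, [hill(data, k) for k in ks])
plt.xlabel('k (number of upper order statistics)')
plt.ylabel('Hill estimate of α')
plt.title('Hill plot: choose k where the estimate stabilises')
plt.show()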
5.3 Bootstrap confidence interval
import random
import numpy as np
from scipy.stats import pareto
alphas = []
for _ in range(999):
    resample = random.choices(data, k=len(data))     # resample with replacement
    alphas.append(pareto.fit(resample, floc=0)[0])   # refit α on each resample
ci = np.percentile(alphas, [2.5, 97.5])
print(f"95% CI for α: {ci[0]:.3f}–{ci[1]:.3f}")
6 Model diagnostics
- Q–Q plot of log-data vs fitted distribution.
- Kolmogorov–Smirnov test (`scipy.stats.kstest`).
- Tail comparison – overlay Pareto and (e.g.) lognormal PDFs on double-log axes; genuine power laws appear as straight lines.
from scipy.stats import kstest, lognorm
ks_p = kstest(data, 'pareto', args=(alpha_hat, 0, scale_hat)).pvalue
print(f'KS p-value: {ks_p:.3f}')
If a lognormal gives the higher log-likelihood, your data may only look like a power law.
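One rough way to run that comparison is to fit both families with the loc parameter pinned at 0 (so each has two free parameters) and compare log-likelihoods; a sketch, assuming `data` holds the positive observations used above:

from scipy.stats import pareto, lognorm
par_params = pareto.fit(data, floc=0)
logn_params = lognorm.fit(data, floc=0)
ll_pareto = pareto.logpdf(data, *par_params).sum()
ll_lognorm = lognorm.logpdf(data, *logn_params).sum()
print(f"log-likelihood  Pareto: {ll_pareto:.1f}   lognormal: {ll_lognorm:.1f}")
# with equal parameter counts, the higher log-likelihood also wins on AIC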
7 Case Study: GitHub Stars
Download a public snapshot of the most-starred repositories (CSV, ~20 k repos).
import pandas as pd
import seaborn as sns
from scipy.stats import pareto
import matplotlib.pyplot as plt
import numpy as np
# Load data
stars = pd.read_csv('most_starred_repos.csv')['Stars'].values
# Filter and rescale
xmin = 50 # only repos with ≥50 stars
data = stars[stars >= xmin] / xmin # re-scale so x_m = 1
# Fit Pareto via MLE
alpha_hat, loc_hat, scale_hat = pareto.fit(data, floc=0, fscale=1)
print(f'α = {alpha_hat:.2f}')
# Visual check
sns.ecdfplot(data=data, log_scale=True)
plt.plot(np.sort(data), pareto.cdf(np.sort(data), alpha_hat, scale=1))
plt.xlabel('Stars / 50')
plt.ylabel('CDF')
plt.show()
A fitted `alpha_hat` of roughly 1.3 confirms a fat right tail, close to the fabled 80/20 exponent (α ≈ 1.16) but not identical.
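To translate that exponent back into 80/20 language, the top-share formula used for the 80/20 derivation in Section 2 gives a back-of-envelope figure (it assumes the Pareto fit holds across the whole range of the data):

alpha_hat = 1.3                                      # illustrative value from the fit above
top20_share = 0.2 ** ((alpha_hat - 1) / alpha_hat)   # share held by the top 20% (needs α > 1)
print(f"top 20% of repos hold ≈ {100 * top20_share:.0f}% of the stars")
# ≈ 69% here, versus 80% under the classical exponent α ≈ 1.16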
8 When Not to Use Pareto
- Finite upper bound (e.g., percentages) – tails cannot be infinite.
- Multi-modal data – a single power law misses clusters.
- Variance matters – if $\alpha \le 2$, the variance is infinite, so variance-based metrics fail (see the sketch after this list).
- Regulatory/actuarial constraints – heavy tails violate solvency requirements unless tamed.
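For the variance point, a tiny illustration with arbitrary parameters (note that NumPy's `pareto` draws a Lomax variate, so adding 1 gives a classical Pareto with $x_m = 1$):

import numpy as np
rng = np.random.default_rng(1)
sizes = [10**k for k in range(2, 7)]                 # 100 … 1,000,000 draws
for alpha in (1.5, 3.0):
    draws = rng.pareto(alpha, size=max(sizes)) + 1   # classical Pareto with x_m = 1
    print(alpha, [round(float(draws[:n].var()), 2) for n in sizes])
# for α = 3 the running variance settles near the true value 0.75,
# while for α = 1.5 it typically keeps drifting upward as n grows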
9 Beyond Classical Statistics: Pareto in ML
- Stochastic-gradient magnitudes in deep nets have been reported to follow heavy-tailed, power-law-like distributions, motivating adaptive learning-rate schedules.
- Attention-weight distributions in large language models often show Pareto-like sparsity, inspiring mixed-precision storage.
- Neural scaling laws: empirical exponents for loss vs. compute have been linked to heavy-tailed weight updates.
10 One-Page Cheat Sheet
| Topic | Formula / Command |
| --- | --- |
| PDF | `pareto.pdf(x, α, scale=x_m)` |
| CDF | `pareto.cdf(x, α, scale=x_m)` |
| Draw samples | `pareto.rvs(α, scale=x_m, size=n)` |
| Fit MLE | `pareto.fit(data, floc=0)` |
| Mean exists | `α > 1` |
| Variance exists | `α > 2` |
| 80/20 shape | `α ≈ 1.16` |
11 Key Takeaways
- Pareto ≠ 80/20: the principle is a rule of thumb, while the distribution is a model with tunable α.
- Check the tail: log–log plots and survival functions reveal true power laws.
- Fit responsibly: use MLE/Hill estimators, bootstrap confidence intervals, and compare against lognormal.
- Mind the moments: for α ≤ 2 the variance is infinite, so plan analyses accordingly.
- Heavy tails are everywhere—understanding Pareto equips you for real-world, messy data.
Further Reading
- SciPy `rv_continuous` documentation for `pareto`.
- Wikipedia pages on the Pareto principle and the Pareto distribution for historical context.
Happy heavy-tail modelling!
Ready to Dive Deeper?
Level up your probabilistic modelling skills with Code Labs Academy’s Data Science & AI Bootcamp—a hands-on program that takes you from core statistics to deploying production-grade ML.