In this Python notebook, we are going to look at Benford’s law, which predicts the leading digit distribution, when dealing with some real-world collections of numbers. This distribution usually occurs when the numbers are rather smoothly distributed over several orders of magnitute. This can be observed with population data, file size data, stock prices, river lengths, …

We are going to check this law with a dataset of the population of all french cities (actually all settlements), as a simple experiment.

Finally, we will look at the generalized version of this first-digit law : the significant-digit law, which also predicts the occurence of other significant digits.

Imports

from zipfile import ZipFile

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

FS = (16, 9)  # figure size

Benford’s law

Here is the definition from wikipedia :

Distribution

This was first discovered by Canadian-American astronomer Simon Newcomb in 1881, when he

noticed that in logarithm tables the earlier pages (that started with 1) were much more worn than the other pages.

Let’s have a look at this distribution :

benford = pd.DataFrame(
    data={
        "proba": [np.log10(1 + 1 / d) for d in range(1, 10)],
        "leading_digit": range(1, 10),
    }
).set_index("leading_digit")
ax = benford.plot(
    legend=False, grid=True, alpha=0.5, figsize=FS, rot=0, style="o", ms=20
)
_ = ax.set(
    title="Benford's law", 
    xlabel="Leading digit", 
    ylabel="Probability of occurence"
)

Benford's law

Now let’s check our population data.

Population of french cities

Data from 2017 can be found here, from the website of the french institute of statistics and economic studies (INSEE). The chosen archive has several CSV files dealing with various administrative levels. We are going to use the commune level. Regarding population, communes are the same thing as cities when they have more than 2000 inhabitants, except for the 3 largest cities: Paris, Marseille and Lyon, where a commune corresponds to an “arrondissement”, which is some kind of district (e.g. twenty arrondissements in Paris). Below 2000 inhabitants, they correspond to villages. The data covers mainland France as well as overseas departments and regions.

Let’s start by downloading and extracting the data :

url = "https://www.insee.fr/fr/statistiques/fichier/4265429/ensemble.zip"
r = requests.get(url, allow_redirects=True)
file_name = "ensemble.zip"
_ = open(file_name, "wb").write(r.content)
with ZipFile(file_name, "r") as zip:
    zip.extractall()
!ls *.csv
Arrondissements.csv		     meta_arrondissements.csv
Cantons_et_metropoles.csv	     meta_associe.csv
Collectivites_d_outre_mer.csv	     meta_cantons.csv
Communes.csv			     meta_com.csv
Communes_associees_ou_deleguees.csv  meta_communes.csv
Departements.csv		     meta_departements.csv
Fractions_cantonales.csv	     meta_fractions.csv
Regions.csv			     meta_regions.csv

The communes.csv file has 5 columns and 34995 rows :

df = pd.read_csv("Communes.csv", sep=";")
df.head(2)
DEPCOM COM PMUN PCAP PTOT
0 01001 L' Abergement-Clémenciat 776 18 794
1 01002 L' Abergement-de-Varey 248 1 249

DEPCOM is some kind of ID code
COM is the commune name
PTOT is the total population
We are only interested in the PTOT column.

df = df[df.PTOT > 0].copy(deep=True)
number_of_records = len(df)
min_value = df.PTOT.min()
max_value = df.PTOT.max()
orders_of_magnitude = int(np.log10(max_value)) - int(np.log10(min_value))
print(f"number_of_records : {number_of_records}")
print(f"min value : {min_value}")
print(f"max_value : {max_value}")
print(f"orders of magnitude : {orders_of_magnitude}")
number_of_records : 34989
min value : 1
max_value : 484809
orders of magnitude : 5

Now we store the leading digit and compute the probability of occurence :

df["leading_digit"] = df.PTOT.map(lambda x: int(str(x)[0]))
ld_occ_proba = df.leading_digit.value_counts() / len(df)

We can observe that the data follows Benford’s law pretty well :

ax = ld_occ_proba.plot.bar(figsize=FS, alpha=0.5, label="Observed data")
ax = benford.reset_index(drop=False).plot.scatter(
    x="leading_digit",
    y="proba",
    marker="o",
    s=500,
    ax=ax,
    alpha=0.5,
    label="Benford's law",
)
ax.grid()
_ = ax.set(
    title="Benford's law on the population of french communes",
    xlabel="Leading digit",
    ylabel="Probability of occurence",
)
_ = ax.set_ylim(
    0,
)
_ = ax.legend()

french communes

Quick explanation

First we can observe that the values of Benford’s distribution corresponds to the spacing in a log scale bar :

_ = plt.subplots(1, 1, figsize=FS)
ymin, ymax = -1, 1
xs = [0] + list(benford.proba.cumsum().values)
for x in xs:
    plt.vlines(x, ymin, ymax)
xs_h = [0.5 * (xs[i] + xs[i + 1]) for i in range(len(xs) - 1)]
for i, x in enumerate(xs_h):
    plt.text(x - 0.01, 0, str(i + 1), fontsize=25)

Log scale

So if numbers are uniformly distributed in the log space over one order of magnitude, the probability that they start with the digit $i$ is actually equal to the width of the $i$-th column above : \begin{equation} P(D_1= d) = log_{10}(d+1) - log_{10}(d) = log_{10} \left( 1 + \frac{1}{d} \right) \end{equation}

Let’s generate some fake numbers uniformly in the log space, across several orders of magnitude, and check that they perfectly follow Benford’s law :

n = 1_000_000
l, u = 1, 5
fake = pd.DataFrame(data={"gen": 10 ** (l + (u - l) * np.random.rand(n))})

coef = 10
bins = 10 ** (np.arange(0, 6 * coef) / coef)
plt.figure(figsize=FS)
plt.xscale("log")
plt.grid(True, which="both", ls="-")
_ = plt.hist(fake.gen.values, bins=bins, alpha=0.5)
_ = plt.xlabel("Numbers (Log scale)")
_ = plt.ylabel("Occurance")
_ = plt.title("Distribution of the fake number collection")

Fake

fake["leading_digit"] = fake.gen.map(lambda x: int(str(x)[0]))
fake_ld_occ_proba = fake.leading_digit.value_counts() / len(fake)
ax = fake_ld_occ_proba.plot.bar(figsize=FS, alpha=0.5, label="Fake data")
ax = benford.reset_index(drop=False).plot.scatter(
    x="leading_digit",
    y="proba",
    marker="o",
    s=500,
    ax=ax,
    alpha=0.5,
    label="Benford's law",
)
ax.grid()
_ = ax.set(
    title="Benford's law on fake data",
    xlabel="Leading digit",
    ylabel="Probability of occurence",
)
_ = ax.set_ylim(
    0,
)
_ = ax.legend()

Fake

If we look at the population’s distribution in the log space, we can see that it is rather spread out, across several orders of magnitude.

coef = 10
bins = 10 ** (np.arange(0, 6 * coef) / coef)
plt.figure(figsize=FS)
plt.xscale("log")
plt.grid(True, which="both", ls="-")
n, bins, patches = plt.hist(df.PTOT.values, bins=bins, alpha=0.5)
_ = plt.xlabel("Population (Log scale)")
_ = plt.ylabel("Occurance")
_ = plt.title("Distribution of the communes' population")

Distribution

So even if it is not uniform in the log space, it is smooth and wide enough to approximately follow Bendford’s law. As explained in wikipedia:

Benford’s law can be seen in the larger area covered by red (first digit one) compared to blue (first digit 8) shading.

Distribution

For each order of magnitude, the width corresponding to the leading digit 1 is almost 6 times larger than the one corresponding to the leading digit 8. So overall, the cumulative area under the distribution curve corresponding to the leading digit 1 is also about 6 times larger than the one corresponding to the leading digit 8.

Now, an interesting thing is that it Bendford’s law can be easily generalized to every significant digit. Let’s apply the second-digit version of the law to the population dataset.

Second significant digit

Here is the generalization of Benford’s law, from Hill [1] :

\begin{equation} P(D_1 = d_1, … ,D_k = d_k) = log_{10} \left[ 1 + \left( \sum_{i=1}^k d_i 10^{k-i} \right)^{-1} \right], \; k \geq 1 \end{equation}

where $d_j$ is the $j$-th significant digit ($d_1 \in (1, …, 9)$ and $d_j \in (0, …, 9)$ for $j>1$).

So the distribution of the second digit is the following one :

\begin{equation} P(D_2 = d_2) = \sum_{d_1=1}^9 P(D_1 = d_1, D_2 = d_2) = \sum_{d_1=1}^9 log_{10} \left[ 1 + \frac{1}{10 \; d_1 + d_2} \right] \end{equation}

Let’s have a look at this second-digit distribution :

benford_2 = pd.DataFrame(data={"second_digit": range(10)}, index=range(10))

benford_2["proba"] = 0.0
for d1 in range(1, 10):
    benford_2.proba += np.log10(1 + 1 / (10 * d1 + benford_2.second_digit))

benford_2.set_index("second_digit", inplace=True)
ax = benford_2.plot(
    legend=False, grid=True, alpha=0.5, figsize=FS, rot=0, style="o", ms=20
)
_ = ax.set(
    title="Benford's law for the second significant digit",
    xlabel="Second significant digit",
    ylabel="Probability of occurence",
)

Second-digit

Although the second-digit distribution is not uniform, it is less uneven than the first-digit one.

Now we store the second digit and compute the probability of occurence :

df["second_digit"] = 0
df.loc[df.PTOT > 9, "second_digit"] = df[df.PTOT > 9].PTOT.map(lambda x: int(str(x)[1]))
sd_occ_pc = df.second_digit.value_counts() / len(df)

Again, we can observe that the data follows (the generalized) Benford’s law pretty well :

ax = sd_occ_pc.plot.bar(figsize=FS, alpha=0.5, label="Observed data")
ax = benford_2.reset_index(drop=False).plot.scatter(
    x="second_digit",
    y="proba",
    marker="o",
    s=500,
    ax=ax,
    alpha=0.5,
    label="Benford's law",
)
ax.grid()
_ = ax.set(
    title="Second-digit law on the population of french communes",
    xlabel="Second significant digit",
    ylabel="Probability of occurence",
)
_ = ax.set_ylim(
    0,
)
_ = ax.legend()

Second-digit

Reference

[1] Hill, Theodore P. A Statistical Derivation of the Significant-Digit Law. Statist. Sci. 10 (1995), no. 4, 354–363. doi:10.1214/ss/1177009869.