Benford law and lognormal distributions

R-bloggers 2013-03-29

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Benford’s law is nowadays extremely popular (see e.g. http://en.wikipedia.org/…). It is usually claimed that, for a given data set, changing units does not affect the distribution of the first digit. Thus, it should be related to scale-invariant distributions. Heuristically, scale (or unit) invariance means that the density (or probability function) $f(x)$ of the measure $X$ should be proportional to $f(kx)$. Because densities integrate to 1, the proportionality coefficient has to be $k^{-1}$, and therefore $f$ should satisfy the functional equation $kf(kx)=f(x)$ for all $x$ in $(1,\infty)$ and $k$ in $(0,\infty)$. The solution of this functional equation is $f(x)=x^{-1}$ (up to a multiplicative constant); I guess this can be proved easily by solving the ordinary differential equation

$\frac{d}{dk}(kf(kx))=0$
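One way to fill in that step, as a sketch that assumes $f$ is differentiable (an assumption not stated in the text): expanding the derivative with the product and chain rules gives

$\frac{d}{dk}\bigl(kf(kx)\bigr)=f(kx)+kx\,f'(kx)=0$

Writing $u=kx$, this reads $u\,f'(u)=-f(u)$, that is

$\frac{d}{du}\log f(u)=-\frac{1}{u}$

so that $\log f(u)=-\log u+\text{constant}$, and $f(u)=c\,u^{-1}$ for some constant $c>0$. The constant $c$ cancels in any ratio of integrals of $f$, so it plays no role in the distribution of the first digit.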

Now if $D$ denotes the first digit of $X$, in base 10, then

$\mathbb{P}(D=d)=\frac{\displaystyle{\int_d^{d+1} f(x)dx}}{\displaystyle{\int_1^{10} f(x)dx}}=\cdots=\frac{\displaystyle{\log\left(1+\frac{1}{d}\right)}}{\log(10)}$

which is the so-called Benford’s law. This distribution looks like this:

> (benford=log(1+1/(1:9))/log(10))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125
[6] 0.06694679 0.05799195 0.05115252 0.04575749
> names(benford)=1:9
> sum(benford)
[1] 1
> barplot(benford,col="white",ylim=c(-.045,.3))
> abline(h=0)
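The step hidden behind the dots in the formula is $\int_d^{d+1}x^{-1}dx=\log(d+1)-\log(d)=\log\left(1+\frac{1}{d}\right)$, while the denominator is $\int_1^{10}x^{-1}dx=\log(10)$. As a quick sanity check, here is a small sketch comparing numerical integration with the closed form (the names `f`, `benford_numeric` and `benford_closed` are only illustrative, they do not come from the original code):

```r
# Scale-invariant density, up to a multiplicative constant
f <- function(x) 1 / x

# Ratio of integrals, computed numerically for each first digit d = 1, ..., 9
benford_numeric <- sapply(1:9, function(d) {
  integrate(f, lower = d, upper = d + 1)$value /
    integrate(f, lower = 1, upper = 10)$value
})

# Closed form obtained from the integrals above
benford_closed <- log(1 + 1 / (1:9)) / log(10)

# Largest absolute difference between the two (expected to be tiny)
max(abs(benford_numeric - benford_closed))
```

Both vectors should agree up to the numerical error of `integrate`.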

To compute the empirical distribution from a sample, use the following function

> firstdigit=function(x){
+   if(x>=1){x=as.numeric(substr(as.character(x),1,1)); zero=FALSE}
+   if(x<1){zero=TRUE}
+   while(zero==TRUE){
+     x=x*10; zero=FALSE
+     if(trunc(x)==0){zero=TRUE}
+   }
+   return(trunc(x))
+ }

and then

> Xd=sapply(X,firstdigit)
> table(Xd)/1000
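Since `firstdigit` handles one number at a time, it has to be applied with `sapply`. A possible vectorized alternative, given here only as a sketch (the name `first_digit_vec` is hypothetical, and the function assumes strictly positive, finite inputs), is to print each number in scientific notation and keep the leading character:

```r
# Vectorized first significant digit, via scientific notation.
# Assumption: x contains only finite, strictly positive numbers.
first_digit_vec <- function(x) {
  stopifnot(all(is.finite(x)), all(x > 0))
  # "%.14e" keeps 15 significant digits, so the string always looks like
  # "d.dddddddddddddde+XX", and its first character is the leading digit
  as.integer(substr(sprintf("%.14e", x), 1, 1))
}

first_digit_vec(c(123.4, 0.0456, 0.7, 1000))
```

Keeping 15 significant digits is a design choice: numbers such as 0.7, which are not exactly representable in floating point, are rounded back to their intended decimal value before the digit is read.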

In Benford’s Law: An Empirical Investigation and a Novel Explanation, we can read

It is not a mathematical article, so do not expect any formal proof in this paper. At least, we can run Monte Carlo simulations, and see what’s going on if we generate samples from a lognormal distribution with variance $\sigma^2$. For instance, with a unit variance,

> set.seed(1)
> s=1
> X=rlnorm(n=1000,0,s)
> Xd=sapply(X,firstdigit)
> table(Xd)/1000
Xd
    1     2     3     4     5     6     7     8     9 
0.288 0.172 0.121 0.086 0.075 0.072 0.073 0.053 0.060 
> T=rbind(benford,-table(Xd)/1000)
> barplot(T,col=c("red","white"),ylim=c(-.045,.3))
> abline(h=0)

Clearly, it is not far away from Benford’s law. Perhaps a more formal test can be considered, for instance Pearson’s $\chi^2$ (goodness of fit) test.

> T=table(Xd)
> chisq.test(T,p=benford)

	Chi-squared test for given probabilities

data:  T
X-squared = 10.9976, df = 8, p-value = 0.2018
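To see what the test computes, the statistic can be reproduced by hand from the frequencies printed earlier. This is only a sketch: the counts are those frequencies multiplied by the sample size of 1,000, and the variable names (`observed`, `expected`, `stat`, `pval`) are illustrative.

```r
# Observed counts: the printed frequencies times the sample size (1000)
observed <- c(288, 172, 121, 86, 75, 72, 73, 53, 60)

# Expected counts under Benford's law
benford  <- log10(1 + 1 / (1:9))
expected <- sum(observed) * benford

# Pearson's statistic, and its p-value with 9 - 1 = 8 degrees of freedom
stat <- sum((observed - expected)^2 / expected)
pval <- pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)
round(c(statistic = stat, p.value = pval), 4)
```

The two numbers should match the `X-squared` and `p-value` reported by `chisq.test`, with a p-value well above 5%.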

So yes, Benford’s law is admissible! Now, if we consider the case where $\sigma$ is smaller (say 0.9), it is a rather different story,

compared with the case where $\sigma$ is larger (say 1.1)

It is possible to generate several samples (always of the same size, here 1,000 observations), changing only the parameter $\sigma$, and to compute the $p$-value of the test each time. There might be one tricky part: when generating samples from lognormal distributions with small variance, it might be possible that some digits do not appear at all. In that case, there is a problem with the test. So we just use here

> T=table(Xd)
> T=T[as.character(1:9)]
> T[is.na(T)]=0
> PVAL[i]=chisq.test(T,p=benford)$p.value
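The loop around these lines is not shown. A minimal sketch of what such a simulation could look like is given below; the grid of values `sigmas`, the number of replications `n_rep`, and the helper names `first_digit` and `benford_pvalue` are all assumptions made for illustration, not the original settings. Here missing digits are handled with `factor(..., levels = 1:9)`, which yields zero counts directly.

```r
benford <- log10(1 + 1 / (1:9))

# Hypothetical helper: first significant digit of positive numbers
first_digit <- function(x) as.integer(substr(sprintf("%.14e", x), 1, 1))

# p-value of the chi-square test for one lognormal sample of size n
benford_pvalue <- function(sigma, n = 1000) {
  x <- rlnorm(n, meanlog = 0, sdlog = sigma)
  # factor() with fixed levels keeps digits that never appear, with count 0
  counts <- table(factor(first_digit(x), levels = 1:9))
  chisq.test(counts, p = benford)$p.value
}

set.seed(1)
sigmas <- c(0.5, 0.9, 1.1, 1.5)  # assumed grid of sigma values
n_rep  <- 200                    # assumed number of samples per sigma

# One column of p-values per value of sigma
PVAL <- sapply(sigmas, function(s) replicate(n_rep, benford_pvalue(s)))
colnames(PVAL) <- sigmas

# Proportion of samples rejected at the 5% level, for each sigma
colMeans(PVAL < 0.05)
```

With such a matrix, `boxplot(PVAL)` would draw one box of p-values per value of $\sigma$.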

Boxplots of the $p$-values of the test are the following,

When $\sigma$ is too small, it is clearly not Benford’s distribution: for half (or more) of our samples, the $p$-value is lower than 5%. On the other hand, when $\sigma$ is large (enough), Benford’s distribution is the distribution of the first digit of lognormal samples, since 95% of our samples have $p$-values higher than 5% (and the distribution of the $p$-value is almost uniform on the unit interval). Here is the proportion of samples where the $p$-value was lower than 5% (on 5,000 generations each time)

Note that it is also possible to compute the $p$-value of the Kolmogorov-Smirnov test, testing if the $p$-value has a uniform distribution,

> ks.test(PVAL[,s], "punif")$p.value

Indeed, if $\sigma$ is larger than 1.15 (around that value), it looks like Benford’s law is a suitable distribution for the first digit.
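To get a feeling for what the Kolmogorov-Smirnov check looks like when Benford’s law holds exactly, one can run it on a reference case where the digits are drawn directly from Benford’s distribution. This is only an illustrative sketch: the reference case, the 500 replications and the names `pvals` and `ks` are assumptions, not part of the original experiment.

```r
set.seed(1)
benford <- log10(1 + 1 / (1:9))

# Reference case (assumption): first digits sampled directly from Benford's law
pvals <- replicate(500, {
  d <- sample(1:9, size = 1000, replace = TRUE, prob = benford)
  chisq.test(table(factor(d, levels = 1:9)), p = benford)$p.value
})

# Kolmogorov-Smirnov test of uniformity of the chi-square p-values
ks <- ks.test(pvals, "punif")
ks$p.value
```

In this reference case the chi-square $p$-values should be close to uniform, so a small Kolmogorov-Smirnov $p$-value would be unexpected.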

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.