R’s Garden of Probability Distributions

R-bloggers 2013-03-21

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

If you type ?Distributions at the R console you get a list of the 21 probability distributions included in the stats package that ships with base R. The same list appears in the Introduction to R manual on CRAN and in most of the many fine introductory books available for the R language. These are indeed fundamental distributions, sufficient for most elementary work in probability and statistics. The fact that the R functions implementing these distributions all follow the same syntax greatly eases a beginner's task of trying to get some useful work done with a minimum of memorization.
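The shared syntax is a prefix convention: d for the density (or probability mass) function, p for the cumulative distribution function, q for the quantile function and r for random draws, each followed by a root name for the distribution. Here is a minimal sketch using the normal and binomial distributions; the variable names and parameter values are just illustrative choices.

```r
# Normal distribution: four functions built on the root name "norm"
d_norm <- dnorm(0)          # density at x = 0
p_norm <- pnorm(1.96)       # cumulative probability P(X <= 1.96)
q_norm <- qnorm(p_norm)     # quantile function, the inverse of pnorm()
r_norm <- rnorm(3)          # three random draws

# Binomial distribution: same prefixes, root name "binom"
d_binom <- dbinom(3, size = 10, prob = 0.5)    # P(X = 3)
p_binom <- pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
q_binom <- qbinom(0.5, size = 10, prob = 0.5)  # median
r_binom <- rbinom(3, size = 10, prob = 0.5)    # three random draws

print(c(d_norm = d_norm, p_norm = p_norm, q_norm = q_norm))
print(c(d_binom = d_binom, p_binom = p_binom, q_binom = q_binom))
```

Once the pattern is familiar for one distribution, swapping the root name (gamma, cauchy, pois and so on) is usually all it takes to work with another.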

The following figure shows plots of the cumulative distribution function pgamma() and the probability density function dgamma(), along with a histogram of random draws generated by rgamma() from a gamma distribution with shape and scale parameters both set to 2.

Gamma-2-2_plot
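A figure along these lines can be sketched as follows. This is not the original plotting script: the seed, the sample size, the plotting range and the number of histogram breaks are all assumptions made for illustration.

```r
set.seed(123)   # assumed seed, only for reproducibility
shape <- 2
scale <- 2
n <- 10000      # assumed number of random draws

x <- seq(0, 20, length.out = 400)                 # assumed plotting range
draws <- rgamma(n, shape = shape, scale = scale)  # name the scale argument explicitly

op <- par(mfrow = c(1, 3))
# Cumulative distribution function
plot(x, pgamma(x, shape = shape, scale = scale), type = "l",
     main = "pgamma()", ylab = "P(X <= x)")
# Probability density function
plot(x, dgamma(x, shape = shape, scale = scale), type = "l",
     main = "dgamma()", ylab = "density")
# Histogram of random draws with the density overlaid
hist(draws, freq = FALSE, breaks = 50, main = "rgamma()", xlab = "x")
curve(dgamma(x, shape = shape, scale = scale), add = TRUE, lwd = 2)
par(op)
```

Naming the scale argument matters here: the third positional argument of the gamma functions is rate, not scale, so an unnamed 2 in that position would be read as a rate of 2.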

However, if a person isn’t familiar with how information about R is organized on CRAN, he or she might conclude that “that’s it”, or most of it anyway, with respect to R and probability distributions. Imagine the surprise, then, of a person with such modest expectations about R’s probability distributions accidentally stumbling into the overgrown garden of R’s Probability Distributions Task View. I think my first reaction was a kind of glazed-over inability to take it all in.

However, if you just let your eyes relax and pick out a flower with which you are familiar, binomial for example, you can see that the chief gardener Christophe Dutang, listed as the maintainer of the Task View, and the eight individuals whom he acknowledges have done a remarkable job of organizing the distributions according to their genus (discrete or continuous), species (binomial in this case) and variety (truncated binomial and zero-inflated binomial). I can’t imagine the number of volunteer hours it took to assemble this page, and keeping it up to date can’t be easy either. I spent a half hour or so just trying to count the distributions. Not counting copulas, random matrices and other exotica, I came up with 31 discrete, 133 continuous and 9 mixture distributions. Others may count more or fewer depending on how they group things together. It seems as if few people outside of the folks at Wikipedia have given much thought to the taxonomy of probability distributions, and only Mathematica 9, which includes 130 probability distributions, comes close to cultivating so many distributions in one coherent system. (To be fair, the online documentation for SAS, Matlab and SPSS is so distributed that it is difficult to determine how many probability distributions have been implemented in these software packages.)

While the Probability Distributions Task View may be the place to start for information about probability distributions, the complete R documentation is itself an open-ended, organic system that depends on the communication style of package authors and the experiences of everyone who leaves a record of their attempts to work with probability distributions.

The entire ecosystem of R documentation for a probability distribution function starts with the command line help (e.g. ?pgamma) and the PDF manual on CRAN for the package that includes the function, but may also include vignettes, external web pages, blog posts, and questions and discussions on help bulletin boards such as the R mailing lists and StackOverflow. For some typical examples, consider that the actuar package from Vincent Goulet et al., which provides a number of distributions of interest to actuaries, has six vignettes, while Thomas Yee's VGAM package for Vector Generalized Linear and Additive Models, a source for many R probability distributions, has a web page as well as a vignette.

John D. Cook’s clickable diagram for elementary probability distributions is hosted on his private website, while the paper by Delignette-Muller et al. on fitting distributions with R’s fitdistrplus package is hosted on an academic website. Mage's post from December 2011 on fitting distributions in R is an example of the many blog posts that deserve a second look.

As a final example of how the community comes to play a part in the extended documentation for R, consider my attempt to get a handle on the Cauchy distribution. I ran the code at the end of this post and got four very different looking plots. This is not unexpected given that I’m working with random draws from a probability distribution for which both the mean and variance are not defined. But why only two bins for the histograms?

4_cauchy_plots

Well, I wasn’t the first person to pause for a moment over this. Someone recently asked this question on StackOverflow and received some good advice.

Hats off and thank you to everyone involved in cultivating R’s garden of probability distributions.

# Cauchy plots
n <- 10000
location <- -1
scale <- 4
par(mfrow = c(2, 2))
# Make four plots
for (i in 1:4) {
  y <- rcauchy(n, location, scale)
  hist(y, freq = FALSE, col = rainbow(6),
       main = "random draw from rcauchy(-1,4)")
  # Overlay the theoretical density; dcauchy() takes location and scale
  fd <- function(y) dcauchy(y, location, scale)
  curve(fd, col = "black", add = TRUE, lwd = 2)
  rug(y, col = "grey")
}
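The two-bin histograms come from the heavy tails: a handful of extreme draws stretch the default break points over an enormous range, so nearly all of the data lands in one or two bins. One possible remedy, sketched here as my own assumption rather than as the advice given on StackOverflow, is to plot only the draws that fall inside a central window and to rescale the theoretical density by the probability of that window. The seed, the window half-width lim and the number of breaks are arbitrary choices.

```r
set.seed(42)    # assumed seed, only for reproducibility
n <- 10000
location <- -1
scale <- 4
y <- rcauchy(n, location, scale)

# Keep only the draws within an assumed window around the location parameter
lim <- 50
y_mid <- y[abs(y - location) <= lim]
frac_kept <- length(y_mid) / n

# Theoretical probability that a draw falls inside the window
p_win <- pcauchy(location + lim, location, scale) -
  pcauchy(location - lim, location, scale)

# The histogram of the kept draws integrates to 1 over the window,
# so the density is divided by p_win to put both on the same footing
hist(y_mid, breaks = 100, freq = FALSE, col = "lightblue",
     main = "rcauchy(-1, 4), central window only", xlab = "y")
curve(dcauchy(x, location, scale) / p_win, add = TRUE, lwd = 2)
rug(y_mid, col = "grey")
```

The draws outside the window are dropped from the picture, not from the data, so frac_kept is worth reporting alongside the plot.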

 

 
