Start at zero, or start at wherever

Junk Charts 2022-01-03

Andrew's post about start-at-zero helps me refine my own thinking on this evergreen topic.

The specific example he gave is this one:

Andrewgelman_invitezeroin

The dataset is a numeric variable (y) with values over time (x). The minimum numeric value is around 3 and the range of values is from around 3 to just above 20. His advice is "If zero is in the neighborhood, invite it in". (Link)

The rule, as usual, sounds simpler than it really is. In the discussion, Andrew highlights several considerations.

Is zero a meaningful reference value? In his example, we assume it is and so we invite zero in. But, as Andrew also says, if zero is meaningless, then recall the invitation. So context must be accounted for.

In Chapter 1 of Numbersense (link), I looked at some SAT score data of applicants to competitive colleges. Is zero a meaningful reference value for SAT scores? Someone might argue yes, since it is the theoretical minimum score that anyone could get from the test. Any statistician will likely say no, since a competitive college will have never seen an applicant submitting a score of zero, or anywhere close to zero. Thus, starting such a chart at zero inserts a lot of whitespace and draws attention to a useless insight - how far above the theoretical worst performer is someone's score.

***

What about the left panel of Andrew's chart makes us uncomfortable? I ask myself this question. My answer is that the horizontal axis highlights an arbitrary value that distracts from the key patterns of the data.

As shown below, the arbitrary value is ~2.5. This is utterly meaningless.

Redo_andrewgelman_invitezeroin

What if 0 is also a meaningless value for this dataset? I'd recommend "bench the axis". Like this:

Redo_andrewgelman_benchtheaxis

An axis is a tool to help readers understand a chart. If it isn't serving a function, an axis doesn't need to be there. When I choose a line chart for time-series data, I'm drawing attention to temporal change in the numeric values, or the range of values. I'm not saying something about the values relative to some reference number.

From this example, we also see that the horizontal axis should not be regarded as a hanger for time labels. Time labels can exist by themselves.