log(A + x), not log(1 + x)
Statistical Modeling, Causal Inference, and Social Science 2024-08-31
This recent discussion in comments reminds me of something that comes up in regression modeling from time to time. You have a variable—it could be a predictor or an outcome—that you’d like to model on the log scale. The usual reason, and it’s a good one, is that you’d like your model to have multiplicative effects, which are linear on the logarithmic scale—but some of the observations are 0, and you can’t take the log of 0. Often what’s done is to take log(1 + x).
At best this is a shortcut to whatever model you’d really like to be fitting, but that’s fine—we take lots of shortcuts in statistics, no apology for that.
The point of this post is that if you’re gonna work with log(1 + x), you should work with log(A + x), and set A to some reasonable value based on the context of your problem. If you want to set A = 1, do that only in the context of scaling x appropriately.
From a mathematical standpoint this is obvious: log(1 + x) is a special case of log(A + x), and the appropriate value of A has to depend on the scaling of x. But so, so often I see people unreflectively doing log(1 + x) without seeming to recognize the issue. Hence this post.
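To make the scaling point concrete, here’s a minimal sketch in Python. The data and the choice A = 1000 are hypothetical, just to illustrate: with dollar-scale values, the default constant 1 puts the zeros an enormous multiplicative distance below everything else, while a context-appropriate A does not. It also shows that choosing A is the same as rescaling x so that A = 1 becomes appropriate, since log(1 + x/A) = log(A + x) − log(A), a constant shift.

```python
import math

# Hypothetical data: dollar amounts, with an exact zero.
x = [0.0, 500.0, 20_000.0, 45_000.0, 120_000.0]

# Naive shortcut: log(1 + x). The constant 1 is arbitrary here --
# it is tiny relative to typical values of x, so the zero is mapped
# far below the rest of the data on the log scale.
naive = [math.log(1 + xi) for xi in x]

# Context-based choice: pick A on the scale of the problem, e.g.
# treating amounts well below $1000 as "essentially zero".
A = 1000.0
shifted = [math.log(A + xi) for xi in x]

# Equivalent view: rescale x so that A = 1 is the right constant.
# log(1 + x/A) = log(A + x) - log(A), i.e., the same transformation
# up to an additive constant.
rescaled = [math.log(1 + xi / A) for xi in x]

for a, b in zip(rescaled, shifted):
    assert math.isclose(a, b - math.log(A))
```

Under the naive transform, the gap between x = 0 and x = 500 is log(501) ≈ 6.2, dwarfing the gaps among the nonzero values; with A = 1000 the zero sits a sensible distance below the rest. The point isn’t that A = 1000 is right, but that A should be chosen deliberately from the context.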
P.S. Also relevant is this post from forever ago on the 1/4-power transformation.