Books to Read While the Algae Grow in Your Fur, July 2020
Three-Toed Sloth 2021-04-04
Attention conservation notice: I have no taste, and no qualifications to say anything about climatology, central Asian history, W. E. B. Du Bois, criminology, or post-modernity.
- Samuel S. P. Shen and Richard C. J. Somerville, Climate Mathematics: Theory and Applications
- I wanted to like this book much more than I did. It goes over some important pieces of math, not just for climatology but for lots of STEM fields; the aim is "here's the main idea and how you use it", leaving the rigor to those who want it; and it handles the numerics in R. I was hoping to assign, if not all of it, then at least large chunks, to my class on spatio-temporal statistics. As it is, I will just mine it for examples, and I won't even feel totally confident doing that unless I re-do them all.
- To illustrate why, the final chapter is on "R Analysis of Incomplete Climate Data". This is a good thing to include in an intro book, because (as they quite rightly say) real data sets almost always have missing values. They use a temperature data set from NCAR where missing values have been coded as -999, which is usually a bad practice (the Ancestors, in the form of the floating-point standards committees, gave us NaN, and R gives us NA, for a reason), but, since these are Celsius temperatures, a value below absolute zero should at least be a warning to an alert user. After doing several examples where the -999.00s are taken literally, Shen and Somerville correctly say that coding missing values as -999.00 "can significantly impact the computing results" --- so "We assign missing data to be zero" (p. 286)! (Their printed code does not actually re-assign the -999.00s to zero, but without such a re-assignment it would not produce the figure which follows this passage.) Even more astonishing, in section 11.4 (pp. 295ff), they handle this in the correct way, by replacing the -999s (strictly, values $< -490$) with NAs; see the sketch below. In between, in section 11.3 (pp. 293--295), they fit 9th and 20th (!) order polynomials to an annual temperature series from 1880 to 2016. "The choice of the 20th-order polynomial fit is because it is the lowest-order orthogonal polynomial that can mimic the detailed climate variations... We have tried higher-order polynomials which often show an unphysical overfit." (p. 294) --- I bet they do! The term "cross-validation" does not appear in the index, nor, I believe, in the book. These are especially gross mis-steps, but I fear that stuff like this is lurking in the data-analytic examples.
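  A minimal sketch of the sane handling (my code, not theirs; the toy vector `temps` is a stand-in for their NCAR series, and the $< -490$ cutoff echoes their section 11.4):

  ```r
  # Toy temperature series, with -999.00 as the sentinel code for missing values
  temps <- c(14.2, 13.9, -999.00, 14.5, -999.00, 14.1)

  # Recode the physically impossible sentinels (anything below -490 C) as NA
  temps[temps < -490] <- NA

  mean(temps)               # NA: missingness propagates instead of silently corrupting the result
  mean(temps, na.rm = TRUE) # 14.175, the mean over the observed values only
  ```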
- Other errors / causes of unhappiness (selected):
- Pp. 127--128, the chemical symbol for helium is repeatedly given as "He2", though helium is, of course, a monatomic gas (with an atomic weight of 3 or 4, depending on the isotope).
- P. 141, "Clearly the best linear approximation to the curve $y=f(x)$ at a point $x=a$ is the tangent line at $(a, f(a))$", with slope $f^{\prime}(a)$. This is not clear at all! If you want approximation at that point only, any line which goes through that point will work equally well, regardless of its slope. If you want approximation over some range, then the slope of the optimal linear approximation (in the mean-squared sense) is given by $\mathrm{Cov}(X, f(X))/\mathrm{Var}(X)$, which will equal $f^{\prime}(a)$ if $f(x)$ is a linear function. Now over a sufficiently narrow range, a well-behaved function will be well-approximated by the tangent line, i.e., a first-order Taylor approximation will work well. What counts as a "sufficiently narrow range" will depend on (i) how good an approximation you demand and (ii) the size of the remainder in Taylor's theorem. Since that remainder is $\propto (x-a)^2 f^{\prime\prime}(a)$, we need $|x-a|$ to be negligble compared to $1/\sqrt{|f^{\prime\prime}(a)|}$, which is a measure of the local curvature of the function. Requiring $|x| \ll 1$, as the authors do repeatedly, is neither here nor there.
- The book opens with a chapter on dimensional analysis. There is a good point to make here, which is that the units on both sides of an equation need to balance, so the arguments to transcendental functions (like $e^x$ or $\log{x}$ or $\sin{x}$ or $\Gamma(x)$) should be dimensionless (generally, ratios of quantities with physical dimensions). This is a good way to avoid gross mistakes. But of course you can always make the units balance by sticking the appropriate scaling factor into one side or another of the equation*. (When you do linear regression, $Y = \beta X + \mathrm{noise}$, the units of $\beta$ are always $\frac{[Y]}{[X]}$, and, e.g., an ordinary least squares estimate will respect this by construction.) Our authors want, however, to persuade the reader tha