Computing for Statistics (Introduction to Statistical Computing)
Three-Toed Sloth 2014-01-06
Summary:
(My notes from this lecture are too fragmentary to post; here's the sketch.)
What should you remember from this class?
Not: my mistakes (though remember that I made them).
Not: specific packages and ways of doing things (those will change).
Not: the optimal algorithm, the best performance (human time vs. machine time).
Not even: R (that will change).
R is establishing itself as a standard for statistical computing, but you can expect to have to learn at least one new language in the course of a reasonable career of scientific programming, and probably more than one. I was taught rudimentary coding in Basic and Logo, but really only learned to program in Scheme. In the course of twenty years of scientific programming, I have had to use Fortran, C, Lisp, Forth, Expect, C++, Java, Perl and of course R, with glances at Python and OCaml over collaborators' shoulders, to say nothing of near-languages like Unix shell scripting. This was not actually hard, just tedious. Once you have learned to think like a programmer in one language, getting competent in the syntax of another is just a matter of finding adequate documentation and putting in the time to practice it --- or finding minimal documentation and putting in even more time (I'm thinking of you, CAM-Forth). It's the thinking-like-a-programmer bit that matters.
Instead, remember rules and habits of thinking
- Programming is expression: take a personal, private, intuitive, irreproducible series of acts of thought, and make it public, objective, shared, explicit, repeatable, improvable. This resembles both writing and building a machine: communicative like writing, but with the impersonal, everything-must-fit check of the machine. All the other principles follow from this fact, that it is turning an act of individual thought into a shared artifact --- reducing intelligence to intellect (cf.).
- Top-down design
- What are you trying to do, with what resources, and what criteria of success?
- Break the whole solution down into a few (say, 2--6) smaller and simpler steps
- If those steps are so simple that you can see how to do them, do them
- If they are not, treat each one as a separate problem and recurse
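The recursion above is language-independent; here is a minimal sketch in Python (the course itself uses R, and the task and function names are invented for illustration). The top-level problem "summarize a numeric sample" is broken into a few named sub-steps, each simple enough to write directly:

```python
def sample_mean(xs):
    """Sub-step 1: the mean -- simple enough to write directly."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sub-step 2: the variance -- reuses sub-step 1 rather than re-deriving it."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def summarize(xs):
    """The top-level problem, expressed as a few smaller, simpler steps."""
    return {"n": len(xs),
            "mean": sample_mean(xs),
            "variance": sample_variance(xs)}

result = summarize([1.0, 2.0, 3.0, 4.0])
```

Had the variance itself not been obvious, it would in turn be treated as a separate problem and decomposed again --- that is the recursion.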
- Modular and functional programming
- Use data structures to group related values together
- Select, or build, data structures which make it easy to get at what you want
- Select, or build, data structures which hide the implementation details of no relevance to you
- Use functions to group related operations together
- Do not reinvent the wheel
- Do unify your approach
- Avoid side-effects, if possible
- Consider using functions as inputs to other functions
- Take the whole-object view when you can
- Cleaner
- More extensible
- Sometimes faster
- Essential to clarity in things like split/apply/combine
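Several of these points meet in the split/apply/combine pattern: a data structure groups related values, a function groups the related operations, and another function is passed in as input. A minimal sketch in Python (names and data invented; in R this is the `split`/`lapply` or `tapply` idiom):

```python
from collections import defaultdict

def split_apply_combine(records, key, value, f):
    """Split records into groups by key, apply f to each group's
    values, and combine the results into one dict."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    return {k: f(vs) for k, vs in groups.items()}

data = [{"species": "a", "mass": 1.0},
        {"species": "a", "mass": 3.0},
        {"species": "b", "mass": 2.0}]

# The summary to compute per group is itself a function argument:
means = split_apply_combine(data, "species", "mass",
                            lambda vs: sum(vs) / len(vs))
```

Because the per-group operation is an argument, the same `split_apply_combine` works unchanged for medians, counts, or any other summary --- the whole-object view, with no side-effects.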
- Code from the bottom up; code for revision
- Start with the smallest, easiest bits from your outline/decomposition and work your way up
- Document your code as you go
- Much easier than going back and trying to document after
- Comment everything
- Use meaningful names
- Use conventional names
- Write tests as you go
- Make it easy to re-run tests
- Keep working on your code until it passes your tests
- Make it easy to add tests
- Whenever you find a new bug, add a corresponding test
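A minimal sketch of this habit in Python (the function and its bug are invented for illustration): the tests live in one function that is trivial to re-run after every change, and a bug found later becomes a permanent test case.

```python
def rescale(xs, lo=0.0, hi=1.0):
    """Linearly map the values in xs onto the interval [lo, hi]."""
    mn, mx = min(xs), max(xs)
    if mx == mn:
        # Guard added after a constant input exposed a divide-by-zero bug
        return [lo for _ in xs]
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in xs]

def test_rescale():
    out = rescale([2.0, 4.0, 6.0])
    assert min(out) == 0.0 and max(out) == 1.0   # range is right
    assert out == sorted(out)                    # order is preserved
    assert rescale([5.0, 5.0]) == [0.0, 0.0]     # regression test for that bug

test_rescale()   # easy to re-run: one call, after every change
```

Keep working on `rescale` until `test_rescale` passes silently; adding a test is just adding one more assertion.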
- Prepare to revise
- Once code is working, look for common or analogous operations: refactor them as general functions; conversely, check whether one big function mightn't be split apart
- Look for data structures that show up together: refactor them as one object; conversely, check whether some pieces of one data structure couldn't be split off
- Be willing to re-do some or all of the plan, once you know more about the problem
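What refactoring "analogous operations as general functions" looks like, in a hedged Python sketch (the example is invented): two near-twin functions collapse into one whose varying part becomes an argument.

```python
# Before: one function per summary, duplicated for every field...
def mean_mass(rows):
    return sum(r["mass"] for r in rows) / len(rows)

def mean_length(rows):
    return sum(r["length"] for r in rows) / len(rows)

# After: the common pattern extracted, with the varying part as an argument.
def mean_of(rows, field):
    return sum(r[field] for r in rows) / len(rows)

rows = [{"mass": 1.0, "length": 10.0},
        {"mass": 3.0, "length": 30.0}]
```

The general function is shorter to maintain, and extending the analysis to a new field no longer means writing (and testing) yet another copy.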
- Statistical programming is different
- Bear in