Twin peaks: an energetic exploration
Numbers Rule Your World 2022-03-08
Long-time reader Joe D. alerted me to this blog post at Stack Overflow (link).
It's a nice example of an exploratory data analysis that goes deeper than the typical such exercise. The authors summarize their lessons differently than I would have.
The authors' main message is that we should "embrace complexity" in our data. Their sworn enemy is over-aggregation.
They demonstrate over-aggregation using the following scarecrow: the chart showing the 4-week moving average of energy consumption in California in 2020.
This trend line shows a seasonal peak in energy consumption in the summer. (There is a second peak, in winter, split between the two edges of the timeline.)
Let's be clear about what they mean by "aggregation". The raw data give energy consumption in hourly "ticks" over the year, so presumably 365 days × 24 hours/day = 8,760 values in all. If all those values were plotted directly, we would have a "noisy" line that is hard to read because our eyes can't differentiate between the various sources of variation (from hour to hour, from day to day, day of week, week of year, special events, etc.).
The "aggregation" technique used here - the moving average - singles out one of those sources of variation: the long-term trend. A 4-week moving average is at the same level of granularity as the raw data, that is to say, there is still a value for every hour. The only difference is that each hourly value is now the average of the previous 4 weeks' worth of hourly values (4 weeks × 7 days/week × 24 hours/day = 672 values). The point of the moving average is to "smooth" the line, which allows us to see the long-term trend. Smoothing gets rid of the "noise", defined here as the other sources of variation.
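To make the mechanics concrete, here is a minimal sketch of such a trailing moving average in Python with pandas. The `hourly_demand` series is synthetic data of my own invention, purely for illustration - it is not the authors' dataset or code.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly consumption for one year: 365 * 24 = 8,760 values.
hours = pd.date_range("2020-01-01", periods=365 * 24, freq="h")
rng = np.random.default_rng(0)
hourly_demand = pd.Series(
    # Annual cycle + daily cycle + noise, purely illustrative.
    30
    + 10 * np.sin(2 * np.pi * hours.dayofyear / 365)
    + 5 * np.sin(2 * np.pi * hours.hour / 24)
    + rng.normal(0, 2, len(hours)),
    index=hours,
)

# 4-week trailing moving average: each hourly value becomes the mean
# of the latest 4 weeks * 7 days * 24 hours = 672 hourly values.
smoothed = hourly_demand.rolling(window=4 * 7 * 24).mean()
```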
The complaint by the authors is that this moving average removes too much information - throwing away signal with the noise. (This is different from removing "data." As I explained above, the moving-average line still contains 8,000-plus values.) In my view, they are delivering a different lesson.
Moreover, the moving average isn't one analysis but a collection of analyses, because the size of the averaging window is up to the analyst. These authors selected a 4-week moving average, which means that their moving-average line has a memory of 4 weeks: anything that happened more than 4 weeks ago is presumed to have no impact on the current value, while each hour within the prior 4 weeks is presumed to have equal impact. Some other analyst could have produced a 4-hour moving average, for example, in which each value is the average of the prior 4 hours of data. Such a line would be much more ragged than the 4-week moving-average line.
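Continuing the hypothetical `hourly_demand` sketch from above, the two analyses differ by a single parameter:

```python
# Same data, two window sizes: the choice of window is the analysis.
ma_4_weeks = hourly_demand.rolling(window=4 * 7 * 24).mean()  # 672 hours
ma_4_hours = hourly_demand.rolling(window=4).mean()           # 4 hours

# The 4-hour line still tracks hour-to-hour swings; the 4-week line
# suppresses everything shorter than about a month. A crude check:
print(ma_4_hours.std(), ma_4_weeks.std())  # the short window varies far more
```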
Besides, each of the 8,000-plus values is itself an "aggregate" - over all locations in the state, and over finer time intervals (such as seconds). So the valid question is always at what level to aggregate, rather than whether to aggregate at all. To decide the level of aggregation, one needs to specify the goal of the analysis.
***
The key lesson of the blog post is the practice of exploratory data analysis as an iterative process between two phases: refining the question, and transforming the data. As the authors said: "We found ourselves repeatedly changing how we visualized the data to reveal the underlying signals."
These signals occur at many levels. If the objective of the study is to surface the long-term trend in energy consumption, then it makes sense to look at the 4-week moving average. That is a good starting point. It is what happens next that is of interest. The analyst takes what is seen in that chart, forms hypotheses as to where further signals will be found, then dives into the data to look for confirmation.
The long blog post describes one such journey, which led to the following insightful chart:
The raw data have been re-organized. Each line on this chart represents a week of the year, so there are roughly 50 lines. Each line traces hours of the day through days of the week, starting with Sunday and ending with Saturday.
By no "aggregation", the authors mean they plotted every value in the raw dataset without transformation. I don't see that as a virtue in and of itself. What makes the above chart insightful is the specific way of arranging the data. This chart illustrates the concept of comparing "like to like". Readers can easily compare Sundays to Sundays, and Sunday mid-days to Sunday mid-days.
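As a rough sketch of that arrangement (again using the hypothetical `hourly_demand` series, not the authors' code), each week becomes one row and each day-of-week/hour-of-day slot becomes one column:

```python
# One row per week of the year, one column per (day of week, hour of
# day) slot; plotting each row as a line, faceted by day of week,
# yields the roughly-50-lines-per-panel display described above.
df = hourly_demand.to_frame("demand")
df["week"] = df.index.isocalendar().week
df["dow"] = df.index.dayofweek  # 0 = Monday ... 6 = Sunday
df["hour"] = df.index.hour

lines = df.pivot_table(index="week", columns=["dow", "hour"], values="demand")
# lines.loc[10] is week 10's trace across all 7 * 24 = 168 hourly slots.
```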
The key feature is an intra-day pattern of two peaks: a higher peak in the evening, and a lower peak in the morning. (Unfortunately, they left off the time axis, so readers must infer the timing of the peaks from the other charts in the post.)
The chart also signals that this is a stop on the journey, not the final destination. For the chart shows that the day-of-week effect is weak: there is little differentiation among the seven panels - this display is, roughly speaking, the same chart duplicated seven times. The authors recognized this, and the next iteration breaks the year into summer versus winter.
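That next step might look like the following sketch; the month-based definitions of summer (June-August) and winter (December-February) are my assumptions, not necessarily the authors':

```python
# Split the hypothetical series by season, then compare the average
# intra-day profile of each.
summer = hourly_demand[hourly_demand.index.month.isin([6, 7, 8])]
winter = hourly_demand[hourly_demand.index.month.isin([12, 1, 2])]

summer_profile = summer.groupby(summer.index.hour).mean()
winter_profile = winter.groupby(winter.index.hour).mean()
# Overlaying the two 24-point profiles shows whether the twin-peak
# shape, and the relative heights of the peaks, differ by season.
```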
***
I enjoyed reading the post. I just don't agree with the claim that "aggregation is the standard best practice for analyzing time series data". There are those who practice this - and Excel's trendline tool makes it easy to execute. However, I don't think any textbook on time-series analysis recommends moving averages as the best and greatest tool in the shed. The lesson of the blog post is different: it's the thinking process of how to interact with the data, specifically, how each analysis produces hypotheses, and how one navigates through the data in order to confirm or deny those guesses.