The fun and frustration of seeking context

Numbers Rule Your World 2022-03-10

This is the third post related to an exploratory analysis of electricity consumption data from California, published at StackOverflow. The previous two posts are here and here.

***

Exploratory analysis always contains twists and turns. The linear presentation shown at StackOverflow likely reflects an ex-post simplification that dropped dead ends, U-turns, etc.

There are two points in this journey at which I'd have taken a different turn.

The first moment relates to the peak consumption day in 2020:

Stackoverflow_googlesearch

The analyst discovered that on August 19, between 4 and 6 pm, there was a peak in usage. The analyst then looked for context (a good idea!). This leads to the following sentences:

A quick Google search for “California Aug 19th 2020” shows that the region was suffering from wildfires, so perhaps people kept windows closed and the AC on instead of opening their windows to the cooler nighttime air. September 6 also shows up among the highest values, and a search indicates a likely cause: a record-breaking heat wave in California that hit the national news while the fires continued to burn.

If one wants to make this a key conclusion of the analysis, much more work is needed to validate the hypothesis. Heatwaves and wildfires are events that affect a region for a duration of time. It would be interesting to see the trend in energy use over that whole time window. Also, relying on the top N results of a Google search is risky when trying to explain extreme values. Extreme values are frequently caused by extreme events, and it's not clear that the relevant extreme event would be picked up as one of the top N results by a search engine.

It's the phrase "quick Google search" that doesn't sit well with me. This part of the analysis moves from descriptive insights to cause-and-effect claims, and a "quick" search is not going to do the job!
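As a first step toward looking at the whole time window rather than a single peak, one could pull out the days on either side of the maximum and see whether usage built up and decayed over the course of an event. Here is a minimal sketch of that idea. The numbers are made up for illustration (they are not the California data), and the function name `window_around_peak` and the 7-day window are my own assumptions.

```python
from datetime import date, timedelta


def window_around_peak(daily, days_before=7, days_after=7):
    """Return (peak_day, [(day, value), ...]) for the days surrounding the maximum."""
    peak_day = max(daily, key=daily.get)
    start = peak_day - timedelta(days=days_before)
    end = peak_day + timedelta(days=days_after)
    window = [(d, v) for d, v in sorted(daily.items()) if start <= d <= end]
    return peak_day, window


# Made-up daily peak demand in arbitrary units, NOT the California data:
# a flat baseline of 100 for Aug 1 to Sep 30, plus a hypothetical
# multi-day event that builds up and then fades.
first_day = date(2020, 8, 1)
daily = {first_day + timedelta(days=i): 100.0 for i in range(61)}
event = [105, 112, 120, 126, 131, 135, 128, 115]
for offset, value in enumerate(event):
    daily[date(2020, 8, 14) + timedelta(days=offset)] = float(value)

peak_day, window = window_around_peak(daily)
for day, value in window:
    marker = "  <- peak" if day == peak_day else ""
    print(f"{day.isoformat()}  {value:6.1f}{marker}")
```

If the days around the peak look like the baseline, a multi-day event such as a heat wave or wildfire is a less convincing explanation. If usage ramps up and down around the peak, the hypothesis at least survives a first check, though it still says nothing about cause.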

***

The second moment at which I'd have taken a different turn concerns this scatter plot:

Stackoverflow_scatterplot

The analyst has now switched to a Texas dataset. The picture plots the relationship between electricity consumption and temperature. Each dot in the scatter plot is an hour.

I have seen a lot of similar analyses, and they all suffer from the ecological fallacy. It's the problem of over-aggregation. Ironically, that is the topic of the StackOverflow blog post, although here it occurs on the spatial dimension rather than the temporal dimension.

Each temperature value is the average temperature across all of Texas, and each energy consumption value is the average usage across all Texans - for a specific hour of a specific day.

If there is a pattern that affects all (or most) of Texas localities in the same direction (and similar magnitude), then this chart will faithfully show the pattern. However, most interesting patterns are likely going to affect different localities in different directions/magnitudes, and so the above chart would have "aggregated away the signal", to use their words.

Just imagine a single dot, which is a single hour of a day, and focus on the temperature dimension only. At any moment in time, the dot represents the average temperature across all weather stations in Texas, but the variability of temperature across those stations would be quite large. Some localities will experience higher-than-normal temperatures while others will show lower-than-normal ones. How much weight each locality carries in average Texas energy consumption is another factor lurking behind these numbers. The point is that the above scatter plot is not useful.
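To see how spatial averaging can erase a signal, here is a small simulation. It is a sketch under loud assumptions: two hypothetical localities, synthetic temperature anomalies, equal weights, and responses that I have deliberately set to go in opposite directions. None of it comes from the Texas dataset.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(42)
n_hours = 2000

# Synthetic temperature anomalies for two made-up localities,
# drawn independently of each other.
temp_a = [random.gauss(0, 5) for _ in range(n_hours)]
temp_b = [random.gauss(0, 5) for _ in range(n_hours)]

# Assumed responses: usage rises with temperature in A and falls in B.
use_a = [10 + 2 * t + random.gauss(0, 2) for t in temp_a]
use_b = [10 - 2 * t + random.gauss(0, 2) for t in temp_b]

# "Statewide" hourly averages with equal weights (another assumption).
avg_temp = [(a + b) / 2 for a, b in zip(temp_a, temp_b)]
avg_use = [(a + b) / 2 for a, b in zip(use_a, use_b)]

r_a = pearson(temp_a, use_a)
r_b = pearson(temp_b, use_b)
r_avg = pearson(avg_temp, avg_use)

print(f"locality A: r = {r_a:+.2f}")
print(f"locality B: r = {r_b:+.2f}")
print(f"aggregated: r = {r_avg:+.2f}")
```

Each locality shows a strong relationship between temperature and usage, yet the aggregated series shows almost none, because the two effects cancel in the average. Real localities will not be this tidy, and unequal weights would change the picture, but the mechanism is the same.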

The lesson of the StackOverflow blog post is that instead of this aggregated scatter plot, the analyst should disaggregate the spatial dimension, and look for interesting patterns there.