Beware of the hidden influences

Numbers Rule Your World 2023-07-15


On the dataviz blog, I recently posted about the Wall Street Journal's graphic on U.S. remote workers. See the post here. The original article is securely locked down behind the WSJ paywall, here.

[Figure: WSJ chart of remote work by year]

I identified a key issue with the dataviz, which is the weak connection between the Question and the Visual, using the Trifecta Checkup framework. In this post, I discuss potential problems in the D(ata) corner: how they developed the message from survey data.

Here is the primary insight, as expressed by WSJ:

Workers overall spent an average of 5 hours and 25 minutes a day working from home in 2022. That is about two hours more than in 2019, the year before Covid-19 sent millions of workers scrambling to set up home offices, and down just 12 minutes from 2021, according to the Labor Department’s American Time Use Survey.

***

The focus of the analysis is the hours of remote work, and the object of analysis is the average number of hours worked on an average day.

As I've said in my books (link), it's the stuff that doesn't go into the statistics that we should pay attention to. A big missing piece of the puzzle here is the total amount of time worked (both remote and not).

Think about the average number of hours worked. This number is not well-defined. Let's trace back how such data are collected. It's based on a survey, the Time Use Survey. The respondents to this survey are asked how many hours they worked, and how many of those hours are remote working. So far so good.

Now think about those who were unemployed, or otherwise not working at the time of the survey. Their answer to both questions was zero. Is the average number of hours worked based on the entire set of respondents? Or is it based on the subset of respondents who worked more than zero hours?

If I had to take a bet, it's the latter. Anyone collecting this data is likely to prefer a larger number, which indicates a better economy. Presenting the number of hours worked among those who worked can even be defended as easier to interpret since it does not confound two factors: unemployment level, and work level.

However, after separating the two factors, it's easy to mislead by presenting just one.... which is what WSJ did here.

(If you don't believe me, think about one of the most popular and most annoying questions that bosses pose to data analysts: "Are you sure you don't have bad data?" Do you think the bosses ask this question when the numbers look good or when they look bad?)
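To make the two possible denominators concrete, here is a minimal Python sketch. The responses are made up for illustration (they are not Time Use Survey data), and the function names are my own. It shows that the all-respondents average is just the participation rate multiplied by the average among those who worked, which is why reporting only one of the two factors can hide what is happening to the other.

```python
# Hypothetical responses: hours worked on the survey day.
# A zero stands for a respondent who did not work that day.
hours_worked = [8.0, 7.5, 0.0, 9.0, 0.0, 8.5, 0.0, 7.0]


def mean_all(hours):
    """Average over every respondent, zeros included."""
    return sum(hours) / len(hours)


def mean_among_workers(hours):
    """Average over respondents who reported more than zero hours."""
    worked = [h for h in hours if h > 0]
    return sum(worked) / len(worked)


def participation_rate(hours):
    """Share of respondents who reported more than zero hours."""
    return sum(1 for h in hours if h > 0) / len(hours)


overall = mean_all(hours_worked)              # 5.0
workers_only = mean_among_workers(hours_worked)  # 8.0
rate = participation_rate(hours_worked)       # 0.625

# The two factors multiply back to the overall average.
print(overall, workers_only, rate, rate * workers_only)
```

In this toy sample, the workers-only average (8.0) would stay the same even if more respondents dropped to zero hours; only the participation rate and the overall average would move.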

***

According to the WSJ, the average number of remote work hours went up by about two hours, from roughly 3 hours 25 min in 2019 to 5 hours 25 min in 2022. That's an increase of close to 60 percent.

The real impact is likely even bigger if we add back the missing piece - the amount of work NOT done remotely.

I'm guessing that prior to the pandemic, the 3 hours 25 min represented a smaller portion of the average number of hours worked (remote or onsite) but after the pandemic, the 5 hours 25 min is a greater portion of the total work hours. In other words, the proportion of total work hours that is remote should have gone up handily.
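As a rough sketch of the calculation I have in mind, the Python snippet below converts the remote hours into a share of total hours worked. The 2022 figure and the "about two hours more than in 2019" gap come from the WSJ quote above. The total of 8 hours a day is purely a placeholder assumption of mine, held flat across both years, and the helper names are hypothetical.

```python
def to_hours(hours, minutes):
    """Convert an hours-and-minutes figure to decimal hours."""
    return hours + minutes / 60


def remote_share(remote_hours, total_hours):
    """Remote hours as a fraction of all hours worked."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return remote_hours / total_hours


# From the WSJ quote: 5 h 25 min in 2022, about two hours more than in 2019.
remote_2022 = to_hours(5, 25)
remote_2019 = remote_2022 - 2.0

# ASSUMPTION: a placeholder total of 8 hours per day in both years.
# This is not a number from the WSJ or from the survey.
ASSUMED_TOTAL_HOURS = 8.0

share_2019 = remote_share(remote_2019, ASSUMED_TOTAL_HOURS)
share_2022 = remote_share(remote_2022, ASSUMED_TOTAL_HOURS)

print(f"2019 remote share: {share_2019:.1%}")
print(f"2022 remote share: {share_2022:.1%}")
```

Under that placeholder total, a two-hour rise in remote work shifts the remote share by 25 percentage points. The real answer depends on the actual total hours, which is why that series needs to be pulled.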

Let's see if I can find the data. Here it is (from the official Time Use Survey site):

[Figure: my redo of the WSJ remote work chart, with total hours worked as the orange line]

Importantly, one of the data series I pulled down is called "Percent participating on an avg day - Working at home, Employed, on days worked." Those last three words lead me to believe that my earlier guess - that the remote work statistics are computed on the basis of those who actually worked - is correct. I say "lead me to believe" only because I didn't use the percent participation data; rather, I used the average hours worked per day, and the label on that series was ambiguous: "Avg hrs per day for participants - Working at home, Employed." (Just par for the course for data analysts! The data dictionaries and documentation are always missing important information...)

***

Now, what about the total amount of work (remote or not)? As the orange line above shows, the number was apparently unaffected by the pandemic. A bit of a surprise, but not really.

Since the statistic omits people who stopped working, we're only looking at the amount of work done by those still working. It means that those who did not lose employment during the pandemic worked about the same average hours as before... but a greater proportion of those hours are remote work.

***

Last but not least, the data have a six-month lag, which means we are just now talking about 2022. Because of the one-time shock of the Covid-19 pandemic, and the ongoing recovery, we expect these numbers to be changing rapidly.

Also, the year 2020, when the pandemic started, dealt major disruptions to any survey activities. The footnote on the chart said that the data missed the first few months of the year.