Know your data 34: the best defense is offense

Numbers Rule Your World 2023-08-22

There is not much more to say than to quote the first sentence of this Bloomberg article (link):

Facebook owner Meta Platforms Inc. for years paid a contractor to scrape data from other websites while publicly condemning the practice and suing companies that pulled data from its own social-media platforms.

***

Web scraping is a very strange industry that operates in the shadows. By the end of this post, I hope you'll appreciate why the scrapers are hiding.

Let's trace how data land on websites. I'm going to use an example of baseball statistics on Baseball-Reference.com. Here is what one set of data looks like to human eyes:

The browser does not see it this way. This is what it sees (HTML):

Each line is a row of data, and these lines extend far to the right. Here is the next part:

You start to see the numbers embedded among the HTML tags here (49, 27.3, ...).

Not surprisingly, the HTML format is not efficient for storing data. The raw data are stored in databases, and there is a process that extracts the data from these databases and transforms them into HTML. The transformation essentially adds formatting that tells the browser how to display the data.
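To make the transformation concrete, here is a minimal Python sketch of that database-to-HTML step. The function name `rows_to_html`, the column names, and the sample row are all hypothetical; this is not how Baseball-Reference.com builds its pages, only an illustration of how much markup gets wrapped around each value.

```python
from html import escape


def rows_to_html(header, rows):
    """Wrap plain rows of data in HTML table markup (illustrative only)."""
    lines = ["<table>"]
    # Header cells use <th>, data cells use <td>.
    lines.append("<tr>" + "".join(f"<th>{escape(str(h))}</th>" for h in header) + "</tr>")
    for row in rows:
        lines.append("<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical data standing in for rows pulled out of a database.
header = ["Player", "Stat1", "Stat2"]
rows = [["Pitcher A", 49, 27.3]]
print(rows_to_html(header, rows))
```

Even in this toy version, the three values in the row end up surrounded by several times their own length in tags.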

"Web scraping" means grabbing the HTML-formatted files, then reversing the above process - i.e. stripping out the formatting and reducing the information back to something that looks like an Excel spreadsheet.
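The reverse step can be sketched with Python's standard library alone. This is an assumption-laden toy, not a production scraper: the class `TableExtractor`, the function `html_table_to_csv`, and the sample markup are hypothetical, and real pages have nested tables, merged cells, and other complications that this ignores.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. newlines between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html_text):
    """Strip the formatting and return spreadsheet-like CSV text."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical markup; the link around the name is dropped, the text is kept.
sample = """<table>
<tr><th>Player</th><th>Stat1</th><th>Stat2</th></tr>
<tr><td><a href="/p/1">Pitcher A</a></td><td>49</td><td>27.3</td></tr>
</table>"""
print(html_table_to_csv(sample))
```

The output is two comma-separated lines, one for the header and one for the data row, which any spreadsheet program can open.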

***

The mystery is why this reverse process is needed.

What the scraper wants is the original dataset without the formatting. You'd think they could just ask for it directly.

In fact, some websites encourage others to use their data, so they create tools to distribute them. For example, Baseball-Reference.com has a tab called "Share & Export", with these options:

The key point is that all these options place the data directly into the users' hands. There is no scraping, no stripping of tags, etc.

This means that if scraping is necessary, it's usually against the website's wishes. Many websites (apparently including Meta/Facebook) want to "own" their data. These websites typically put up lots of barriers to impede scrapers. The process is not as smooth as the one I described above. For example, they can easily differentiate between a reader browsing the site and a scraper pulling down every page of every table - and they block the latter. They may use a variety of tools to hide the information from the scrapers. (You may have come across some webpages that make it impossible to copy data out of the tables - using Ctrl-C.) They may restrict the number of pages that can be viewed within a time window.
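As one illustration of the last barrier, a limit on page views within a time window can be sketched as a sliding-window counter. Everything here is hypothetical (the class `PageViewLimiter`, the client identifier, the limit of 3 views per 60 seconds); I have no knowledge of how any particular website implements its limits.

```python
from collections import defaultdict, deque


class PageViewLimiter:
    """Allow at most max_views page views per client in any window_seconds span."""

    def __init__(self, max_views, window_seconds):
        self.max_views = max_views
        self.window_seconds = window_seconds
        self._views = defaultdict(deque)  # client id -> timestamps of recent views

    def allow(self, client_id, now):
        views = self._views[client_id]
        # Forget views that have fallen out of the window.
        while views and now - views[0] >= self.window_seconds:
            views.popleft()
        if len(views) >= self.max_views:
            return False  # over the limit: block this request
        views.append(now)
        return True


# Timestamps are passed in explicitly (in seconds) to keep the example deterministic;
# a real server would use a clock such as time.monotonic().
limiter = PageViewLimiter(max_views=3, window_seconds=60)
results = [limiter.allow("visitor-1", t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

A human reader rarely trips such a limit, while a scraper pulling down every page of every table hits it almost immediately, which is why scrapers slow down or spread requests across many identities.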

There is a bit of an arms race between scrapers and these data hoarders.

Pretty much all the big tech companies (Meta, Google, Microsoft, LinkedIn, Amazon, etc.) don't want scrapers. Twitter was an exception, although that is quickly changing. That's why scrapers operate in a gray area. Scraping may also be seen as rebellious - upholding a belief that data should be free, or that data should be owned by the people who create them.

***

From a technical perspective, data analysts should prefer to take the data directly out of databases, rather than processing scraped files. The reason is that the insertion of formatting is not always a reversible process. Take for example the following excerpt from the table of pitching statistics:

Notice that on line 9 (Chase Anderson), there is a blank space under the column "GmScA." In a proper database, missing values are not blanks but are usually represented by symbols like "." or NA. When the data are rendered into the HTML table, those missing indicators are removed and replaced by blanks, which look better in the browser.

However, the existence of such blank spaces may confuse the scraping tool. As a result, the data might be shifted, as if the blank cell did not exist. It appears easy to spot when I highlight a specific instance like this. But larger datasets may be spread out over dozens if not hundreds of tables, and the missing values can appear anywhere. Another example of an annoying feature of HTML tables is footnote symbols printed next to the data, e.g. 45.6^#. Such formatting is for browser presentation and would not exist in the underlying database.
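Both problems can be reproduced in a few lines of Python. The row below is hypothetical (only the column name GmScA comes from the table above; the other columns and all values are made up), and the two parsing approaches are simplified stand-ins for what real scraping tools do.

```python
import re

header = ["Pitcher", "GS", "GmScA", "IP"]
# Hypothetical row in which the GmScA cell is blank.
row_html = "<tr><td>Pitcher A</td><td>2</td><td></td><td>5.0</td></tr>"

# Naive approach: collect every run of text between tags.
# The empty cell contains no text, so it silently disappears.
naive_cells = re.findall(r">([^<]+)<", row_html)
naive = dict(zip(header, naive_cells))
print(naive)  # the IP value 5.0 has shifted under GmScA, and IP is missing


def parse_cell(cell_html):
    """Turn one cell into None (blank), a float (number), or plain text."""
    text = re.sub(r"<[^>]+>", "", cell_html).strip()  # drop tags such as <sup>
    if text == "":
        return None  # restore an explicit missing-value marker
    # A number optionally followed by footnote symbols such as # or *.
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)[^\w\s.]*", text)
    return float(match.group(1)) if match else text


# Cell-aware approach: keep one entry per <td>, even when it is empty.
careful_cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, flags=re.S)
careful = dict(zip(header, (parse_cell(c) for c in careful_cells)))
print(careful)

print(parse_cell("45.6<sup>#</sup>"))  # footnote symbol stripped
```

The cell-aware version recovers the blank as an explicit missing value, but only because it guesses what the website did to the data; the original database never needed that guess.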

Other than working around obstacles planted by websites, the back-and-forth transformation has little value and may introduce errors into the data.