The Musk sampling plan, thought through
Numbers Rule Your World 2022-05-17
Elon Musk likes to stir things up, and last Friday he claimed Twitter exaggerated its user statistics by counting "spam" accounts. Of course, Musk has made a bid to buy the social-media company, and he may just be looking for a discount.
He posted a tweet asking for ideas on how to estimate the proportion of spam/fake accounts on Twitter.
Later he suggested "ignore the first 1,000 followers, then pick every 10th".
Let's play this out a bit. When I'm done, you'll realize why sampling is a profession.
***
Each Twitter account has a user name (mine is @junkcharts), and no two users can have the same user name. Each user has two basic counts: followers and following. The typical user has more following than followers - that would be true of most content consumers. If one is a content producer, the opposite is hopefully true: more followers than following. Elon Musk, for example, has over 90 million followers but he only follows 100 or so accounts.
So what happens when the first 1,000 followers are ignored, as Musk stipulated? Presumably he believes the earlier followers of any account are unrepresentative. It's not clear this exclusion does what he intends. The exclusion rule won't affect his own account as 1,000 is a rounding error. The set of Musk followers - with or without the first 1,000 - is essentially the same.
Applying this exclusion to other people's accounts may be problematic because 1,000 followers is a pretty high bar. 2838, 70, 133, 564, 3432, 608, 16K, 10K, 982, 4648. Those are the follower counts for the first 10 accounts shown on Musk's feed. Half of those accounts have fewer than 1,000 followers. After removing the first 1,000 from these smaller accounts, there is no follower left to pick from. Three more accounts have roughly 3,000 to 5,000 followers; excluding a third to a fifth of their followers from sampling seems severe. Thus, the Musk sampling plan introduces a large-account bias.
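To make the effect of the exclusion concrete, here is a minimal sketch that applies it to the ten follower counts quoted above. It assumes "16K" and "10K" mean 16,000 and 10,000; the function names are mine, purely for illustration.

```python
# Follower counts for the first 10 accounts on Musk's feed, as quoted above.
# Assumption: "16K" and "10K" are read as 16,000 and 10,000.
follower_counts = [2838, 70, 133, 564, 3432, 608, 16_000, 10_000, 982, 4648]
EXCLUDED = 1_000  # "ignore the first 1,000 followers"


def remaining_after_exclusion(count, excluded=EXCLUDED):
    """Followers still eligible for sampling once the first `excluded` are dropped."""
    return max(count - excluded, 0)


def share_excluded(count, excluded=EXCLUDED):
    """Fraction of an account's followers removed by the exclusion rule."""
    return min(excluded, count) / count


# Accounts that have nobody left to sample after the exclusion.
emptied = [c for c in follower_counts if remaining_after_exclusion(c) == 0]

for c in follower_counts:
    print(f"{c:>6} followers: {remaining_after_exclusion(c):>6} left, "
          f"{share_excluded(c):.0%} excluded")
print(len(emptied), "of", len(follower_counts), "accounts have nobody left to sample")
```

Five of the ten accounts drop out entirely, and the accounts with 2,838 and 4,648 followers lose about 35% and 22% of their followers respectively.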
***
The second step of the Musk sampling plan is described as pick every tenth follower. So, take Musk's 90 million followers minus the first 1,000, and we'll get a gigantic sample of 9 million.
Here comes the hard part. What he wants to know is the proportion of those 9 million accounts that are spam accounts. Who is going to decide whether each of the 9 million accounts is "spam", and how?
Nine million is too big a sample to handle. He may be applying the "rule of thumb" that a random sample should be 10% of the population. That's a myth busted in any intro Stats class. We typically only need to interview 1,000 Americans to generalize the sample responses to the entire population of 300 million, which is far, far, far, far smaller than 10%.
The required sample size is not a fixed number. If one can tolerate a larger margin of error, the sample can be reduced. What's our tolerable margin of error? If the true proportion of spam accounts is 5%, as Twitter management asserted, we may want a rather precise estimate, within plus or minus 2%. Reviewing Stats 101, we learn that translates to a standard error of 1%, and a sample size of 475.
What that means is if we take a random sample of 475 of Musk's followers, and learn that 5% (24) of those are spam accounts, then we can conclude that 5% of his followers are spam accounts, plus or minus 2%. (Others may disagree.)
It would be much easier to verify 500 accounts than 9 million.
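For readers who want to check the arithmetic behind the 475, here is a small sketch using the usual Stats 101 formula for the standard error of a proportion, sqrt(p(1-p)/n). It assumes the rule of thumb that the margin of error is about two standard errors (which is how plus or minus 2% becomes a standard error of 1%); the function names are mine.

```python
import math


def required_sample_size(p, margin, z=2.0):
    """Sample size n at which z * sqrt(p * (1 - p) / n) equals `margin`.

    Assumption: z = 2 is the "two standard errors" rule of thumb.
    """
    standard_error = margin / z
    return p * (1 - p) / standard_error ** 2


def margin_of_error(p, n, z=2.0):
    """Margin of error for an estimated proportion p from a random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


n = required_sample_size(p=0.05, margin=0.02)
print(round(n))                               # sample size for 5% +/- 2%
print(round(0.05 * 475))                      # spam accounts expected in that sample
print(round(margin_of_error(0.05, 475), 3))   # back to the 2% margin
```

Note that the size of the population (90 million followers) does not appear anywhere in the formula, which is why the 10% "rule of thumb" is a myth. Loosening the margin shrinks the sample quickly: `required_sample_size(0.05, 0.04)` is about 119.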
***
Perhaps Musk didn't intend to pick every 10th follower until the list of 90 million is exhausted. What he might have in mind is to pick every 10th until the required sample size is reached. Let's say we want a sample of 500, so we stop the random sampling procedure after 500 picks, or having gone through 5,000 followers.
Can you see what's wrong with this? In effect, rather than being randomly picked from the entire list of followers, the 500 all came from a narrow sliver of 5,000 followers, precisely followers #1001 to #6000 after excluding the first 1,000. But... the list of followers is ordered not at random, but chronologically. Thus, all 500 followers in the sample would have started following Musk during a specific narrow time window.
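The two readings of "pick every 10th" can be compared with a short sketch. It assumes followers are numbered chronologically from 1, that the first pick is the 10th follower after the excluded 1,000, and that Musk has exactly 90 million followers; the function name is hypothetical.

```python
def every_kth_positions(n_followers, skip=1_000, step=10, sample_size=None):
    """Chronological positions (1-based) picked by the rule
    'ignore the first `skip` followers, then pick every `step`th'.

    If `sample_size` is given, stop after that many picks.
    """
    positions = range(skip + step, n_followers + 1, step)
    if sample_size is not None:
        positions = positions[:sample_size]  # slicing a range is cheap
    return positions


N = 90_000_000  # assumed round figure for Musk's follower count

# Reading 1: run through the whole list.
full = every_kth_positions(N)
print(len(full))  # roughly 9 million picks

# Reading 2: stop once 500 followers have been picked.
picked = every_kth_positions(N, sample_size=500)
print(picked[0], picked[-1], len(picked))
print(f"{picked[-1] / N:.4%} of the follower list was ever looked at")
```

Under the second reading, the last pick is follower #6000, so the sample never reaches beyond the earliest sliver of a chronological list.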
Musk appears to be imagining that he has a list starting with the earliest followers through to the most recent. (In practice, it may be easier to obtain the reverse chronological list starting with the most recent.)
***
Notice that the sampling plan involves two steps: first sample from all users, then within each selected user, sample from followers. As stated, the Musk plan is incomplete, as we have no idea what he plans to do for step 1. Aside from his own account, what other users does he include, and how many?
Even if the first step involves a random sample from all users, this unbiased sample is later tainted by the exclusion of the first 1,000 followers, because it flushes out small accounts from the sample.
The first step is unlikely to be a simple random sample; some intentional bias may be inserted. I'm surmising that Twitter, like most social-media platforms, has a highly skewed distribution of followers. Maybe 10 percent of the biggest accounts attract 90 percent of all followers. Thus, a simple random sample will contain lots of small accounts with few followers, not an ideal situation given the exclusion rule that subsequently flushes out small accounts. (Bias is not always a bad thing!)
As explained above, a sample size of 475 followers gives a good estimate of the proportion of spam accounts among Musk's followers; I'd be hesitant to generalize it to other Twitter accounts. That's because Elon Musk is no ordinary user.
Combining the spam proportion in Musk's account with the similar proportions in other selected accounts is tricky. If intentional bias is used in the first step, then a weighted average is required. Weighting does not cure the large-account bias though.
The two-step sampling seems unnecessary. Why not just sample users directly? After all, a follower must be a user.
***
The above considerations form a starter's list.
Here are further topics:
- If one wants a random sample, then every user should have equal probability of being picked. Such a sample is likely to contain a majority of small accounts.
- Some users are lurkers. They have few followers.
- Some accounts have been abandoned but not deleted.
- Newer accounts have fewer followers and following.
- One person can have multiple Twitter accounts. Spammers probably have a higher average number of accounts than humans.
- The more accounts a user follows, the more likely the user will appear in the random samples.
- A follower may be one of the first 1,000 followers of Elon Musk and thus excluded, but the same person may be a post-1,000th follower of other accounts and included if randomly selected.
- Are all bot accounts "spam"? What is the definition of "spam"?
- All social media have tons of spam accounts. Is it better to answer the inverse question of how many humans use Twitter?