Social media can predict what you’ll say, even if you don’t participate

Ars Technica » Scientific Method 2019-01-22

[Image: Stylized version of Twitter's bird logo. Credit: Marie Slim / Flickr]

There have been a number of high-profile criminal cases that were solved using the DNA that family members of the accused placed in public databases. One lesson there is that our privacy isn't entirely under our control; by sharing DNA with you, your family has the ability to choose what everybody else knows about you.

Now, some researchers have demonstrated that something similar is true of our words. Using a database of past tweets, they were able to effectively predict the next words a user was likely to use. But they could do so nearly as effectively if they only had access to what that person's contacts were saying on Twitter.

Entropy is inescapable

The work was done by three researchers at the University of Vermont: James Bagrow, Xipei Liu, and Lewis Mitchell. It centers on three different concepts relating to the informational content of messages on Twitter. The first is entropy, which in this context describes how many bits are needed, on average, to capture the uncertainty about future word choices. One way of looking at this: if you're certain the next word will be chosen uniformly from a list of 16, then the entropy is four (2⁴ is 16). The average social media user has a 5,000-word vocabulary, so choosing uniformly at random from it would give an entropy of a bit more than 12. The researchers also considered the perplexity, which is two raised to the power of the entropy—16 in the example just given, where the entropy is four.
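The entropy/perplexity relationship described above can be sketched in a few lines of Python. This is a uniform-choice illustration of the definitions, not the researchers' actual language model (their method estimates entropy from real word sequences):

```python
import math

def entropy_bits(num_choices: int) -> float:
    """Entropy (in bits) of a uniform choice among equally likely words."""
    return math.log2(num_choices)

def perplexity(entropy: float) -> float:
    """Perplexity is 2 raised to the entropy: the effective
    number of equally likely next-word choices."""
    return 2 ** entropy

# The article's example: 16 equally likely next words -> 4 bits of entropy.
print(entropy_bits(16))    # 4.0
print(perplexity(4.0))     # 16.0

# A 5,000-word vocabulary chosen uniformly at random:
print(entropy_bits(5000))  # a bit more than 12 bits
```

Real text is far from uniform—common words are much more likely than rare ones—so actual per-word entropy for a user is well below the 12-bit uniform ceiling, which is exactly what makes prediction possible.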
