Anonymizing data on the users of Wikipedia
Élan Vital 2018-07-30
Wikipedia currently tracks and stores almost no data about its readers and editors. This persistently foils researchers and analysts inside the WMF and its projects, and it is largely unnecessary.
Not tracked, last I checked: sessions, clicks, where on a page readers spend their time, time spent on a page or on the site, and returning users. There is one small exception: data that can fingerprint a user's use of the site is stored for a limited time, visible only to developers and checkusers, in order to combat sockpuppets and spam.
This is all done in the spirit of preserving privacy: not gathering data that could be used by third parties to harm contributors or readers for reading or writing information that some nation or other powerful group might want to suppress. That is an essential concern, and Wikimedia’s commitment to privacy and pseudonymity is wonderful and needed.
However, the data we need to improve the site and understand how it is used in aggregate doesn't require storing personally identifiable data that could be used to target specific editors. Rather than throwing out data that we worry would expose users to risk, we should be fuzzing and hashing it to preserve the aggregates we care about. Browser fingerprints, including the username or IP, can be hashed; timestamps and anything that could be interpreted as geolocation can have noise added to them.
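To make "fuzzing and hashing" concrete, here is a minimal sketch in Python. The salt handling, field names, and noise scales are my assumptions for illustration, not actual WMF practice: identifying fields get a keyed hash (stable within one salt period, unlinkable once the salt is rotated and discarded), and timestamps and coordinates get random noise.

```python
import hashlib
import hmac
import random

# Illustrative sketch only: salt handling, field names, and noise scales
# are assumptions, not actual WMF practice.

SECRET_SALT = b"rotate-and-discard-periodically"  # discarding old salts unlinks old hashes

def hash_fingerprint(user_or_ip: str, user_agent: str) -> str:
    """Keyed hash of identifying fields: stable within one salt period,
    irreversible without the salt."""
    msg = f"{user_or_ip}|{user_agent}".encode()
    return hmac.new(SECRET_SALT, msg, hashlib.sha256).hexdigest()

def laplace_noise(scale: float) -> float:
    """The difference of two Exp(1) draws is Laplace-distributed."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def fuzz_timestamp(ts_seconds: float, scale: float = 300.0) -> float:
    """Blur a timestamp on a roughly five-minute scale, so exact visit
    times cannot be matched against external logs."""
    return ts_seconds + laplace_noise(scale)

def fuzz_geolocation(lat: float, lon: float, noise_deg: float = 0.5) -> tuple:
    """Blur coordinates to roughly region-level precision."""
    return (lat + random.uniform(-noise_deg, noise_deg),
            lon + random.uniform(-noise_deg, noise_deg))
```

The keyed hash lets us count the same visitor twice within a salt period (for session and return-visit aggregates) without anyone being able to reverse it back to an IP or username.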
We could then know things such as, for instance:
- the number of distinct users in a month, by general region
- how regularly each visitor comes to the projects; which projects + languages they visit [throwing away user and article-title data, but seeing this data across the total population of ~1B visitors]
- in particular, bounce rates and times: people who find the site, perhaps run one search, and leave
- the number of pages viewed in a session, the session's tempo, and the namespaces those pages are in [throwing away titles]
- the reading + editing flows of visitors on any single page, aggregated by day or week
- clickflows from the main page or from search results [this data is gathered to some degree; I don’t know how reusably]
These are only rough descriptions; great care must be taken to vet each aggregate for preserving privacy. But this is a known practice, one we could implement with expert attention.
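As a sketch of what vetting an aggregate can mean in practice, here is one hypothetical way to release the first item on the list above, distinct users per region: Laplace noise calibrated to a privacy budget epsilon, plus suppression of small cells. The function, its parameters, and the thresholds are mine for illustration, not a description of any existing WMF pipeline.

```python
import random

def noisy_region_counts(visits, epsilon=0.5, min_count=100):
    """visits: iterable of (region, hashed_user_id) pairs.
    Returns {region: noisy distinct-user count}, dropping cells whose
    noisy count falls below min_count.

    Assuming each user is attributed to at most one region, adding or
    removing one user changes any count by at most 1, so Laplace noise
    with scale 1/epsilon gives epsilon-differential privacy here.
    """
    distinct = {}
    for region, user_hash in visits:
        distinct.setdefault(region, set()).add(user_hash)
    result = {}
    for region, users in distinct.items():
        # Laplace(1/epsilon) noise via the difference of two Exp(1) draws.
        noise = (1.0 / epsilon) * (random.expovariate(1.0) - random.expovariate(1.0))
        n = round(len(users) + noise)
        if n >= min_count:
            result[region] = n
    return result
```

Large regions come through with only a small relative error, while regions with a handful of visitors, where a single person could be identifiable, are suppressed entirely.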
What keeps us from doing this today? It may be discussed somewhere, but the past discussions I recall were brought to an early end by [devs worrying about legal] or [legal worrying about what is technically possible]. It seems to me this would help tremendously in improving our understanding of the projects, our participants, and their experience of the wikis; without it we are rather in the dark. And richer material to work with would draw in many great sociologists, historians of knowledge, and data scientists (as outside researchers, and perhaps as staff).