Small Data and Big Data Should Be Best Friends
There is a scholarly politics to research methods that I find baffling. Some people consider qualitative research methods—like interviews and observations—to be squishy or lacking rigor. Some folks find quantitative methods—like social network analysis or econometrics—to be myopic or dehumanizing. In different departments or schools, certain methods are held in esteem and others in disdain. It seems completely self-evident to me that all of these forms of inquiry, when applied carefully and thoughtfully to the right questions, can lead to insight. Calling one set better than another is like calling chocolate ice cream better than vanilla ice cream; of course some people have a personal preference, but the world would be deeply incomplete without either.
Some of the "method wars" that have been fought in the past surface again in the presence of "big data." I think that big data and small data should be besties, and thus I rise in defense of methodological pluralism.
One argument at the heart of methodological pluralism is the idea that any form of inquiry has limits. In grossly oversimplifying terms, qualitative methods are oft accused of lacking generalizability, and quantitative methods are found to lack granularity. So a more complete view of the world can be found if we bring our different methodological lenses to bear on the same issues, and let the truest picture emerge from many lenses, each of which brings different elements into focus.
This is a position that danah boyd and Kate Crawford take in their recent widely circulated paper, Six Provocations for Big Data:
Finally, in the era of the computational turn, it is increasingly important to recognize the value of 'small data'. Research insights can be found at any level, including at very modest scales. In some cases, focusing just on a single individual can be extraordinarily valuable. Take, for example, the work of Tiffany Veinot (2007), who followed one worker - a vault inspector at a hydroelectric utility company - in order to understand the information practices of a blue-collar worker. In doing this unusual study, Veinot reframed the definition of 'information practices' away from the usual focus on early-adopter, white-collar workers, to spaces outside of the offices and urban context. Her work tells a story that could not be discovered by farming millions of Facebook or Twitter accounts, and contributes to the research field in a significant way, despite the smallest possible participant count. The size of data being sampled should fit the research question being asked: in some cases, small is best.
An even stronger position for methodological pluralism suggests that diverse methods are not only necessary to compensate for each other's limitations, but that different forms of methodological inquiry can actually improve the insights we draw from other forms. You don't just need big data because it allows for findings that you can't get from small data; you need big data because it can make small data methods better, and vice versa.
In my own research, I've found that small data insights are essential for making meaning out of patterns in big data. But I want to spend a few moments on another phenomenon from my research: how big data improved my use of small data methods. In that work (for those new to this blog), I examined a population of over 200,000 publicly viewable, education-related wikis, closely examined several randomly drawn samples of hundreds of wikis from this set, surveyed thousands of teachers, and interviewed and observed dozens of educators and students. Here are four ways that having access to hundreds of thousands of learning environments (big data) improved or informed my work with surveys, interviews, and observations (small data).
Sampling More Effectively
Education technology research often relies on case studies of special populations, "hot houses" where teachers and students have access to special teachers, resources, supports, etc. Big data can allow better sampling. For instance, in my wiki study, I conducted interviews and observations with teachers randomly sampled from a population of wikis, rather than relying on people I knew, or found online, or other forms of convenience or purposive sampling. For many of my observations, anytime I had occasion to travel to a city (not perfectly exogenous, but still somewhat arbitrary), I'd call up a randomized-order list of recently edited education wikis and try to arrange a visit. That's not perfect random sampling, but finding sites through that approach brought a wider group into my studies than just dropping by the classrooms of people I already knew or who were in my networks.
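To make that concrete, here is a minimal Python sketch of what such a recruitment routine might look like. It is only an illustration under assumed conditions: the field names ("last_edit", "url") and the toy records are hypothetical stand-ins for whatever the real wiki dataset records.

```python
import random
from datetime import datetime, timedelta

def recruitment_order(wikis, days=30, seed=42):
    """Randomize the order of recently edited wikis for visit recruitment.

    Field names like "last_edit" are hypothetical stand-ins for whatever
    the real dataset records about each wiki.
    """
    cutoff = datetime.now() - timedelta(days=days)
    recent = [w for w in wikis if w["last_edit"] >= cutoff]
    rng = random.Random(seed)  # fixed seed keeps the recruitment order reproducible
    rng.shuffle(recent)
    return recent

# Toy data standing in for the 200,000-wiki dataset.
wikis = [
    {"url": "wiki-a", "last_edit": datetime.now() - timedelta(days=3)},
    {"url": "wiki-b", "last_edit": datetime.now() - timedelta(days=400)},
    {"url": "wiki-c", "last_edit": datetime.now() - timedelta(days=10)},
]
for w in recruitment_order(wikis):
    print(w["url"])  # work down this list when trying to arrange visits
```

Fixing the random seed means the order I work through in a given city is arbitrary but documentable, which is the point of the exercise.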
Identifying Similarities and Differences between Samples and Populations
In publishing my work, I was challenged by peer reviewers to explain how the small sample of wikis I subjected to deep content analysis was representative of the whole population. One way I could respond was by comparing my sample and the full population along a host of computationally accessible variables (number of users, number of posts, date of creation, etc.) to demonstrate that, on these observable variables, my sample did not differ significantly from the population. And of course, had I found strange discrepancies, I could have surmised the presence of issues with my sampling that would have been impossible to discern without the big data context for my small data.
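Here is a hedged sketch of that sample-versus-population check, using a two-sample Kolmogorov-Smirnov test from SciPy. The variable names and placeholder numbers are invented for illustration, not drawn from the study.

```python
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

# Placeholder values; in the real study each list would hold one value per
# wiki (users, posts, age in days) for the analysis sample vs. the population.
observables = {
    "num_users": ([3, 5, 2, 8, 4], [2, 4, 6, 3, 5, 7, 1, 4]),
    "num_posts": ([12, 40, 7, 22, 15], [9, 30, 11, 25, 14, 8, 19, 40]),
}

for name, (sample, population) in observables.items():
    stat, p = ks_2samp(sample, population)
    verdict = "check sampling" if p < 0.05 else "no significant difference"
    print(f"{name}: KS statistic={stat:.2f}, p={p:.3f} ({verdict})")
```

The KS test is only one reasonable choice here; any two-sample test of distributional difference would serve the same purpose of flagging observable variables on which the sample diverges from the population.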
Verifying and Contextualizing Self-Report Data
I had a fascinating interview once with a teacher who discussed all the amazing things he was doing with his wiki, which proved to be completely untrue. This is unfortunately not all that uncommon in interview or survey research: so-called "social desirability bias" can mean that people tell you what they think you want to hear. Having full access to real-time data about his wiki allowed me to qualify his self-reported data. Also, by randomly sampling interviewees based on a big dataset, I was able to go into interviews knowing something about each specific interviewee's experience with wikis. If I chose, I could ask specific questions based on a teacher's activity history revealed by the dataset.
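As an illustration, here is a small Python sketch of condensing a wiki's logged activity into facts to bring into an interview. The edit log and its fields are hypothetical, not the actual structure of the wiki data.

```python
from datetime import datetime

# Hypothetical edit log for one teacher's wiki: (timestamp, page) tuples.
edit_log = [
    (datetime(2012, 9, 14), "HomeworkPage"),
    (datetime(2012, 9, 21), "HomeworkPage"),
    (datetime(2013, 1, 8), "SyllabusPage"),
]

def activity_summary(log):
    """Condense a wiki's edit log into facts to bring into an interview."""
    pages = {page for _, page in log}
    first = min(t for t, _ in log)
    last = max(t for t, _ in log)
    return {"edits": len(log), "pages": len(pages),
            "first": first.date(), "last": last.date()}

print(activity_summary(edit_log))
# A claim like "we use the wiki every week" can then be weighed against
# the logged edit count and date range before or during the interview.
```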
Triangulating among Data Sources
In my wiki study, I conducted a survey among wiki-using teachers. Despite a variety of carefully designed incentives, response rates were low. It is impossible to know how my sample of wiki-using teachers differs from the population of wiki-using teachers, since I don't have data on the full population of teachers and did not successfully draw a randomized sample. What I can do, though, is examine the wikis created by teachers in my sample and compare them to the full population of wikis, for which I do have data. If differences exist between the wikis in my survey sample and the wikis in the population, then that suggests a certain kind of response bias in my survey sample.
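One simple way to quantify such a difference is an effect size like Cohen's d between the respondents' wikis and the population. This sketch computes it with the Python standard library over placeholder numbers that are purely illustrative.

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference between two groups (pooled SD)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Placeholder values: posts per wiki for survey respondents vs. the population.
respondent_posts = [25, 40, 18, 33, 29]
population_posts = [12, 9, 22, 15, 8, 30, 11, 14]

d = cohens_d(respondent_posts, population_posts)
print(f"Cohen's d = {d:.2f}")  # a large d would flag response bias, e.g.
                               # heavier wiki users being likelier to respond
```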
So there it is, friends. Lay down your snoot and snark about people who study things in different ways than you do. Small data and big data are best friends. The researchers who employ these diverse methods should buddy up as well. It's not just that we can learn from each other's approaches; it's that we can make each other better at what we do ourselves.
For regular updates, follow me on Twitter at @bjfr and for my papers, presentations and so forth, visit EdTechResearcher.
- Justin Reich