No, I don’t believe that Harvard or MIT are hiding edX data
e-Literate 2014-06-03
Since my Sunday post What Harvard and MIT could learn from the University of Phoenix about analytics, there have been a few comments sharing a common theme: that Harvard and MIT are perhaps withholding learner-centered analytics data. As a recap, my argument was:
Beyond data aggregated over the entire course, the Harvard and MIT edX data provides no insight into learner patterns of behavior over time. Did discussion forum posts increase or decrease over time? Did video access change over time? We don't know. There is some insight we could obtain by looking at the last transaction event and the number of chapters accessed, but the insight would be limited. But learner patterns of behavior can provide real insights, and it is here where the University of Phoenix (UoP) could teach Harvard and MIT some lessons on analytics.
Some of the comments that are worth addressing:
“Non-aggregated microdata (or a “person-click” dataset, see http://blogs.edweek.org/edweek/edtechresearcher/2013/06/the_person-click_dataset.html ) are much harder (impossible?) to de-identify. So you are being unfair in comparing this public release of data with internal data analytic efforts.”
“Agreed. The part I don’t understand is how they still don’t realize how useless this all is. Unless they are collecting better data, but just not sharing it openly, hogging it to themselves until it ‘looks good enough for marketing’ or something.”
“The edX initiative likely has event-level data to analyze. I don’t blame them for not wanting to share that with the world for free though. That would be a very valuable dataset.”
The common theme seems to be that there must be learner-centered data over time, but that Harvard and MIT chose not to release this data, either for privacy reasons or for selfish ones. This is a valid question to raise, but I see no evidence to back up these suppositions.
Granted, I am arguing without definitive proof, but this is a blog post, after all. I base my argument on two points: there is no evidence of HarvardX or MITx pursuing learner-centered data over time, and I believe it is very difficult to get anything other than event-level or aggregate data out of edX, at least in its current form.
Evidence of Research
Before presenting my argument, I'd again like to point out the value of the HarvardX / MITx approach to open data, as well as their very useful interactive graphics. Kudos to the research teams.
The best places to see what Harvard and MIT are doing with their edX data are the very useful sites HarvardX Data & Research and MITx Working Papers. The best-known research, released as a summary report (much easier to present than a released de-identified open dataset), is also based on data aggregated over a course, such as this graphic:
Even more useful is the presentation HarvardX Research 2013-2014 Looking Forward, Looking Back, which lays out the types of research HarvardX is pursuing.
None of these approaches (topic modeling, pre-course surveys, interviews, or A/B testing) looks at learners' activities over time. They are all based either on specific events with many interactions (a discussion forum on a particular topic with thousands of entries, a video with many views, etc.) or on subjective analysis of an entire course. Useful data, but not based on a learner's ongoing activities.
I’d be happy to be proven wrong, but I see no evidence of the teams currently analyzing or planning to analyze such learner data over time. The research team does get the concept (see the article on person-click data):
We now have the opportunity to log everything that students do in online spaces: to record their contributions, their pathways, their timing, and so forth. Essentially, we are sampling each student’s behavior at each instant, or at least at each instant that a student logs an action with the server (and to be sure, many of the things we care most about happen between clicks rather than during them).
Thus, we need a specialized form of the person-period dataset: the person-click dataset, where each row in the dataset records a student’s action in each given instant, probably tracked to the second or tenth of a second. (I had started referring to this as the person-period(instantaneous) dataset, but person-click is much better). Despite the volume of data, the fundamental structure is very simple. [snip]
What the “person-period” dataset will become is just a roll-up of person-click data. For many research questions, you don’t need to know what everyone did every second, you just need to know what they do every hour, day or week. So many person-period datasets will just be “roll-ups” of person-click datasets, where you run through big person-click datasets and sum up how many videos a person watched, pages viewed, posts added, questions answered, etc. Each row will represent a defined time period, like a day. The larger your “period,” the smaller your dataset.
All of these datasets use the “person” as the unit of analysis. One can also create datasets where learning objects are the unit of analysis, as I have done with wikis and Mako Hill and Andrés Monroy-Hernández have done with Scratch projects. These can be referred to as project-level and project-period datasets, or object-level and object-period datasets.
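To make the roll-up idea concrete, here is a minimal sketch of turning a person-click dataset into a person-period dataset. The file name and column names (person_click.csv with user_id, timestamp, and event_type) are hypothetical placeholders, not the actual edX schema:

```python
import pandas as pd

# Hypothetical person-click dataset: one row per logged action,
# with a learner id, a timestamp, and an event type.
clicks = pd.read_csv("person_click.csv", parse_dates=["timestamp"])

# Roll up to a person-period dataset: one row per learner per day,
# counting how many actions of each type fell in that period.
person_period = (
    clicks.assign(period=clicks["timestamp"].dt.floor("D"))
    .groupby(["user_id", "period", "event_type"])
    .size()
    .unstack("event_type", fill_value=0)
    .reset_index()
)

# The larger the period (e.g. weekly instead of daily), the smaller the dataset.
```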
The problem is not with the research team; the problem is with the data available. Note how the article above refers to future systems and future capabilities. Also notice that none of this “person-period” research is referenced in current HarvardX plans.
edX Data Structure
My gut feel (somewhat backed up by discussions with researchers I trust) is that the underlying data model is the issue, as I called out in my Sunday post.
In edX, by contrast, the data appears to be organized as a series of log files structured around server usage. Such an organization allows aggregate data usage over a course, but it makes it extremely difficult to actually follow a student over time and glean any meaningful information.
If this assumption is correct, then the easiest approach to data analysis would be to look at server logs for specific events, pull out the volume of user data on that specific event, and see what you can learn; or, write big scripts to pull out aggregated data over the entire course. This is exactly what the current research seems to do.
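Here is a rough sketch of what that path of least resistance looks like, assuming JSON-lines tracking logs with username, event_type, and time fields (my reading of the edX log format, not something the open dataset confirms):

```python
import json
from collections import Counter, defaultdict

def aggregate_course_events(log_path):
    """Course-wide aggregation: trivial with flat, server-oriented logs."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            counts[event.get("event_type", "unknown")] += 1
    return counts

def per_learner_timelines(log_path):
    """Following individual learners over time: requires regrouping and
    sorting the entire log by user, which is where this organization hurts."""
    timelines = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            user = event.get("username")
            if user:
                timelines[user].append((event.get("time", ""), event.get("event_type")))
    for user in timelines:
        timelines[user].sort()
    return timelines
```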
Learner-Centered Data Analysis Over Time
It is possible to look at data over time, as was shown by two Stanford-related studies. The study Deconstructing Disengagement: Analyzing Learner Subpopulations in Massive Open Online Courses followed specific learners over time and looked for patterns of behavior.
Mike Caulfield, Amy Collier and Sherif Halawa wrote an article for EDUCAUSE Review titled Rethinking Online Community in MOOCs Used for Blended Learning that explored learner data over time.
In both cases, the core focus was learner activity over time. I believe this focus is a necessary part of any learning analytics research program that seeks to improve teaching and learning.
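As a simplified illustration of what "learner activity over time" analysis looks like (not the actual method of either study), one could build a weekly activity trajectory per learner and cluster the trajectories, reusing the hypothetical clicks frame from the earlier sketch:

```python
from sklearn.cluster import KMeans

# One row per learner: a vector of weekly activity counts, a crude
# stand-in for the engagement trajectories the disengagement study clustered.
weekly = (
    clicks.assign(week=clicks["timestamp"].dt.to_period("W").astype(str))
    .groupby(["user_id", "week"])
    .size()
    .unstack("week", fill_value=0)
)

# Group learners by the shape of their activity over time.
weekly["pattern"] = KMeans(n_clusters=4, random_state=0).fit_predict(weekly)
```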
What is interesting in the EDUCAUSE article is that the authors used Stanford's Class2Go platform, which is now part of OpenEdX. Does this mean that such data analysis is possible with edX, or does it mean that it was possible with Class2Go but not with the current platform? I'm not sure (comments welcome).
I would love to hear from Justin Reich, Andrew Ho or any of the other researchers involved at HarvardX or MITx. Any insight, including corrections, would be valuable.