When the Metrical Tail Wags the University Dog

Remaking the University 2019-09-07

The rule of infrastructure is that no one thinks about it until it breaks.  This week, I was at the annual conference of the International Society for Scientometrics and Informetrics when I bumped into an example of how the massive flow of bibliometric data can suddenly erupt into the middle of a university's life.
Washington University in St. Louis (WashU) has a new chancellor.  He hired a consulting firm to interview and survey the university community about its hopes, priorities, views of WashU's culture, and desired cuts.  The firm found a combination of hope and "restlessness," which the report summarized like this: members of the WashU community
want to see a unifying theme and shared priorities among the various units of the university. Naturally, stakeholders want to see WashU rise in global rankings and increase its prestige, but they want to see the institution accomplish this in a way that is distinctively WashU. They are passionate about WashU claiming its own niche. Striving to be “the Ivy of the Midwest” is not inspiring and lacks creativity. Students feel the university is hesitant to do things until peer institutions make similar moves.
As always, the university needs to become more unique and yet more unified, and to stop following others while better following the rankings. The report might have gone on to overcome these tensions by organizing its data into a set of proposals for novel research and teaching areas.  Maybe someone in the medical school suggested a cross-departmental initiative on racism as a socially-transmitted disease. Maybe the chairs of the Departments of Economics and of Women, Gender, and Sexuality Studies mentioned teaming up to reinvent public housing with the goal of freeing quality of life from asset ownership.  These kinds of ideas regularly occur on university campuses, but are rarely funded. New proposals are not what the report has to offer. It generates lists of the broad subject areas that every university is also pursuing (pp 4-5). It embeds them in this finding:
The other bold call to action that emerged from a subset of interviewees is internally focused. This subset tended to include new faculty and staff . . . and Medical School faculty and senior staff (who perceive their [medical] campus enforces notably higher productivity standards). These interviewees are alarmed at what they perceive as the pervading culture among faculty on the Danforth Campus [the main university]. They hope the new administration has the courage to tackle faculty productivity and accountability. They are frustrated by perceived deficiencies in research productivity, scholarship expectations and teaching quality. A frequently cited statistic was the sub-100 ranking of WashU research funding if the Medical School is excluded. Those frustrated with the Danforth faculty feel department chairs don’t hold their faculty accountable. There is too much “complacency” and acceptance of “mediocrity.” “There is not a culture of excellence.” . . . Interviewees recognize that rooting out this issue will be controversial and fraught with risk. However, they believe it stands as the primary obstacle to elevating the Danforth Campus – and the university as a whole – to elite status.
Abstracting key elements gets this story: One group has a pre-existing negative belief about another group.  They think the other group is inferior to them. They also believe that they are damaged by the other's inferiority.  They offer a single piece of evidence to justify this sense of superiority. They also say the other group's leaders are solely responsible for the problem.  They have a theory of why: chairs apply insufficient discipline. They attribute the group's alleged inferiority to every member of that group.  Stripped down like this, this part of the report is boilerplate bigotry.  Every intergroup hostility offers some self-evident "proof" of its validity.  In academia's current metrics culture, the numerical quality of an indicator supposedly cleanses it of prejudice.  Lower research expenditures are just a fact, like the numbers of wins and losses that create innocent rankings like baseball standings.  So, in our culture, the med school can look down on the Danforth Campus with impunity because it has an apparently objective number--relative quantities of research funding. In reality, this is a junk metric.  I'll count some of the ways:
  • the belief precedes the indicator, which is cherry-picked from what would be a massive set of possible indicators that inevitably tells a complicated story.  (A better consultant would have conducted actual institutional research, and would never have let surveyed opinions float free of a meaningful empirical base.) 
  • the indicator is bound to Theory X, the a priori view that even advanced professionals “need to be coerced, controlled, directed, and threatened with punishment to get them to put forward adequate effort" (we've discussed Theory X vs. Theory Y here and here). 
  • quantity is equated with quality. This doesn't work--unless there's a sophisticated hermeneutical project that goes with it.  It doesn't work with citation counts (which assume the best article is the one with the most citations from the most cited journals), and its use has been widely critiqued in the scientometric literature (just one example). Quantity-is-quality really doesn't work with money, when you equate the best fields with the most expensive ones. 
  • the metric is caused by a feature of the environment rather than solely by the source under study. The life sciences get about 57 percent of all federal research funding, and the lion's share of that runs through NIH rather than NSF, meaning through health sciences more than academic biology. Thus every university with a medical school gets the bulk of its R&D funding through that medical school; note medical campuses dominating R&D expenditure rankings, and see STEM powerhouse UC Berkeley place behind the University of Texas's cancer center. (Hangdog WashU is basically tied with Berkeley.)
  • the report arbitrarily assumes only one of multiple interpretations of the metric. An alternative interpretation here is (1) that the data were not disaggregated to compare similar departments only, and instead compare the apple of a medical school to the orange of a general campus (with departments of music, art history, political science, etc.).  Another is (2) that the funding gap reflects the broader mission of arts and sciences departments, in which faculty are paid to spend most of their time on teaching, advising, and mentoring.  Another is (3) that the funding gap reflects the absurd underfunding of most non-medical research, from environmental science to sociology.  That's just three of many.
  • the metric divides people or groups by replacing relationships with a hierarchy. 
This last one is a subtle but pervasive effect that we don't understand very well.  Rankings make the majority of a group feel bad that they are not at the top. How much damage does this do to research, if we reject Theory X and see research as a cooperative endeavor depending on circuits of intelligence?  Professions depend on a sense of complementarity among different types of people and expertise--she's really good at running the regressions, he's really good with specifications of appropriate theory, etc. The process of "ordinalizing" difference, as the sociologist Marion Fourcade puts it, discredits or demotes one of the parties and can thus spoil professional interaction.  Difference becomes inferiority.  In other words, when used like this, metrics weaken professional ties in an attempt to manage efficiency.
So if Washington University takes these med school claims literally as fact, and doesn't scramble to see them as expressions of a cultural divide that must be fixed, the faulty metric just killed their planning process.
Let's take a step back from WashU.  The passage I've cited does in fact violate core principles of professional bibliometricists. They reject these kinds of "simple, objective" numbers and their use as a case-closed argument.  Recent statements of principle all demand that numbers be used only in the context of qualitative professional judgment: see DORA, Metric Tide, Leiden, and the draft of the new Hong Kong manifesto. It's also wrong that STEM professional organizations are all on board with quantitative research performance management. Referring to the basic rationale for bibliometrics, "that citation statistics are inherently more accurate because they substitute simple numbers for complex judgements"--it was the International Mathematical Union that in 2008 called this view "unfounded" in the course of a sweeping critique of the statistical methods behind Journal Impact Factor, the h-index, and other popular performance indicators. These and others have been widely debated and at least partially discredited, as in this graphic from the Leiden Manifesto:
The Leiden and Hong Kong statements demand that those evaluated be able to "verify data and analysis."  This means that use, methods, goals, and results should be reviewable and also rejectable where flaws are found.  All bibliometricists insist that metrics not be taken from one discipline and applied to another, since meaningful patterns vary from field to field.  Most agree that arts and humanities fields are disserved by them. In the U.S., new expectations for open data and strictly contextualized use were created by the Rutgers University faculty review of the then-secret use of Academic Analytics.
The best practitioners know that the problems with metrics are deep. In a Nature article last May, Paul Wouters, one of the authors of the Leiden manifesto, wrote with colleagues,
      Indicators, once adopted for any type of evaluation, have a tendency to warp practice. Destructive ‘thinking with indicators’ (that is, choosing research questions that are likely to generate favourable metrics, rather than selecting topics for interest and importance) is becoming a driving force of research activities themselves. It discourages work that will not count towards that indicator. Incentives to optimize a single indicator can distort how research is planned, executed and communicated.
In short, indicators founder over Goodhart's Law (308), which I paraphrase as, "a measure used as a target is no longer a good measure."  Thus the Leiden manifesto supports the (indeed interesting and valuable) information contained in numerical indicators while saying they should be subordinated to collective practices of judgment.
Given widespread reform efforts, including his own, why, in May, did Wouters lead-author a call in Nature to fix bad journal metrics with still more metrics, this time measuring at least five sub-components of every article?  Why does Michael Power's dark 1990s prediction in The Audit Society still hold: failed audit creates more audit?  Why are comments like those in the WashU report so common, and so powerful in academic policy? Why is there a large academic market for services like Academic Analytics, which sells ranking dashboards to administrators precisely so they can skip the contextual detail that would make them valid? Why is the WashU use of one junk number so typical, normal, common, invalid, and silencing? What do we do given that we can't criticize one misuse at a time, particularly when there's so much interest in discrediting an opposition with them?
One clue emerged in a book I reviewed last year, Jerry Z. Muller's The Tyranny of Metrics. Muller is an historian, and an outsider to the evaluation and assessment practices he reviewed.  He decided to look at how indicators are used in a range of sectors--medicine, K-12 education, the corporation, the military, etc.--and to ask whether there's evidence that metrics cause improvements of quality. Muller generates a list of 11 problems with metrics that most practitioners would agree with.  Most importantly, while these problems emerged when metrics were used for audit and accountability, they were less of a problem when metrics were used by professionals within their own communities.  Here are a couple of paragraphs from that review:
      Muller’s only causal success story, in which metrics directly improve outcomes, is the Geisinger Health System, which uses metrics internally for improvement. There ‘the metrics of performance are neither imposed nor evaluated from above by administrators devoid of firsthand knowledge. They are based on collaboration and peer review’. He quotes the CEO at the time claiming, ‘Our new care pathways were effective because they were led by physicians, enabled by real‐time data‐based feedback, and primary focused on improving the quality of patient care’ (111). At Geisinger, physicians ‘who actually work in the service lines themselves chose which care processes to change’.
      If we extrapolate from this example, it appears that metrics causally improve performance only when they are (1) routed through professional (not managerial) expertise, as (2) applied by people directly involved in delivering the service, who are (3) guided by nonpecuniary motivation (to improve patient benefits rather than receive a salary bonus) and (4) possessed of enough autonomy to steer treatment with professional judgment.
I'd be interested to know how the bibliometrics community would feel about limiting the use of metrics to internal information about performance with these four conditions.  Such a limit would certainly have helped the WashU case, since the metric of research expenditures could be discussed only within a community of common practice, and not applied by one (med school) group to another (Danforth Campus) in demanding accountability.
Another historian, John Carson, gave a keynote address at the ISSI conference that discussed the historical relation between quantification and scientific racism, calling for "epistemic modesty" in our application of these techniques.  I agree.  Though I can't discuss it here, I also hope we can confront our atavistic association of quality with ranking, and of brilliance with a small elite.  The scale of the problems we face demands it. In the meantime, don't let someone use a metric you know is junk until it isn't.