The tyranny of metrics in science evaluation

Here’s a little-told story about the origins of Google. Larry Page and Sergey Brin drew on scientometrics research, which judges scientific influence via citations, to develop the PageRank algorithm, which ranks websites so that the most relevant appear first in search results. They were quite pleased with the results. However, as Google grew in prominence and early placement in its search results became linked to monetary gain, things started going awry. People began making fake web pages linking to their products to boost their websites’ rankings, trading back-links with other websites or buying them outright. Soon the tool was subverted, and rankings became more like competitions in who could game the system best. Page and Brin were at a loss. Luckily, they had another inspiration from the scientometrics world. Grants and papers were often manually rated by a tiny set of reviewers who did the work for free and agreed with each other so little as to give the whole process a fun lottery component. Held back by rather quaint ideas about the ethical treatment of human labourers, Page and Brin did not want to make scientists slave away rating each and every website for free, so they switched to pigeons. And this is why we still use Google for search today: its founders eschewed automated algorithmic metrics, instead relying on beak-ual curation of web pages by expert birds.

But, what are birds?

Unfortunately, the scientific community, despite providing the initial inspiration for both link-based influence analysis and pigeon training, has fallen behind on these innovations. Until now. In the open science community, complaints about hiring and promoting people based on metrics such as the Impact Factor, the h-index, and the REF are commonplace. They are gameable, they promote a focus on quantity over quality, they are symptomatic of a science culture where we’re all too busy to read one another’s work, and so on.

Has the time for the big switch to pigeons finally come?

Actually, probably not. Although scientific publication counts are not rising at the same rate as web pages, the problem of influence analysis is becoming ever more daunting all the same. Job applicants for tenured positions routinely number in the hundreds. And I’d pity the birds that would have to get to work every time I check my h-index.

We want to question the common conviction that qualitative assessment is best. Reviewers are human and hence carry human biases about sex, race, age, and whether a job applicant should really be sporting such a shirt, biases that are arguably more unjust than some forms of algorithmic bias. Moreover, even reviewers instructed to pay attention to whether a candidate’s work is robust, open, and replicable may be distracted by shiny name-brand journals and large grant sums in the CV. Especially so if the latter information is easily accessible and prominent, while the former takes time, effort, and knowledge to gather. When trying to convince your department not to focus on the Impact Factor and h-index, do you really want to suggest they do more unrewarded work? How has that worked out for you in the past?

We think we need new metrics that actually measure what we want in science. Refusing to do so because of some starling-eyed dream about working birds will leave the battlefield to the existing contenders, which we all know suck. So, we came up with a list of questions that might govern how to pick such a new metric.

A couple of worker ducks evaluating job applicants.

What to do about measures being gamed as soon as they become targets?

Measuring self-citation using the patented 100% CI folding rule

This adage, often called Goodhart’s law (when a measure becomes a target, it ceases to be a good measure), is evidently a big problem to worry about. Google is constantly trying to deter link spammers and other black-hat search engine optimization. Even the Big Mac index has been gamed.

Google’s answer to this is a secretive system with numerous penalties and other ways of punishing cheaters. Obviously, secrecy in ranking runs counter to the idea of a fair, transparent science. Yet, given that there are still fewer papers than web pages, and usually some initial human filtering, maybe there are some SEO tricks we need not worry about. Although self-citation, self-plagiarism, and citation trading are rife, I have not yet seen a paper that is just the words “groundbreaking paradigm shifter” linked to Robert Sternberg’s website, repeated endlessly.

Evolving metrics

We think one major answer to this question is that metrics have to evolve, together with the strategies designed to exploit them. Given that such metrics should take the shape of a website, updating the metric should be easier than updating previously published papers to include more mentions of paradigm shifts. This asymmetry makes things a little easier for science than for Google.

Penalties

Current metrics at most exclude self-citations from consideration or implicitly penalise certain legitimate behaviours in the name of making cheating harder. However, that’s obviously unfair too. Some self-citations are legitimate indicators of a research career that builds on previous results. Nonetheless, excessive self-citation is cheating, and it should hurt your score more than moderate self-citation can occasionally benefit it. The same goes for the h-index. It primitively punishes people who publish a plethora of papers few people peruse, but at the cost of punishing people who author few, highly influential papers. Who ever said that a consistent citation rate is the mark of a good scientist? Without real, explicit penalties, the ranks will also not reshuffle enough for the usefulness of a new index to be noticeable (because performance on most trackable metrics will become correlated via rich-get-richer dynamics).
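To make the asymmetry concrete, here is a minimal sketch of what such a penalty could look like. The allowance and penalty weight are made-up parameters, and `penalised_citations` is a hypothetical helper, not an existing metric.

```python
def penalised_citations(total_citations, self_citations,
                        allowance=0.10, penalty_weight=3.0):
    """Toy score illustrating an asymmetric self-citation penalty.

    Self-citations within the allowance count like any other citation
    (building on your own work is legitimate); self-citations beyond it
    are subtracted several times over, so excessive self-citation hurts
    more than moderate self-citation helps. The allowance and weight are
    illustrative, not calibrated.
    """
    if total_citations <= 0:
        return 0.0
    allowed_self = allowance * total_citations
    excess_self = max(0.0, self_citations - allowed_self)
    return total_citations - penalty_weight * excess_self
```

With these invented numbers, a record of 100 citations of which 8 are self-citations keeps its full score, while one where 40 of the 100 are self-citations loses most of it.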

How to handle preprints, open data, and preregistrations?

Arguably, this is all information that can be tracked in the citation graph. Published a frequently reused dataset? That should count. However, simple counts of preregistrations and preprints are easily inflated. But why not experiment with a factor that boosts the score more for highly cited preregistered studies, or for studies whose data is linked and open?
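As a hedged illustration, the multiplier below scales a paper’s citation count by invented bonus factors for preregistration and open data; because it multiplies rather than adds, highly cited preregistered studies gain more in absolute terms. The function and its parameters are hypothetical, not a worked-out proposal.

```python
def adjusted_citations(citations, preregistered=False, open_data=False,
                       prereg_bonus=1.2, data_bonus=1.3):
    """Toy multiplier: citations of preregistered studies and of studies
    with linked, open data count a bit more. The bonus factors are
    invented for illustration, not empirically justified."""
    factor = 1.0
    if preregistered:
        factor *= prereg_bonus
    if open_data:
        factor *= data_bonus
    return citations * factor
```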

How to count influence on Wikipedia, popular books, and news articles?

Arguably, this should count too. Altmetrics have been fairly focused on Twitter, Facebook, and the news, which are very fickle and where fake influence is easily generated using bought followers. But given that Wikipedia is a source of knowledge for scientists and laypeople alike, and fairly hard to game, an article’s citation therein (perhaps indirectly, via a review) may be a useful mark of influence.
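One rough, non-authoritative way to start tracking this would be to count how many English Wikipedia articles mention a paper’s DOI via the public MediaWiki search API. The sketch below assumes the `requests` library and only catches direct, exact-string mentions; indirect influence via a cited review would still need the citation graph on top.

```python
import requests

def wikipedia_mentions(doi):
    """Rough sketch: count English Wikipedia articles whose wikitext
    contains the given DOI, via the public MediaWiki search API."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": f'insource:"{doi}"',
            "srlimit": 1,  # we only need the total hit count
            "format": "json",
        },
        headers={"User-Agent": "metric-sketch/0.1 (example only)"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["query"]["searchinfo"]["totalhits"]
```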

How to handle vicarious influence?

Was your paper the cornerstone of the argument in an influential review? Well, bad luck if you didn’t also write the review. A new metric should not privilege review articles; instead, highly cited review articles should increase the score of the papers they cite. This is a notable aspect of PageRank, and also of the Eigenfactor project. It would also reward people who share datasets, software, and scripts that actually get reused over people who don’t make their work easily reusable, a contribution that is otherwise hard to quantify.
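As a minimal sketch of the idea (a classroom-style power iteration, not Google’s production PageRank or the actual Eigenfactor computation), the toy function below propagates credit along a citation graph so that a citation from a highly ranked review is worth more than one from an obscure paper. The graph format and parameters are invented for illustration.

```python
def propagate_credit(cites, damping=0.85, iterations=50):
    """Minimal PageRank-style sketch over a toy citation graph.

    `cites` maps each paper to the list of papers it cites. Credit flows
    from citing papers to cited papers, so being cited by a highly
    ranked review boosts your own score.
    """
    papers = set(cites) | {p for refs in cites.values() for p in refs}
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in papers}
        # papers that cite nothing spread their credit evenly (dangling nodes)
        dangling = sum(score[p] for p in papers if not cites.get(p))
        for citing, refs in cites.items():
            if refs:
                share = damping * score[citing] / len(refs)
                for cited in refs:
                    new[cited] += share
        for p in papers:
            new[p] += damping * dangling / n
        score = new
    return score
```

Run on a toy graph such as `{"review": ["paper_a", "paper_b"], "paper_b": ["paper_a"]}`, paper_a ends up with the highest score even though it wrote no review, which is exactly the vicarious influence we want to capture.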

How many scores?

Should we limit ourselves to scoring who is “the best scientist”? Arguably, since priorities for job decisions differ, we might consider having several scores indicating aspects such as transparency, team efforts, solo efforts, strength as a project lead or first author, field-specific expertise, and so on.
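To make that concrete, a multi-score profile could be as simple as the hypothetical record below; the field names and their interpretations are ours, not an established standard.

```python
from dataclasses import dataclass, field

@dataclass
class ScientistProfile:
    """Hypothetical multi-score profile instead of a single
    'best scientist' number; all fields are illustrative only."""
    transparency: float = 0.0          # e.g. share of outputs with open data/code
    team_contribution: float = 0.0     # influence of large collaborative projects
    solo_contribution: float = 0.0     # influence of single- or few-author work
    lead_strength: float = 0.0         # performance as project lead / last author
    first_author_strength: float = 0.0
    field_expertise: dict = field(default_factory=dict)  # field -> score
```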

What to rank?

We think new metrics should rank individual papers, institutions (universities, departments, journals, …), and people, using one consistent system. This is where most existing improvements fall short. The Eigenfactor can only (easily) be used to assess journals, although a publication exists that applies it to authors on the old preprint repository SSRN. Altmetrics only cover individuals (Impactstory) and articles.
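One hedged way to get such consistency is to treat per-paper scores as the base unit and aggregate them upwards. The sketch below, with hypothetical data structures and an arbitrary equal split of credit among co-authors, shows the idea; it is one of many possible aggregation choices, not a recommendation.

```python
from collections import defaultdict

def roll_up(paper_scores, authorship, affiliation):
    """Toy illustration of one consistent system: per-paper scores are
    the base unit, author scores aggregate their papers, and institution
    scores aggregate their authors.

    paper_scores: {paper_id: score}
    authorship:   {paper_id: [author_id, ...]}
    affiliation:  {author_id: institution_id}
    """
    author_scores = defaultdict(float)
    for paper, authors in authorship.items():
        for author in authors:
            # split credit equally among co-authors (one of many possible choices)
            author_scores[author] += paper_scores.get(paper, 0.0) / len(authors)
    institution_scores = defaultdict(float)
    for author, score in author_scores.items():
        institution_scores[affiliation.get(author, "unknown")] += score
    return dict(author_scores), dict(institution_scores)
```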

Where should it be available?

Online, for free, obviously. Most proposed improved indices do not come with a website. Is this a lack of confidence on the part of the authors, a lack of funding, or something else?

Why doesn’t this exist?

Probably not because nobody ever had these obvious ideas, but because the citation graph across papers is not open. However, with Sci-Hub covering almost the entire scientific literature, maybe it’s finally possible for someone to create this public good. Make it happen, nerds! (Ideally before I have to apply for my next job.)