Science As Pull Requests

I recently discovered an article about the way Airbnb manages its knowledge. In “Scaling Knowledge”, Chetan Sharma and Jan Overgoor describe the five key tenets they try to follow in their data science research:

  • Reproducibility — There should be no opportunity for code forks. The entire set of queries, transforms, visualizations, and write-up should be contained in each contribution and be up to date with the results.
  • Quality — No piece of research should be shared without being reviewed for correctness and precision.
  • Consumability — The results should be understandable to readers besides the author. Aesthetics should be consistent and on brand across research.
  • Discoverability — Anyone should be able to find, navigate, and stay up to date on the existing set of work on a topic.
  • Learning — In line with reproducibility, other researchers should be able to expand their abilities with tools and techniques from others’ work.

The way they operationalize these tenets is very intriguing:

“At the core there is a Git repository, to which we commit our work. Posts are written in Jupyter notebooks, Rmarkdown files, or in plain Markdown, but all files (including query files and other scripts) are committed.”

and

“To prevent [low quality research], our process combines the code review of engineering with the peer review of academia, wrapped in tools to make it all go at startup speed. As in code reviews, we check for code correctness and best practices and tools. As in peer reviews, we check for methodological improvements, connections with preexisting work, and precision in expository claims.”

When I read the article, all I could think was “wait…isn’t this how science should always work?”. If you think about it, the current publication process in academia is like a crude, paper-based version of this: Someone investigates something, sends a draft (pull request) to the editor (maintainer) of a journal (git repo), who initiates the peer review (code review). Adjustments get made (further commits) and the article is published in the journal (merged into the body of knowledge of the field).

But as we all know, the current process is riddled with problems. It’s expensive and unfree, plagued by bias, opaque, and far too often completely non-reproducible.

Sean Illing recently said that locking research up behind paywalls has led academia to marginalize itself, and Patrick Collison asks what the successor to the scientific paper and the scientific journal is going to be. I think that a git-based model of science could not only succeed the scientific paper as we know it and make science more accessible, but also solve many of the other problems of scientific publishing along the way. Let me explain what that could look like:

Leaving issues of governance aside for a moment, a group of scientists in a given field or sub-field could start a public Git repository. I work in political science, so let’s call it the “Open Political Science Repository” (OPSR).

Now, every paper would be a commit to the master branch of OPSR. Because it’s an open Git repository, everything would be inherently transparent. You could see every change, every discussion, every step in the process. You could even reference papers by their commit ID, solving the problem of badly formatted bibliographies. And everyone would be able to download the complete knowledge of the field by simply cloning the repo.
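
To make the commit-ID citation idea concrete, here is a minimal sketch in Python. It assumes a hypothetical OPSR convention that a paper’s commit message begins with the paper’s title, so the subject line can double as a human-readable reference:

```python
import subprocess

def cite(commit_id: str) -> str:
    """Resolve a citation given as a commit ID into a readable reference.

    Assumes the (hypothetical) convention that a paper's commit message
    starts with the paper's title.
    """
    subject = subprocess.run(
        ["git", "show", "--no-patch", "--format=%s", commit_id],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return f"{subject} (OPSR {commit_id[:12]})"

# e.g. cite("4f2a9c1") -> "Some Paper Title (OPSR 4f2a9c1)"
```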

One problem with this model, however, would indeed be the transparency. In programming, discussion of a pull request can rest largely on merit: either the code runs and passes all tests, or it fails. But in science things are not always so clear cut. That is why we have double-blind (and hopefully soon triple-blind) peer review: we want to eliminate all potential bias in acceptance based on who the author is. How would we solve this?

My idea would be to hide the identity of the author, but keep the identities of the editor and reviewers transparent. This would eliminate bias against the author, create incentives for constructive criticism on the reviewers’ side, and, more importantly, allow reviewers to gain reputation in the field by providing good feedback. I’ll return to reputation in a minute.

How would we hide the identity of the author? Every pull request would be made from an anonymous account, and the initial commit of a paper could include an encrypted file that contains the information for each author. That way, once the pull request is merged into master (i.e. the paper is accepted for publication), the author could de-anonymize the paper by decrypting the file, thus proving their ownership.
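
Here is a minimal sketch of that scheme in Python, using symmetric encryption from the cryptography package; the file name, the key handling, and the author fields are all illustrative assumptions:

```python
import json
from cryptography.fernet import Fernet

# Before submitting: the author encrypts their identity and commits only
# authors.enc alongside the paper. The key stays on the author's machine.
key = Fernet.generate_key()
authors = json.dumps([{"name": "Jane Doe", "orcid": "0000-0000-0000-0000"}])
with open("authors.enc", "wb") as f:
    f.write(Fernet(key).encrypt(authors.encode()))

# After the pull request is merged: the author publishes the key.
# Anyone can now decrypt authors.enc, and since only the author could
# have held the matching key, this proves ownership of the paper.
with open("authors.enc", "rb") as f:
    revealed = json.loads(Fernet(key).decrypt(f.read()).decode())
```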

Now, one problem in current scientific publishing is the mix of incentives created by relying on published articles not only for the new insights they produce, but also as a measure of their authors’ reputation. The number of articles written and where they are published is incredibly important for every scientist’s career, from tenure decisions to funding. So how would this model handle that aspect?

Tackling the number of published articles is easy: swap it for the number of accepted pull requests and you’re set. Tracking citations would also be very simple: count the number of times a commit ID is mentioned in other commits (articles). But you could also add another measure of a researcher’s contribution to the field: their participation in the discussion of pull requests. Currently this work is unseen and unrewarded; taking it into the open would change that.
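
Counting those commit-ID citations could be as simple as the following sketch; it assumes, hypothetically, that papers live as Markdown files in the repository and cite each other by abbreviated commit hash:

```python
import re
from pathlib import Path

# Abbreviated or full commit hashes, as they would appear in a citation.
HASH = re.compile(r"\b[0-9a-f]{12,40}\b")

def citation_counts(repo_root: str) -> dict[str, int]:
    """Count how often each commit ID is mentioned across all papers."""
    counts: dict[str, int] = {}
    for paper in Path(repo_root).rglob("*.md"):
        for commit_id in HASH.findall(paper.read_text(encoding="utf-8")):
            counts[commit_id] = counts.get(commit_id, 0) + 1
    return counts
```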

Using the pull-request model of science, we could also easily incorporate pre-registration: the first commit of every pull request could contain the pre-registration information as well as the encrypted author information. After an open discussion of the pull request and a decision to accept it on the basis of the pre-registered design, the researcher could run the study and then publish the results as a new commit to the same pull request.
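
The repository could even enforce this ordering mechanically. A sketch of such a check, with the file names as assumptions:

```python
import subprocess

def first_commit_has_prereg(branch: str, base: str = "master") -> bool:
    """Check that a submission branch starts with a pre-registration.

    Assumes the (hypothetical) convention that the first commit contains
    preregistration.md and the encrypted author file authors.enc.
    """
    # Oldest commit on the branch that is not yet on master.
    first = subprocess.run(
        ["git", "rev-list", "--reverse", f"{base}..{branch}"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0]
    files = subprocess.run(
        ["git", "show", "--name-only", "--format=", first],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return ("preregistration.md" in files) and ("authors.enc" in files)
```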

And, finally: for quantitative analyses, every repository could implement continuous integration procedures that automatically check whether the analysis actually returns the results reported in the paper, whether the data is structured appropriately, and so on.
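
One such check could look like this pytest-style sketch, where the script name, output files, and tolerance are all assumptions:

```python
import json
import subprocess

def test_results_reproduce():
    # Re-run the submission's analysis from scratch...
    subprocess.run(["python", "analysis.py"], check=True)
    # ...then compare its output against the numbers reported in the
    # paper, which the author ships as a small machine-readable file.
    with open("output/results.json") as f:
        produced = json.load(f)
    with open("paper/claimed_results.json") as f:
        claimed = json.load(f)
    for key, value in claimed.items():
        assert abs(produced[key] - value) < 1e-6, f"{key} does not reproduce"
```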

I think there is a lot to this model, and every field in science could benefit from it. I’m very interested in what others think about this process, so please send me an email or contact me on Twitter!

Lukas Kawerau
Data Engineer, Writer