Standing on the shoulders of programmers
Free and open-source software is growing to be a powerful tool in academic research, helping scientists to collaborate better and work smarter. Achintya Rao investigates how such software is being used in physics research, and its role in the wider open-science movement
Twenty-three thousand. According to computer scientist Katie Bouman, that is how many people were involved in creating the first ever image of a black hole, taken by the Event Horizon Telescope (EHT) in 2019. Not all of these contributors are formally members of the EHT collaboration (whose numbers are in the hundreds) – the vast majority are those who write, maintain and support the free and open-source software tools that the researchers used in their work.
Bouman became the face of the EHT, after a photo of her delighted grin at seeing her work in action went viral. Code that she had written was part of the imaging software pipeline that extracted that famous photograph. But to Bouman, her contribution was only possible courtesy of software that was shared openly. “We would be getting nowhere if we didn’t have these kinds of tools that other people in the community have built up and have made free to use,” she told Physics World from her home in California, US. “We’re very, very thankful for everything that other people have done.”
Across the Atlantic, Suchita Kulkarni, a particle physicist at the University of Graz in Austria, who specializes in phenomenology of dark matter, agrees. “The reason phenomenology works, the reason people can look at exciting things at the Large Hadron Collider,” she says, “is because we have open-source software that is freely available under creative licences.”
Free and open-source software (FOSS) allows users to inspect the code, modify it and redistribute it with few or no restrictions. The “free” in FOSS thus refers to these freedoms, not to monetary cost. This makes FOSS particularly powerful in research, enabling collaboration between scientists working on code that they have modified, and today it is seen as an integral part of the wider open-science movement.
Code contributor
The term “software” can refer either to general user-facing tools and applications or to programs and analysis code used for specific tasks. Seen through this prism, there are at least four kinds of people who contribute to open-source software, with each group having its own motivations and reasons for doing so. First, you have the authors and maintainers – both paid and volunteer – of general-purpose software tools, including programming languages such as Python, Julia and R. Then there are those who write specialized software for certain domains of research, such as the Astropy library for astrophysics, or ROOT and Pythia for particle physics.
Third, you have scientists who write analysis code for their research that they then share openly, enabling others to learn from their work and apply the same code to different analyses. And finally, you have hobbyists, who contribute to open-source code on a one-off or regular basis, to develop their skills or to help maintainers of software they use. The boundaries between these are often fluid, and many maintainers of important software tools started out as hobbyist contributors.
Switzerland-based particle physicist turned software engineer Tim Head began using the open-source Python language as a physics student. According to him, a large part of the appeal of open-source tools is that they are usually distributed at no cost to the user. “The one thing everybody agrees with is that the easier it is for people to start using your software the better. With free and open-source software, you can just start using it today. You don’t have to ask your boss and then the purchasing department and then wait six months for them to negotiate a deal.”
After learning to program in Python, Head began to use a greater number of open-source tools during his PhD, including “scikit-learn”, a Python library for machine learning. Following a Twitter exchange with one of the developers of scikit-learn, he attended a workshop in Paris, aiming to make a contribution of his own. “I spent all afternoon fixing two typos in the documentation,” he laughs. But it sparked something in him.
Head got together with his fellow junior researchers on the LHCb collaboration at CERN to set up an open training programme to acquaint new members of the collaboration with some of the internal software tools used for performing physics analyses. Over time, he got involved with Mozilla Open Leadership, a mentorship programme for leading open-source projects, and started open-source projects of his own. Today, Head works as a senior engineer at Skribble, a digital signature platform. He is also a Distinguished Contributor to Project Jupyter, which provides some of the most popular tools for data analysis across domains of research.
Open for the sake of openness
Jupyter notebooks are the best-known product of Project Jupyter. They are interactive computational tools enabling “literate programming”, a concept in which one records descriptive prose in a human language, together with related snippets of code that perform bite-sized tasks. You can mix your narrative – what you did and why, what steps you took that led to dead-ends – with the actual code you used, along with the tables and plots it generated. You can then share the notebook with a colleague – or indeed with the wider research community. As long as they have sufficient computing resources and the same freely available open-source libraries installed, they can reproduce your research at the click of a button or change the parameters you chose and run a modified analysis (see box, below).
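To give a flavour of what a notebook cell might contain, here is a minimal sketch in Python. It is an illustration only and is not taken from any real analysis: the measurements, the `summarise` function and the `cut` parameter are all hypothetical. The comments play the role of the narrative, and the last few lines show how a colleague could change one parameter and rerun the same step.

```python
import statistics

# Hypothetical measurements, invented for this illustration.
# In a real notebook these would be loaded from a shared data file.
MEASUREMENTS = [9.8, 10.1, 9.9, 10.4, 9.7, 10.0]


def summarise(values, cut=0.0):
    """Return the mean and standard error of the values above `cut`."""
    selected = [v for v in values if v > cut]
    mean = statistics.mean(selected)
    stderr = statistics.stdev(selected) / len(selected) ** 0.5
    return mean, stderr


# Narrative: first we look at every measurement, with no selection applied.
mean, stderr = summarise(MEASUREMENTS)
print(f"all data: {mean:.2f} +/- {stderr:.2f}")

# Narrative: a colleague wonders what happens if the lowest reading is
# excluded, so they change a single parameter and rerun the cell.
mean_cut, stderr_cut = summarise(MEASUREMENTS, cut=9.75)
print(f"cut at 9.75: {mean_cut:.2f} +/- {stderr_cut:.2f}")
```

Because the selection is a parameter rather than something buried in the code, a reader can test a different choice without rewriting the analysis.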
While it is one thing to share your analysis code with your collaborators, making your code available publicly is a different matter. After all, many scientists who write code do so without professional-level training in the best practices of software development. Some may therefore be reluctant to put their work out there for experts to scrutinize and criticize.
That’s not just an abstract concern, as Bouman found to her cost. The EHT team released its imaging tools when it announced its observation of the black hole. After Bouman’s photo went viral, Internet trolls used this openness – which included information about individual contributions to the collaboration’s analysis – to disparage her input. Despite being a highly qualified engineer and computer scientist, Bouman found herself on the receiving end of harsh, often misogynistic comments about her coding skills. “It’s nerve-racking to put your code out there,” she says, “especially when people are inspecting it and going down deep into things that you did years and years ago.”
So what made the EHT researchers want to share their work so openly? “I think the main goals were transparency, reproducibility and having other people be able to test our methods and code,” Bouman remarks. “We wanted to be as transparent as possible with such an important result. We wanted our ideas to be improved on and we wanted people to use them in different applications.”
Kirstie Whitaker, director of the Tools, Practices and Systems research programme at the Alan Turing Institute in London, UK, repeats the refrain from the “Public Money, Public Code” campaign. “Let go of the idea that you’re giving away this knowledge for free, because actually you’re giving away knowledge that was in many cases paid for by taxpayers,” she says. “It’s not proprietary work, it’s work that should belong to the people who paid for it.”
Infrastructure as commons
The notion of sharing extends beyond software itself – indeed, it also embraces the services used to support the writing, dissemination and preservation of code-based tools. One such service is Zenodo, an open-source platform developed at CERN. “Zenodo is a platform to which anybody around the world can upload something that they consider a research object worth sharing,” explains Tim Smith, who leads the group in CERN’s IT department that is responsible for user-facing services. These objects could be a figure, a presentation, some data or even code. “The platform publishes it, issues a DOI [digital object identifier] with a guarantee to keep it forever, and makes it available for download to anybody who wants it.” With services like Zenodo, researchers can upload every version of their software libraries or analysis code, knowing that it will be available at the end of a DOI for the foreseeable future.
Another service that is gaining popularity is Binder, run by Project Jupyter. Binder allows you to take a static Jupyter notebook stored on the web, and make it interactive, without having to install any software on your computer. Everything runs in a web browser. In fact, you can even point Binder to properly configured Jupyter notebooks stored on Zenodo. Shortly after announcing the observation of gravitational waves, the LIGO team shared a Jupyter notebook containing a simplified version of their analysis. With a single click, anyone can, in principle, rerun the LIGO analysis in their browser (bit.ly/3iiqso3).
While Zenodo relies on the big-data storage facilities at CERN, Binder requires considerable cloud computing to function. Private cloud providers, such as Google Cloud Platform in the US and OVH in Europe, are the major donors of cloud services to Binder. Increasingly, however, research institutions such as the Alan Turing Institute have joined the fray, by sharing their own cloud infrastructure for Binder. “These institutes use the Binder software internally and also run a public instance of it,” Head explains. “You have to have benefactors like that. The more diverse and smaller the individual contributions are, the more resilient the project is.”
For Whitaker, who established the relationship between Turing and the Binder team, this desire to contribute internal infrastructure to external projects goes back to the idea of the commons. “If everyone plays a small part, you really can do huge and amazing things,” she says. “Where institutions like Turing come in is we just have to do it. By one person making it happen, it makes it more likely for others to do so.”
Citations as currency
Until recently, the role of code and those who write it has not been recognized within academic circles. Kulkarni, who both benefits from open-source software and contributes to it herself, is quite explicit about it. “Traditionally,” she says, “there have been very few positions in academia that have been given to software developers. It’s my opinion that we should have done a bit more to acknowledge the work of tool developers.”
Whitaker offers a simple solution to the problem of acknowledgement. “Cite the freaking software! If everybody recognized all of the shoulders that they were standing upon every time they were doing any part of their work, and they saw what has traditionally been invisible labour, we’d get there immediately.”
Citing software is a relatively new idea and Whitaker acknowledges that authors of software must make it easy for their work to be cited. One way for software developers to do so is to upload a given version of the code to a repository like Zenodo, and then display the DOI that is assigned to it. But scientists may wish to cite work that is peer-reviewed in some form. The Journal of Open Source Software (JOSS) was established precisely to address this, and conducts a fully open and transparent peer review of research software tools.
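As a rough illustration of how a DOI makes a release citable, the Python sketch below assembles a citation string from a handful of metadata fields. Everything in it is an assumption made for the example: the `SoftwareRelease` class, the `format_citation` function, the author names and the DOI are placeholders, and the layout of the citation is one plausible style rather than a standard prescribed by Zenodo or JOSS.

```python
from dataclasses import dataclass


@dataclass
class SoftwareRelease:
    """Metadata for one archived version of a piece of research software."""
    authors: list
    title: str
    version: str
    year: int
    doi: str


def format_citation(release):
    """Build a human-readable citation ending in a resolvable DOI link."""
    names = ", ".join(release.authors)
    return (
        f"{names} ({release.year}). {release.title} "
        f"(version {release.version}). https://doi.org/{release.doi}"
    )


# Placeholder values: the authors, package name and DOI are invented.
release = SoftwareRelease(
    authors=["A. Researcher", "B. Developer"],
    title="example-analysis",
    version="1.2.0",
    year=2021,
    doi="10.5281/zenodo.0000000",
)

citation = format_citation(release)
print(citation)
```

Tying the citation to a specific version matters because each archived release receives its own identifier, so a reader can retrieve exactly the code that was used.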
“JOSS is a hack on the system,” says Juanjo Bazán, an astrophysicist from the Center for Energy, Environmental and Technological Research in Madrid, Spain. “The founders of the journal noticed that in the research world the role of software developers is not really covered by academia’s credit system; they are not credited as authors of papers.” Anyone who writes a software library that is used in research can send a one- or two-page submission to JOSS, which peer reviews the quality of the software, determines whether it is well written and checks any performance claims made.
The entire process from submission to publication in JOSS happens publicly on GitHub, a web platform for sharing code and collaborating on its development. Submissions that are considered legitimate software papers are not consigned to the discard pile easily. Reviewers, who are not anonymous, openly leave their feedback for the software authors, with the objective of getting the authors to incorporate the suggestions and improve the software. Bazán, who is also a research software developer, was so enamoured of JOSS after submitting a paper to it that he now volunteers as an editor for astrophysics submissions. “JOSS has become popular among research developers,” he says. “We have just published our 1000th paper.”
Selective recognition
Tim Head does not believe that citations of software alone will solve the problem. He notes, first, that scientific software libraries are themselves often built upon existing tools or lower-level libraries – it’s software turtles all the way down. This raises the question of how many levels deep one must go. But on a more practical note, he remarks that having citations for a software paper does not mean recognition – and, more importantly, career progression – will be forthcoming. “A friend of mine is a core contributor to scikit-learn,” he says. “The team wrote a paper about it and it has a very large number of citations. But when he applied for some tenure-track jobs, he was told they would ignore this paper because it’s not ‘real’.”
Smith concurs that flaws in the system persist. “We have a reward system that is based on essentially a 300-year-old process. We have to change that reward system and get it to understand and acknowledge contributions in a digital age.” Those contributions, he explains, are diverse and encompass more than just the analysis process. “Many of our universities and research institutes are already changing their processes to start acknowledging these other contributions and how it all contributes to the advancement of science. All of those contributions should be valued.”
Indeed, the diversity of contributors to open-source software may mean that people seek diverse forms of recognition. Bazán notes that for hobbyists, a simple acknowledgement can be enough. “In April,” he explains, “GitHub talked with NASA and asked the agency to list all the libraries used for the code of the Ingenuity Mars helicopter. Anyone who made a contribution to those libraries now has a small Mars helicopter badge on their GitHub profile page. Maybe that’s enough for some.”
On the other hand, some forms of selective recognition within academia have issues of their own. Take the example of two software developers of the Pythia and Herwig Monte Carlo simulators used in particle physics, who received an award this year from the European Physical Society for their contributions on the software side. Although she appreciates the value in this, Kulkarni is nevertheless critical of what she believes to be the romanticized version of physics as a lone-wolf field, because the award recognized only two people out of dozens of developers.
“You want to have a leader,” Kulkarni says. “One person who is going to advance the field. You see it when people write recommendation letters; they don’t say someone is a great team member, they say they are a great leader. But we have a contradiction between recognizing the value of team work and benefiting from the value of team work.”
Fortunately, a cultural shift is in the making. Research is moving towards greater openness and transparency. With more and more researchers like Bouman publicly acknowledging and thanking the developers of open-source software, the day may not be far off when software receives both recognition and increased institutional support for its foundational role in modern science.