Reproducibility and Preservation: Taking the Pulse

Our final community meeting in this series, on preservation and reproducible research, took place on July 29, 2021 (see meeting notes, slides, and motivating questions). About 20 people, representing a variety of stakeholders, participated in the conversation.

The top priorities the group singled out on the topic of preservation and reproducible research were responsibility, maintenance, and timeframe (see notes). The summary below captures the main themes of the conversation, highlighting key questions.

Reproducibility presents new challenges to preservation

  • Does preservation of computationally reproducible research require a paradigm shift in the field of digital preservation?

Digital preservation is a “series of managed activities, policies, strategies and actions to ensure the accurate rendering of digital content for as long as necessary, regardless of the challenges of media failure and technological change.” A fundamental target of digital preservation is the digital object, and a primary concern is bit-level preservation, “literally preserving the bits forming a digital object.” The idea is to make sure the integrity of the bits is intact over time.

Where computational reproducibility is concerned, as pointed out by participants, the target for preservation is better thought of as a “performance.” In the context of reproducibility, we are interested in preserving the execution of a computational process, often as it relates to specific input and output data. This also requires the preservation of the various components that enable the performance or the process, including the data, the software, the state of the computer, and so on.
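To make this concrete, here is a minimal, hypothetical sketch (in Python) of what capturing a "performance" might look like: it runs a computational step and records the components the execution depended on – the command, fixity information for the inputs and outputs, and a snapshot of the environment. The file names and script are illustrative only, not a prescribed format.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone

def sha256(path):
    """Return the SHA-256 digest of a file, for bit-level integrity checks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_performance(command, inputs, outputs, manifest_path="performance.json"):
    """Run a computational step and record the components of the 'performance':
    the command, fixity of inputs/outputs, and the state of the environment."""
    subprocess.run(command, check=True)
    manifest = {
        "command": command,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {p: sha256(p) for p in inputs},
        "outputs": {p: sha256(p) for p in outputs},
        "environment": {
            "os": platform.platform(),
            "python": sys.version,
        },
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

# Hypothetical usage: a script that reads data.csv and writes results.csv.
record_performance(["python", "analysis.py"], ["data.csv"], ["results.csv"])
```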

Importantly, the nature and goals of different types of “performances” – executions of a computational process – vary, potentially affecting what needs to be preserved and how. For example, reproducing a computation that performs data analysis to verify a specific scientific claim is different from reproducing one that performs modeling and simulation, with implications for what needs to be preserved. Guidelines would be helpful on whether it is enough to preserve the documentation or metadata when reproduction no longer works, and on what documentation is needed to preserve different types of computation. Context is important, and guidelines and standards can help steer toward preservation actions that are appropriate for what the computation aims to reproduce.

Participants expressed support for approaches that aim to reduce dependencies as a way to facilitate both preservation and reuse. Efforts mentioned at the meeting that are specifically designed to support the preservation of computational research include emulation (e.g., EaaSI provides shared infrastructure for long-term access to emulated hardware, maintaining the contextual space of the software and systems required to reproduce computation) and packaging (e.g., RO-Crate is a lightweight approach to packaging research data with their metadata).
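As a rough illustration of the packaging approach, the following sketch writes a minimal ro-crate-metadata.json by hand, following the general structure of the RO-Crate 1.1 specification (a JSON-LD graph with a metadata descriptor and a root dataset). The file names and titles are hypothetical, and a real crate would usually be produced with dedicated tooling rather than hand-built JSON.

```python
import json

# A minimal ro-crate-metadata.json in the spirit of the RO-Crate 1.1 spec:
# a JSON-LD graph whose root dataset lists the packaged files.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # The metadata descriptor that marks this file as an RO-Crate.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # The root dataset: the packaged research outputs.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example reproducible analysis",  # hypothetical title
            "hasPart": [{"@id": "data.csv"}, {"@id": "analysis.py"}],
        },
        {"@id": "data.csv", "@type": "File", "name": "Input data"},
        {"@id": "analysis.py", "@type": "File", "name": "Analysis script"},
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)
```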

  • Are there sufficient opportunities for cross-fertilization between digital preservation experts and those implementing reproducible research policies, workflows, and tools?

It seems likely that a combination of these approaches, and a commitment by many stakeholders, will be needed. This is a recurring theme in digital preservation writ large; as the Digital Preservation Coalition notes, “digital preservation cannot be perceived as solely a concern for archives, libraries, museums and other memory institutions: it is a challenge for all who have an interest in creating, using, acquiring and making accessible, digital materials.” That said, it is often unclear who is responsible for the preservation of reproducible research (an issue that has come up in prior community conversations, e.g., on publishing and solutions).

  • Who should be responsible for preserving reproducible research?

Currently, preservation is usually attended to at the end of the research cycle, but there are relevant touch points along the entire process. While repositories “do” preservation, it is better when all the stakeholders work toward this goal. Preservation is better supported when research is done with an eye toward reproducibility from the beginning, when proper data management is performed throughout, and when curators are involved before the research is shared, among other things.

  • How do we develop and support a skilled and professional workforce of archivists and preservation experts to work alongside researchers?

Importantly, individuals assigned to preservation tasks must have the skills to perform those tasks (see the previous conversation about training) and supporting resources and infrastructure (e.g., guidelines, standards, policies, established practices, and tools).

Gaps in tools and infrastructure

With respect to tools, some may facilitate reproducibility but not lend themselves easily to preservation. For example, container technology is a popular solution for reproducibility but introduces dependencies that can be a challenge for preservation. At this time there is no standardized approach to documenting information essential for preservation, such as the operating system and runtime required to run the container, or how to connect to a data source that may be archived elsewhere.
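Since no such standard exists, the following is only an illustrative, ad-hoc “sidecar” record showing the kind of information a curator might capture alongside a container image; every field name and value here is hypothetical.

```python
import json

# An ad-hoc sidecar record for a container image, capturing the information
# the text above notes has no standardized home: the base OS, the runtime
# needed to run the container, and where the separately archived data lives.
container_record = {
    "image": "registry.example.org/lab/analysis:1.0",   # hypothetical image
    "base_os": "ubuntu:20.04",
    "runtime": {"engine": "docker", "min_version": "20.10"},
    "entrypoint": ["python", "analysis.py"],
    "data_sources": [
        {
            "mount_point": "/data",
            "archived_at": "https://doi.org/10.xxxx/example",  # placeholder DOI
            "fixity": "sha256:<digest of the archived dataset>",
        }
    ],
}

with open("container-preservation.json", "w") as f:
    json.dump(container_record, f, indent=2)
```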

Participants also expressed reservations about the development and use of commercial tools that enable and facilitate reproducibility. Commercial tools require licenses and are often “black boxes,” presenting challenges to long-term preservation and reuse. One idea might be to encourage commercial toolmakers to incorporate preservation-supporting functionality into their solutions. The recent GitHub integration with Zenodo can be thought of as a positive example of this. An example from the open source community is the EaaSI project’s concerted effort to contribute detailed documentation to Wikidata to ensure its value for the long term. As noted in previous conversations, a clear message on the community’s position about open vs. commercial reproducibility tools – and specifically as it pertains to preservation – can inspire the creation of guidelines for researchers, publishers, and repositories on selecting tools that lend themselves to preservation.

National infrastructure projects for data management and preservation offer benefits such as resources and funding. This kind of investment in repositories is encouraging, but more emphasis is needed on preserving a diversity of materials, such as software and compiled applications built for older operating systems (e.g., via Software Heritage and EaaSI).

Gaps in policy and culture

Participants pointed to gaps in policy and culture when it comes to preservation and reproducible research. These gaps need to be addressed if the community is interested in ensuring that responsibility for preservation is optimally distributed.

  • Do we need preservation policies that are specific to reproducibility, or do general policies (i.e., those on other research objects) suffice?

In terms of policies governing preservation and reproducible research, there is a need to involve experts from various disciplines from the beginning (researchers, technology developers, publishers, etc.). There is currently a gap in communicating preservation requirements to tool developers and to the publishers and funders who write policies. There are also inconsistent practices on the part of researchers, and varying levels of awareness and expertise across disciplines. Computationally heavy disciplines tend to understand reproducibility well and are able to communicate their needs to those developing systems, tools, and services. A community-wide effort to clarify requirements and establish standards would support more comprehensive and responsive preservation efforts.

The following topics could be clarified:

The difference between preservation of data and preservation of materials for reproducibility. Standards would support the scaling efforts necessary to move beyond bit-level preservation. Data-specific guidelines, e.g., on the legal or ethical implications to consider when preserving human-related data used in reproducible research, continue to apply and should be reinforced.

Preservation of the linkage between artifacts. What are the pros and cons of different approaches to preserving these links (e.g., the Renku tool, research compendia, Docker, ReproZip)? Making all research outputs FAIR can help in this regard (see the sketch after this list).

Preservation time horizon, or the duration for which the research remains reproducible. Some funders have requirements around timeframes for the preservation of research data. If these were extended to “reproducible research outputs,” a question arises about what is feasible in practice given current technological means and the constant evolution of computing. Most required timeframes for data preservation seem relatively short – could they become “indefinite”? Should they? Educating stakeholders about the limits of digital preservation is important.
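As a small illustration of the linkage topic above, the sketch below records one study’s artifacts under persistent identifiers and relates them using DataCite-style relation types. The identifiers are placeholders and the record format is hypothetical, not an established standard.

```python
import json

# Hypothetical linkage record tying the artifacts of one study together with
# persistent identifiers, so each object can point to the others even when
# they are preserved in different repositories.
linkage = {
    "paper": "https://doi.org/10.xxxx/paper",            # placeholder DOIs
    "dataset": "https://doi.org/10.xxxx/dataset",
    "code": "https://doi.org/10.xxxx/code",              # e.g., a Zenodo snapshot
    "container_image": "https://doi.org/10.xxxx/image",
    "relations": [  # DataCite-style relation types
        {"subject": "paper", "predicate": "IsSupplementedBy", "object": "dataset"},
        {"subject": "code", "predicate": "Requires", "object": "container_image"},
        {"subject": "dataset", "predicate": "IsReferencedBy", "object": "paper"},
    ],
}

with open("linkage.json", "w") as f:
    json.dump(linkage, f, indent=2)
```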

  • What should be the role of the ACM in advancing preservation and reproducible research?

General ACM guidelines on workflows that facilitate preservation of reproducible research would lead to consistency across authors and conferences. Practices such as working reproducibly and collaboratively and incorporating code review are already common in Computer Science. Each SIG could then adopt workflows that suit its needs while adhering to the general standards. Speaking in a unified voice, the organization is positioned to be a leader in this area.

A note: Thank you to everyone who has participated in this virtual conversation series! Please take a moment to fill out the feedback survey.

By Limor Peer and Vicky Rampin

August 6, 2021