Data selection for long-term preservation

What is long-term data preservation?

Data preservation means more than storing data safely: it also implies that data will remain accessible and reusable in the long term (for several years or even forever), ensuring:

  • Intellectual interpretability (by providing sufficient metadata and documentation)
  • Technical readability (for example, by using appropriate formats)
  • Integrity (by replication of the data and checksum usage)
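As an illustration of the checksum usage mentioned in the last point, here is a minimal sketch in Python. It assumes SHA-256 as the checksum algorithm and a simple in-memory manifest; the function names (sha256_of_file, build_manifest, find_changed_files) are hypothetical and not part of any repository's tooling. The idea is to record a checksum for each file at deposit time and to recompute it later, on each replica, to detect corruption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to the directory) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed_files(directory, manifest):
    """Return the paths that are missing or whose checksum no longer matches."""
    root = Path(directory)
    changed = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of_file(target) != expected:
            changed.append(rel_path)
    return changed
```

In practice the manifest would be stored alongside the data (and with each copy), so that an empty result from find_changed_files confirms that a replica still matches the original deposit.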

Long-term preservation (LTP) should be planned from the beginning of the project. The Data Management Plan (DMP) is a useful tool to describe the preservation strategies that the researchers would like to put in place, and to make monitoring easier during the project lifespan.

In order to preserve data correctly, appraisal and selection are needed to determine which data will ultimately be kept for long-term preservation and which will be eliminated. To structure a data preservation plan well, two questions must be answered:

  • What is worth keeping?
  • For how long?

Why preserve data?

Providing access to data with adequate metadata is a condition for ensuring the reproducibility of research results. Moreover, some data are unique and cannot be replaced, which makes providing long-term access to them even more important.

Long-term preservation (and accessibility) of data can also be a requirement of funding agencies for some research funding programs.

Who decides?

Deciding which data should be preserved and for how long is the responsibility of the research team. However, the different stakeholders of a project (funders, research institutions, publishers, etc.) might have specific requirements that should be considered when defining the LTP strategy.

What does preservation cost?

Preservation costs have to be considered and included in a research project budget as part of the general data management costs:

  • Data curation costs include the resources needed to manage data during the project, to prepare them before depositing them in a repository, and to act as the contact for the data thereafter.
  • Repository costs are the fees that repositories may charge for data deposition. These costs depend on several factors, including the dataset size (for large datasets they can be very high). More information can be found in the Data repositories and data journals section.

How to select data to preserve?

In order to deposit data in a repository, a set of conditions must be fulfilled:

  • Being the owners of the data (or having the consent of the involved stakeholders)
  • Complying with the data protection law (anonymisation, restricted access, etc.)
  • Ensuring data integrity and accessibility (no corrupted data, availability of software and hardware, etc.)
  • Providing appropriate metadata to ensure data intelligibility
  • Clarifying the conditions of reuse with an adequate license

It does not make sense to preserve data if any of the above conditions is not fulfilled. In general, good data management and curation throughout the whole project suffice to avoid such obstacles.

In Switzerland, there is so far no legal obligation to preserve or publish research data. However, several constraints can be imposed by the stakeholders of a research project:

  • Funders: some of them (such as the European Commission with Horizon 2020 and the Swiss National Science Foundation, as described in detail in the “Funder’s data requirements” page) require a DMP. One of its sections focuses specifically on data preservation strategies, helping researchers to plan them from the beginning of the project.
  • Publishers: an increasing number of them require that the data underlying the articles be made accessible, as indicated in the “Publisher’s requirements” page.
  • Partners: when a project is carried out in partnership with research teams outside EPFL, it is important to define who will own the data and where the data will end up once the project is completed. Private partners (if any) can also set conditions on the use of the data collected.
  • Data repositories: each repository has its own policies regarding the type of accepted data (disciplines, formats, size, etc.). The choice of a relevant one should be made early enough in order to meet its requirements adequately.

The choice of stakeholders often has an impact on data management. This has to be taken into account when looking for funders, repositories, partners, etc. For example, the Swiss National Science Foundation excludes for-profit data repositories.

If the prerequisites are fulfilled and the requirements of the stakeholders are not sufficient to determine which data to preserve, several more qualitative criteria can be applied:

  • Ethical issues: these can either restrict or encourage the publication of data. On the one hand, if the data could be misused, publishing them would be a bad decision. On the other hand, scientific ethics encourages data transparency, sharing and publication.
  • Value of the data (uniqueness, cost to harvest, links with other datasets, science trends, potential reuse, etc.)
  • Quality of data documentation and metadata
  • Quality and reliability of the sources and harvest methods

Preservation costs, especially data curation costs and repository fees, must also be taken into account when determining how much data to preserve and for how long.

Does preserving data mean publishing data?

Most often, datasets are preserved in order to be published, but publication is not always possible, for example when restrictions such as copyright or privacy issues apply. Sometimes, researchers decide to work further on a dataset and prefer to restrict access to the data during this stage.

It is also possible to ensure the preservation of a dataset while retaining control over its use by others. Depending on the chosen data repository, access restrictions, embargoes and sampling can be set up.

How long to preserve data?

Several repositories define the retention time of datasets, but most of them do not fix a limit. The relevant duration depends on the value of the data and, in particular, on the potential for reuse, which is likely to decrease over time. In general, 5 to 20 years of preservation seems reasonable.

Which data to preserve? Raw data or processed data? What about sampling?

Raw data are data in their original state at the time of collection. Processed data are data that have been transformed and used to address the research questions. Which should be preserved? It depends on the purpose of the preservation.

If checking the validity of the research results is required, at least the relevant raw data must be preserved. However, these are not sufficient to reproduce the results. The code and algorithms used to process the data also need to be provided, as well as sufficient metadata to explain how they were processed.
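One possible way to keep raw data, code and processing metadata together is a small machine-readable record deposited alongside the files. The following Python sketch is only an illustration: the field names, file names and parameter shown (raw_files, processing_script, outlier_threshold, etc.) are hypothetical and do not follow any particular metadata standard or repository requirement.

```python
import json


def make_provenance_record(raw_files, script, parameters, description):
    """Assemble a plain dictionary describing how processed data were derived."""
    return {
        "raw_files": sorted(raw_files),       # inputs, in a stable order
        "processing_script": script,          # code used to transform the raw data
        "parameters": dict(parameters),       # settings needed to rerun the script
        "description": description,           # human-readable explanation
    }


# Hypothetical example values, for illustration only.
record = make_provenance_record(
    raw_files=["measurements_run2.csv", "measurements_run1.csv"],
    script="clean_and_average.py",
    parameters={"outlier_threshold": 3.0},
    description="Outliers removed, then values averaged per sample.",
)

# Serialise to JSON so the record can be stored next to the dataset.
provenance_json = json.dumps(record, indent=2, sort_keys=True)
```

A plain-text README carrying the same information would serve the same purpose; the point is that the link between raw data, code and results is written down and preserved with the data.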

Sampling should also be considered as a way to reduce costs. One option is to preserve only the data directly needed to validate the results, or even less if there is no need to be able to prove the results.