Data repositories are infrastructures for preserving and/or publishing research outputs.
A distinction can be made between disciplinary and multi-disciplinary repositories:
- Disciplinary repositories are generally a good choice, since they are adapted to subject-specific data and may be better known in the disciplinary community. However, data stewardship requires a lot of resources (staff, machine time), and some small disciplinary repositories do not always meet basic data management standards.
- Multi-disciplinary repositories accept any type of data, and some of them offer excellent data management services, even for free.
Finding the right repository
When choosing among different repositories, it is important to consider the following elements to find the most relevant one and maximise the impact of your data:
- disciplinary data sharing practices;
- disciplinary/community standard repositories;
- the combination of ease of deposit, accessibility, discoverability, curation, preservation infrastructure, organizational persistence and support for the formats and standards you use.
“Re3data is a global registry of research data repositories from all academic disciplines. It provides an overview of existing research data repositories in order to help researchers to identify a suitable repository for their data” (Wikipedia). Re3data indexes over 1500 repositories and offers search filters.
Some of these filters are more significant than others, notably:
- Subjects: useful to narrow a search to repositories relevant to your discipline. However, keep in mind that some multi-disciplinary repositories may be better solutions than subject-specific ones, especially large, well-curated services such as Zenodo, Dryad or Figshare.
- Certificates: attest that a repository is visible and well curated, and that its data are well described and of good quality. The Data Seal of Approval (DSA) and World Data System (WDS) certificates are especially relevant.
- Data access: open access to the data encourages the reuse and citation of your work.
- Data license: using recognised data licenses gives a clear definition of what users may or may not do with a dataset. Creative Commons licenses in particular (CC-BY, CC0) let you grant or retain various rights on datasets. They are relatively easy to understand and, at the same time, legally well defined and machine-readable. For computer code, the following licenses should be considered: Apache, Berkeley Software Distribution (2- and 3-clause BSD licenses), GNU General Public Licenses (GPL, LGPL, AGPL), Public Domain.
- Metadata standards: used to describe datasets efficiently, which is essential for their reuse and discoverability. Support for Dublin Core (DC) provides a minimal, simple description.
- PID Systems: persistent identifiers make it possible to cite a dataset reliably and are designed to avoid broken links. The Digital Object Identifier (DOI) and the Handle System (HDL) are the most common PID systems.
- AID Systems: author identifiers facilitate the discovery of an author’s work through unambiguous identification. ORCID is a particularly valuable one.
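To make the metadata and identifier criteria more concrete, the sketch below builds a minimal Dublin Core description of a dataset and stores its DOI as a resolvable link. It is only an illustration under stated assumptions: the dataset, its creator and the DOI are hypothetical placeholders, the wrapping "record" element and the helper names doi_to_url and dublin_core_record are invented for this example, and real repositories usually generate such records for you at deposit time.

```python
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI (with or without a 'doi:' prefix) into a doi.org link."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "https://doi.org/" + doi


def dublin_core_record(fields: dict) -> str:
    """Serialise a flat dict of Dublin Core elements to XML text.

    The wrapping <record> element is an assumption made for this sketch;
    repositories use their own container formats.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset: every value below is a placeholder.
example = {
    "title": "Example dataset",
    "creator": "Doe, Jane",
    "date": "2020-01-01",
    "identifier": doi_to_url("doi:10.5072/example"),
    "rights": "CC-BY-4.0",
}

print(dublin_core_record(example))
```

Even this handful of elements (title, creator, date, identifier, rights) covers what a reader needs to find, cite and reuse a dataset, which is why support for Dublin Core is a useful baseline when comparing repositories.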
Most commonly used multi-disciplinary repositories
Zenodo is a repository operated by CERN covering all scientific disciplines. It offers free data submission for any research as long as it is openly published. In addition, DOIs are systematically assigned to records, making them cleanly citable. Another notable feature is its integration with GitHub, which makes it possible to capture, preserve and cite Git repositories.
Dryad is a curated general-purpose scientific data repository. All records in Dryad are associated with published articles, and a data publishing fee is charged for deposition (more details available here). DOIs are assigned systematically.
Figshare offers free data deposition and access for all disciplines, and systematically assigns DOIs. Unlike Zenodo and Dryad, Figshare is a commercial repository, belonging to Macmillan Publishers.
Data journals are emerging publications whose main purpose is to make research data discoverable, interpretable and reusable, providing impact and recognition for authors.
Datasets are now being recognized as primary research outputs, so it can be an interesting option to present them in a data paper. This allows the author to focus on the description of the data, its context, the acquisition methods, as well as its actual and potential use (rather than presenting new hypotheses or interpretations).
Moreover, authors can get credit, since data articles are peer-reviewed, citable publications. As data journals are always Open Access, an Article Processing Charge (APC) has to be paid by the author to cover the publication costs. It is possible to request financial support from the Library to cover part of the APC.