At EPFL, the Library Research Data Management team is at your disposal to provide you with expert advice, support, and solutions. On this page, you can also find many tools and guides to master active data management.
During the lifespan of a project, researchers have to deal with data management on a daily basis. Active research data management refers to the tasks and tools required to ensure that data, code, and related information remain organized and safely backed up, for research to be reproducible and secure. The key to good active research data management is data documentation.
1 | Formatting
A file format is a standard way to encode data for storage in a computer file. File formats can be proprietary or free and can be unpublished or open.
When selecting file formats, prefer formats that are open, interoperable across platforms and applications, and commonly used by the research community. If data are stored in one format during collection and analysis and then converted to another format for preservation, document any features that may be lost in the conversion.
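To illustrate what "features lost in conversion" can mean, here is a minimal sketch that round-trips one record through CSV using only the Python standard library. The record and its field names (`sample_id`, `temperature_c`, `measured`) are hypothetical; the point is only that CSV stores everything as text, so the original data types have to be documented elsewhere.

```python
import csv
import io

# Hypothetical record with typed values (integer, float, boolean).
rows = [{"sample_id": 1, "temperature_c": 21.5, "measured": True}]

# Write the record to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV has no type information, so every value is a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

# List the fields whose original type did not survive the round trip.
lost = {
    key: type(value).__name__
    for key, value in rows[0].items()
    if not isinstance(restored[0][key], type(value))
}
print(lost)
```

A list like `lost` is the kind of information worth recording in the dataset documentation before the original files are archived or discarded.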
2 | Metadata
Metadata is “data that provide information about other data” (source: Merriam-Webster). Data documentation and metadata provide essential information about the context, structure, provenance, and content of the data. The goal is to allow others (including your future self) to use and interpret the data. The minimum documentation of a dataset is to describe it within a README file and, if appropriate, a naming convention.
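As an illustration of this minimum documentation, the sketch below generates a plain-text README and a file name that follows a naming convention. The README fields and the `project_sample_YYYYMMDD_vNN.ext` pattern are assumptions chosen for the example, not a layout prescribed by this page; adapt them to your discipline.

```python
from datetime import date


def build_readme(title, authors, description, files, license_name):
    """Return the text of a minimal README describing a dataset."""
    lines = [
        f"Title: {title}",
        f"Authors: {', '.join(authors)}",
        f"License: {license_name}",
        "",
        "Description",
        description,
        "",
        "Files",
    ]
    for name, purpose in files.items():
        lines.append(f"- {name}: {purpose}")
    return "\n".join(lines) + "\n"


def build_filename(project, sample, run_date, version, extension):
    """Apply a hypothetical convention: project_sample_YYYYMMDD_vNN.ext."""
    return f"{project}_{sample}_{run_date:%Y%m%d}_v{version:02d}.{extension}"


filename = build_filename("growth", "s01", date(2024, 3, 5), 2, "csv")
readme = build_readme(
    title="Example dataset",
    authors=["A. Researcher"],
    description="Hypothetical growth measurements.",
    files={filename: "raw measurements, one row per time point"},
    license_name="CC BY 4.0",
)
print(readme)
```

Putting the date in `YYYYMMDD` form and zero-padding the version number keeps files in chronological order when sorted by name.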
3 | Version control
Version control keeps track of changes to files (data and code) over time, so that older versions remain accessible in the future.
4 | Processing environment and workflows
A scientific workflow is a formal definition of the research process. In addition to automating tasks, such formalization increases research reproducibility. Workflows are made of a series of computational or data manipulation steps and are machine-readable. Scientific workflow management software allows one to easily manage complex or repetitive operations.
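The idea of a workflow as a machine-readable series of steps can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the step functions (`load`, `clean`, `summarize`) and the input string are hypothetical, and dedicated workflow managers add features such as dependency tracking and caching that are not shown here.

```python
def load(raw):
    """Parse a comma-separated string into a list of floats."""
    return [float(x) for x in raw.split(",")]


def clean(values):
    """Drop negative values, treated here as invalid measurements."""
    return [v for v in values if v >= 0]


def summarize(values):
    """Compute simple summary statistics."""
    return {"n": len(values), "mean": sum(values) / len(values)}


# The workflow itself is data: an ordered list of steps.
WORKFLOW = [load, clean, summarize]


def run_workflow(steps, data):
    """Run each step on the output of the previous one and log step names."""
    log = []
    for step in steps:
        data = step(data)
        log.append(step.__name__)
    return data, log


result, log = run_workflow(WORKFLOW, "1.0,-2.0,3.0")
print(result, log)
```

Because the sequence of steps is written down explicitly, the same analysis can be rerun on new data, and the log records exactly which steps were applied.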
5 | ELNs and LIMS
Electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS) are applications that allow researchers in a laboratory to track samples and test results. An ELN is a digital version of the traditional paper notebook, with the advantage of many built-in features.
6 | Back-up and cloud
For general storage, use File Storage, the central storage and backup service provided by EPFL VPO. It also offers object storage hosted on-site and based on the open standard S3 protocol: use the XaaS portal to request buckets.
For storage capacity and help, please refer to your Faculty-IT.
- ENAC-IT: epfl.ch/schools/enac/fr/a-propos/enac-it/
- SV-IT: epfl.ch/schools/sv/it (in the case of sensitive data, check secure data acquisition at https://redcap.epfl.ch)
- STI-IT: epfl.ch/schools/sti/it/
- SB-IT: https://sb-it.epfl.ch/ (currently not available)
- IC-IT: https://www.epfl.ch/schools/ic/it/en/it-service-ic-it/
- CDM-IT: https://www.epfl.ch/schools/cdm/college-of-management-of-technology/about/internal-services/it-services/it-administration
7 | Synchronize and share
File synchronization and research data sharing can be done through various platforms, depending on the needs and location of the data and partners. Since cloud-based solutions are often chosen, remember to check whether the data include personal or otherwise sensitive information and which data protection rules apply.
Useful tools
1 & 2 | Format and metadata
Dublin Core
A simple set of 15 terms that can be used to describe datasets and more generally electronic resources. The Qualified Dublin Core is an extension of the terms, adding notably the ability to refine the semantics via standard controlled vocabularies.
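As a rough sketch of how the 15 Dublin Core elements can describe a dataset, the following Python snippet serializes a record to XML with the standard library. The record values and the `metadata` wrapper element are hypothetical; only the element names and the `http://purl.org/dc/elements/1.1/` namespace come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def to_dublin_core_xml(record):
    """Serialize a dict of Dublin Core elements to an XML string."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        if name in record:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = record[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description.
xml_text = to_dublin_core_xml({
    "title": "Example dataset",
    "creator": "A. Researcher",
    "date": "2024-03-05",
    "format": "text/csv",
    "rights": "CC BY 4.0",
})
print(xml_text)
```

All elements are optional and repeatable in Dublin Core, which is why the sketch only checks that the names are valid rather than requiring a fixed set.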
CSV on the web
CSV on the Web is a W3C recommendation for documenting CSV files, which are often difficult to reuse because their structure, content, or relation to other tabular data files is not described.
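A minimal sketch of such a description is shown below: a Python function that builds a CSV on the Web metadata document for a single table and prints it as JSON. The file name `measurements.csv` and the column definitions are hypothetical; the `@context`, `url`, `tableSchema`, `columns`, `name`, `titles`, and `datatype` keys are taken from the CSV on the Web metadata vocabulary.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a CSV on the Web metadata description for one CSV file.

    `columns` is a list of (name, title, datatype) tuples.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical table: one row per measurement.
metadata = csvw_metadata(
    "measurements.csv",
    [
        ("sample_id", "Sample ID", "integer"),
        ("temperature_c", "Temperature (°C)", "number"),
        ("measured_on", "Measurement date", "date"),
    ],
)
print(json.dumps(metadata, indent=2, ensure_ascii=False))
```

By convention the output would be saved next to the data file, for example as `measurements.csv-metadata.json`, so that the column types and titles travel with the CSV.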
HDF5
Hierarchical Data Format version 5: a set of data formats supported by many platforms, including Java, Matlab, Octave, Mathematica, Python, R and Julia, with interesting metadata capabilities.
DataCite Metadata
One of the most popular descriptive metadata schemas, much more precise than Dublin Core. However, it requires much more work to implement and is most frequently used by professional data repositories.
3 | Version control
GitLab EPFL
The EPFL instance of GitLab Community Edition lets you use GitLab's version control and related features with EPFL access and storage, as the main open alternative to GitHub.
GitHub
Commercial platform for code co-creation, curation, sharing and testing. It allows code version control and easy collaboration, but the storage is not on local servers and cannot be self-hosted. You should never store legally protected data on it, as doing so may incur penalties. GitHub itself suggests using Zenodo, via the GitHub API, to make your code citable and obtain a DOI.
4 | Processing environments and workflows
protocols.io
Platform for collaboratively developing, storing, organizing and searching reproducible methods, procedures, manuals, protocols, etc. and also publishing them with a DOI.
Renku
Software platform that enables reproducible and collaborative data science, with reproducible analyses and automatic generation of Knowledge Graphs.
OMERO
An EPFL-BIOP image data management system designed for the EPFL Microscopy community to support the vast amount of imaging data. It provides easy ways of storing, accessing, displaying and working with large amounts of imaging data.
AiiDA
Automated Interactive Infrastructure and Database, especially useful for Computational Science. It is developed publicly on the aiida-core GitHub repository.
Snakemake
Workflow management system providing a fast and comfortable execution environment, to reduce workflow complexity. It offers a clean and modern specification language in Python style.
syre
Free and open-source software used to manage and analyze data. It uses top-down organization and bottom-up analysis, implemented in a tree structure. Its simple concept and clear visualization make it ideal for establishing and reproducing data workflows.
5 | ELNs and LIMS
ELN EPFL
Conceived especially around chemistry needs, it is an Electronic Laboratory Notebook as well as a repository for spectroscopic data, with some helpful tools.
SLIMS
A Laboratory Information Management System (LIMS) from an EPFL spinoff, now owned by Genohm SA. It can be hosted locally on EPFL servers. It offers ELN-like functionalities (file organization by projects, experiments, notes, etc.) plus sample management and advanced functions especially useful for life-science projects.
ELN Comparison Matrix
A matrix table, comparing and contrasting various Electronic Lab Notebook (ELN) options, with the goal to aid researchers in the process of identifying appropriate solutions. Based on a survey of 26 vendors, this ELN comparison matrix serves as an educational tool and decision map.
openBIS
A combined data management platform for (i) inventory management, (ii) electronic lab notebook and (iii) research data management. Developed by ETH Zurich and available as open source.
6 | Back-up and cloud
SWITCHDrive
Cloud solution for Swiss universities with 100 GB online storage per user (standard), institutional login, and real-time collaboration on documents. Data are stored and processed in Switzerland. The multi-platform client encrypts data during transfer, making it easier to comply with the Federal Act on Data Protection and the official secrecy laws. (Note: transfer encryption is not file encryption. Users have to encrypt the data themselves as needed.)
GDrive EPFL
Access EPFL’s Google Apps for Education (Storage, Docs, Sheets, Slides, Forms, Google Public Groups) with your EPFL email address. This service is not for private use, but intended for your work at EPFL. You should never store legally protected data on it, as doing so may incur penalties. It is advised not to store any administrative document in the cloud.
Atempo Lina
Cross-platform backup software, available to EPFL staff and supported by EPFL IT for central backup of end-user devices. It allows the automatic backup of EPFL users’ computers: the installed agent sends encrypted data to on-site servers. Users can restore data at any time. Contact your Faculty IT or [email protected].
7 | Synchronization and sharing
rsync
Open-source utility for synchronizing and transferring files across computer systems with minimal network usage. Also used by SCITAS.
Sync comparison table
From Wikipedia, an exhaustive comparison of file synchronization software, classified by programming language, platform, license and much more.