Active data management


At EPFL, the Library Research Data Management team is at your disposal to provide you with expert advice, support, and solutions. On this page, you can also find many tools and guides to master active data management.


During the lifespan of a project, researchers have to deal with data management on a daily basis. Active research data management refers to the tasks and tools required to ensure that data, code, and related information remain organized and safely backed up, for research to be reproducible and secure. The key to good active research data management is data documentation.

1 | Formatting

A file format is a standard way to encode data for storage in a computer file. File formats can be proprietary or free and can be unpublished or open.
When selecting file formats, prefer formats that are open, interoperable across platforms and applications, and commonly used by the research community. If data are stored in one format during collection and analysis and then converted to another format for preservation, list the features that may be lost in the conversion.
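One way to list what a conversion loses is to round-trip a few records through the target format and compare. The sketch below does this for CSV using only the Python standard library. The field names and values are hypothetical, and the functions `csv_round_trip` and `lost_features` are illustrative helpers, not part of any EPFL tool.

```python
import csv
import io


def csv_round_trip(records):
    """Write records to CSV text and read them back, as a converter would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


def lost_features(records):
    """List the fields whose value or type changed after the round trip."""
    report = []
    for before, after in zip(records, csv_round_trip(records)):
        for field, value in before.items():
            restored = after[field]
            if restored != value:
                report.append(
                    f"{field}: {type(value).__name__} {value!r} became "
                    f"{type(restored).__name__} {restored!r}"
                )
    return report


# Hypothetical measurements, for illustration only.
measurements = [
    {"sample": "A1", "temperature_c": 21.5, "valid": True, "note": None},
]

for line in lost_features(measurements):
    print(line)
```

In this example CSV keeps the text but drops the data types: the number, the boolean, and the missing value all come back as strings. A report like this is the kind of information worth recording alongside the converted files.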

 

2 | Metadata

Metadata is “data that provide information about other data” (source: Merriam-Webster). Data documentation and metadata provide essential information about the context, structure, provenance, and content of the data. The goal is to allow others (including your future self) to use and interpret the data. The minimum documentation of a dataset is to describe it within a README file and, if appropriate, a naming convention.
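As a minimal sketch of this baseline, the example below checks file names against a naming convention and assembles a plain-text README. The convention (`YYYYMMDD_project_description_vNN.ext`), the README fields, and all names and values are assumptions chosen for illustration; adapt them to your own project.

```python
import re

# Hypothetical convention: YYYYMMDD_project_description_vNN.ext
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2})\.(?P<extension>[a-z0-9]+)$"
)


def check_name(filename):
    """Return the parsed parts of a file name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


def build_readme(title, creator, description, filenames):
    """Assemble a plain-text README describing the dataset and its files."""
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        f"Description: {description}",
        "Naming convention: YYYYMMDD_project_description_vNN.ext",
        "Files:",
    ]
    for name in filenames:
        status = "ok" if check_name(name) else "does not follow the convention"
        lines.append(f"  {name} ({status})")
    return "\n".join(lines) + "\n"


# Hypothetical dataset, for illustration only.
print(build_readme(
    title="Soil pH survey",
    creator="Jane Doe",
    description="pH measurements from two field sites.",
    filenames=["20240115_soilph_site-a_v01.csv", "final data (2).xlsx"],
))
```

Writing the returned text to a `README.txt` file next to the data keeps the description and the files together.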

 

3 | Version control

Version control is a method for tracking changes to files (data and code) over time, so that older versions remain accessible in the future.
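The idea can be illustrated with a toy in-memory history that stores each version under a hash of its content. This is a conceptual sketch only: the `VersionHistory` class is hypothetical and is not how Git or any specific tool is implemented. For real projects, use an actual version control system.

```python
import hashlib


class VersionHistory:
    """Toy history: each committed content is stored under its SHA-256 hash."""

    def __init__(self):
        self._contents = {}  # hash -> file content
        self._log = []       # ordered (hash, message) pairs

    def commit(self, content: bytes, message: str) -> str:
        """Record a new version and return its identifier."""
        digest = hashlib.sha256(content).hexdigest()
        self._contents[digest] = content
        self._log.append((digest, message))
        return digest

    def checkout(self, digest: str) -> bytes:
        """Retrieve the content of any earlier version."""
        return self._contents[digest]

    def log(self):
        """Return the ordered list of (hash, message) pairs."""
        return list(self._log)


history = VersionHistory()
first = history.commit(b"x,y\n1,2\n", "Initial data")
second = history.commit(b"x,y\n1,2\n3,4\n", "Add second measurement")

# The older version remains accessible after later changes.
print(history.checkout(first).decode())
for digest, message in history.log():
    print(digest[:8], message)
```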

 

4 | Processing environment and workflows

A scientific workflow is a formal definition of the research process. In addition to automating tasks, such formalization increases research reproducibility. Workflows are made of a series of computational or data manipulation steps and are machine-readable. Scientific workflow management software allows one to easily manage complex or repetitive operations.
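A minimal sketch of such a formal definition: each step is a function, the dependencies between steps are declared as data, and a small runner executes the steps in dependency order. It assumes Python 3.9 or later for `graphlib`. The step names, the data, and the `run_workflow` helper are hypothetical; dedicated workflow managers add features such as caching and parallel execution.

```python
from graphlib import TopologicalSorter


def load(_results):
    """Stand-in for reading raw data (hypothetical values)."""
    return [3.0, 4.5, 6.0]


def clean(results):
    """Keep only the values above a hypothetical threshold."""
    return [value for value in results["load"] if value > 3.5]


def summarize(results):
    """Compute the mean of the cleaned values."""
    values = results["clean"]
    return sum(values) / len(values)


# Machine-readable workflow definition: steps and their dependencies.
STEPS = {"load": load, "clean": clean, "summarize": summarize}
DEPENDS_ON = {"load": set(), "clean": {"load"}, "summarize": {"clean"}}


def run_workflow(steps, depends_on):
    """Run every step after the steps it depends on; return all results."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        results[name] = steps[name](results)
    return results


results = run_workflow(STEPS, DEPENDS_ON)
print(results["summarize"])
```

Because the order is derived from the declared dependencies rather than from the order in which the steps are written, anyone can rerun the same sequence from the definition alone.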

 

5 | ELNs and LIMS

Electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS) are applications that allow researchers in a laboratory to track samples and test results. An ELN is a digital version of the traditional paper notebook, with the advantage of many built-in features.

 

6 | Back-up and cloud

For general storage, use File Storage, the central storage and backup service provided by EPFL VPO. It also offers object storage hosted on-site and based on the open standard S3 protocol: use the XaaS portal to request buckets.

 

For questions about storage capacity and for help, please contact your Faculty IT.

7 | Synchronize and share

File synchronization and research data sharing can be done through various platforms, depending on the needs and location of the data and partners. Since cloud-based solutions are often chosen, remember to check whether your data include personal or otherwise sensitive information that requires protection.

  

Useful tools

 

1 & 2 | Format and metadata

Dublin Core

A simple set of 15 terms that can be used to describe datasets and more generally electronic resources. The Qualified Dublin Core is an extension of the terms, adding notably the ability to refine the semantics via standard controlled vocabularies.
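As an illustration, the sketch below serializes a few of the 15 elements as XML with the Python standard library. The namespace URI is the one used for the Dublin Core element set; the enclosing `metadata` element, the `dublin_core_record` helper, and the sample values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core elements as an XML string."""
    ET.register_namespace("dc", DC_NAMESPACE)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not one of the 15 Dublin Core elements")
        element = ET.SubElement(root, f"{{{DC_NAMESPACE}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description, for illustration only.
print(dublin_core_record({
    "title": "Soil pH survey",
    "creator": "Doe, Jane",
    "date": "2024-01-15",
    "format": "text/csv",
    "rights": "CC BY 4.0",
}))
```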

CSV on the web

CSV on the Web is a recommendation for documenting CSV files, which are often difficult to reuse because their structure, content, or relation to other tabular data files is not described.
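A minimal sketch of such a description: a JSON document that names the CSV file and declares each column with a title and a datatype. The file name, column names, and the `csvw_metadata` helper are hypothetical; consult the recommendation itself for the full vocabulary.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a CSV on the Web description for one CSV file.

    columns is a list of (name, title, datatype) tuples.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical file and columns, for illustration only.
metadata = csvw_metadata(
    "measurements.csv",
    [
        ("sample", "Sample ID", "string"),
        ("temperature_c", "Temperature (°C)", "number"),
        ("measured_on", "Date", "date"),
    ],
)

print(json.dumps(metadata, indent=2, ensure_ascii=False))
```

By convention, this description is typically saved next to the data as `measurements.csv-metadata.json`, so that tools and readers can find it.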

HDF5

Hierarchical Data Format version 5: a set of data formats supported by many platforms, including Java, Matlab, Octave, Mathematica, Python, R and Julia, with interesting metadata capabilities.

DataCite Metadata

One of the most popular descriptive metadata schemas, much more precise than Dublin Core. However, it requires much more work to implement and is most frequently used by professional data repositories.


3 | Version control

c4science

Platform for scientific code co-creation, curation, sharing and testing. Integrated with Git, it allows code version control, easy collaboration, and the use of EPFL storage solutions for your projects. It is available to the entire Swiss university community and accessible to external collaborators.

GitLab EPFL

The EPFL instance of GitLab Community Edition lets you use GitLab functionalities such as version control within the comfort of EPFL access and storage. It is the main open alternative to GitHub.

GitHub

Commercial platform for code co-creation, curation, sharing and testing. It allows code version control and easy collaboration, but the storage is not on local servers and cannot be self-hosted. You should never store legally protected data on it, as doing so could expose you to penalties. Via the GitHub API, GitHub itself suggests using Zenodo to make your code citable and obtain a DOI.


4 | Processing environments and workflows

protocols.io

Platform for collaboratively developing, storing, organizing and searching reproducible methods, procedures, manuals, protocols, etc. and also publishing them with a DOI.

Renku

Software platform that enables reproducible and collaborative data science, with reproducible analyses and automatic generation of Knowledge Graphs.

OMERO

An EPFL-BIOP image data management system designed for the EPFL microscopy community to support its vast amount of imaging data. It provides easy ways of storing, accessing, displaying and working with large amounts of imaging data.

AiiDA

Automated Interactive Infrastructure and Database, especially useful for Computational Science. It is developed publicly on the aiida-core GitHub repository.

Snakemake

Workflow management system providing a fast and comfortable execution environment to reduce workflow complexity. It offers a clean and modern specification language in Python style.

Thot – Data

Free and open-source software used to manage and analyze data. Thot uses top-down organization and bottom-up analysis, implemented in a tree structure. Its simple concept and clear visualization make it ideal for establishing and reproducing data workflows.


5 | ELNs and LIMS

ELN EPFL

Designed especially around chemistry needs, it is an Electronic Laboratory Notebook as well as a repository for spectroscopic data, with some helpful tools.

SLIMS

An EPFL spinoff offering a Laboratory Information Management System (LIMS), now owned by Genohm SA. It can be hosted locally on EPFL servers. It offers ELN-like functionalities (file organization by projects, experiments, notes, etc.) plus sample management and advanced functions especially useful for life-science projects.

ELN Comparison Matrix

A matrix table, comparing and contrasting various Electronic Lab Notebook (ELN) options, with the goal to aid researchers in the process of identifying appropriate solutions. Based on a survey of 26 vendors, this ELN comparison matrix serves as an educational tool and decision map.

openBIS

A combined data management platform for (i) inventory management, (ii) electronic lab notebook and (iii) research data management. Developed by ETHZ and available as open source.


6 | Back-up and cloud

SWITCHDrive

Cloud solution for Swiss universities with 100 GB of online storage per user (standard), institutional login, and real-time collaboration on documents. Data are stored and processed in Switzerland. The multi-platform client uses end-to-end encryption, making it easy to comply with the Federal Act on Data Protection and the official secrecy laws. (Note: transfer encryption is not file encryption. Users have to encrypt the data themselves as needed.)

GDrive EPFL

Access EPFL’s Google Apps for Education (Storage, Docs, Sheets, Slides, Forms, Google Public Groups) with your EPFL email address. This service is not for private use, but intended for your work at EPFL. You should never store legally protected data on it, as doing so could expose you to penalties. It is advised not to store any administrative documents in the cloud.

Atempo Lina

Cross-platform backup software, available to EPFL staff and supported by EPFL IT for central backup of end-user devices. It allows the automatic backup of EPFL users’ computers: the installed agent sends encrypted data to on-site servers. Users can restore data at any time. Contact your Faculty IT or [email protected].


7 | Synchronization and sharing

rsync

Open-source utility for synchronizing and transferring files across computer systems with minimal network usage. Also used by SCITAS.
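A hedged sketch of how an rsync call might be assembled from a script. The helper `rsync_command`, the paths, and the host name are hypothetical. The options used (`--archive`, `--compress`, `--verbose`, `--dry-run`, `--delete`) are standard rsync options. The example only builds and prints the command; it does not run it.

```python
import shlex


def rsync_command(source, destination, dry_run=True, delete=False):
    """Build an rsync command as an argument list, without executing it."""
    command = ["rsync", "--archive", "--compress", "--verbose"]
    if dry_run:
        # Show what would be transferred without changing anything.
        command.append("--dry-run")
    if delete:
        # Remove files at the destination that no longer exist at the source.
        command.append("--delete")
    command += [source, destination]
    return command


# Hypothetical source folder and remote destination, for illustration only.
command = rsync_command("results/", "user@host.example.org:/backup/results/")
print(shlex.join(command))
```

Defaulting to a dry run is a deliberate choice here: reviewing the list of planned transfers first is a cheap safeguard, especially when `--delete` is enabled. Once the output looks right, the same list can be passed to `subprocess.run(command, check=True)` with `dry_run=False`.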

Druva Insync

Centralized back-up and synchronization system maintained by EPFL VPO, allowing the automatic backup of user data on their PCs.

Sync comparison table

From Wikipedia, an extensive comparison of file synchronization software, classified by programming language, platform, license and much more.


Contact

[email protected]


+41 21 693 21 56

