Active data management

Discipline-specific tools can help you manage your active data across the creation, processing and analysis phases of the data life cycle. In particular:

  • Source code version control systems
  • Electronic laboratory notebooks (ELN) and laboratory information management systems (LIMS)
  • Computational workflow engines
  • Computer science notebooks and environments (see the “Data analysis and visualization” page)

Source code version control systems

c4science: EPFL project with many useful features, notably:

  • Version control system (supports Git, Subversion and Mercurial)
  • Unlimited number of public and private projects/repositories
  • Hosted by SWITCH (in Lausanne with a backup in Zurich) and accessible to the whole Swiss academic community
  • Repositories can easily be made accessible to research partners outside Switzerland
  • Additional features:
    • Documentation (wiki)
    • Project management (tasks)
    • Continuous integration (Jenkins)

gitlab.epfl.ch: Git-based collaborative platform:

  • Hosted at EPFL and available for the EPFL community
  • User-friendly interface

ELN / LIMS

Electronic Laboratory Notebooks (ELN) are software that replaces paper laboratory notebooks and adds further functionality. They allow collaborative work and support native digital content (such as microscopy and gel images or DNA sequences). Depending on the tool, they may have the same legal value as signed paper notebooks.

Laboratory Information Management Systems (LIMS) are information management software supporting modern laboratory operations, such as the management of laboratory equipment and samples, including their location and associated data.

ELN/LIMS are tools combining the two sets of functions.

At EPFL, the following systems are available:

  • Life sciences: SLims, an ELN/LIMS (also used by some labs in STI)
  • Chemistry: eln.epfl.ch, a system developed in-house and accessible to all EPFL members

Computational workflow

A scientific workflow is a formal definition of the research process. In addition to automating tasks, such formalization increases research reproducibility. Workflows consist of a series of computational or data manipulation steps and are machine-readable. Scientific workflow management software makes it easy to manage complex or repetitive operations.
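To make the idea of a machine-readable series of steps concrete, here is a minimal sketch in plain Python. It is not taken from any of the tools described on this page: the step names (`load`, `clean`, `summarize`) and the `WORKFLOW` structure are hypothetical, and a real workflow engine does far more. It only shows steps declared with their dependencies and executed in a valid order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical steps: each receives the results of the steps already run.
def load(results):
    return [3, 1, 2]


def clean(results):
    return sorted(results["load"])


def summarize(results):
    data = results["clean"]
    return {"n": len(data), "max": max(data)}


# The workflow definition: step name -> (names of prerequisite steps, function).
WORKFLOW = {
    "load": (set(), load),
    "clean": ({"load"}, clean),
    "summarize": ({"clean"}, summarize),
}


def run_workflow(workflow):
    """Run every step after the steps it depends on and return all results."""
    graph = {name: deps for name, (deps, _) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        _, func = workflow[name]
        results[name] = func(results)
    return results


if __name__ == "__main__":
    print(run_workflow(WORKFLOW)["summarize"])
```

Because the dependencies are declared as data rather than buried in a script, the same definition can be inspected, rerun, or shared, which is the property that supports reproducibility.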

Snakemake

Snakemake is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. Snakemake workflows are essentially Python scripts extended by declarative code to define rules (for more information, refer to the Snakemake documentation).

Snakemake supports:

  • Remote file handling (HTTP(S), SFTP, Dropbox, Google Drive)
  • Data provenance and rule versions
  • Parallelization
  • Suspend / resume
  • Logging
  • Graphical workflow generation
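The idea behind resuming a workflow can be sketched in plain Python. This is not Snakemake syntax or its actual implementation, only a make-style illustration under simple assumptions: a rule is run again when its output file is missing or older than one of its inputs. The helper names (`needs_run`, `run_rule`, `concatenate`) and file names are hypothetical.

```python
import os
import tempfile


def needs_run(inputs, output):
    """Return True if the output is missing or older than any input file."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_rule(inputs, output, action):
    """Run action(inputs, output) only when the output is out of date."""
    if needs_run(inputs, output):
        action(inputs, output)
        return True
    return False


def concatenate(inputs, output):
    # Hypothetical action: join all input files into one output file.
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as src:
                out.write(src.read())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.txt")
        merged = os.path.join(tmp, "merged.txt")
        with open(raw, "w") as handle:
            handle.write("sample data\n")
        # First call: the output does not exist yet, so the rule runs.
        print(run_rule([raw], merged, concatenate))
        # Second call: the output is up to date, so the rule is skipped.
        print(run_rule([raw], merged, concatenate))
```

With this kind of check, an interrupted run can be restarted without repeating the steps whose outputs are already in place.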

AiiDA

“AiiDA is a flexible and scalable informatics’ infrastructure to manage, preserve, and disseminate the simulations, data, and workflows of modern-day computational science.

Able to store the full provenance of each object, and based on a tailored database built for efficient data mining of heterogeneous results, AiiDA gives the user the ability to interact seamlessly with any number of remote HPC resources and codes, thanks to its flexible plugin interface and workflow engine for the automation of complex sequences of simulations” (AiiDA website). AiiDA is developed at EPFL.

Taverna

“Taverna is an open source multi-platform tool for designing and executing workflows. Taverna is discipline independent and used in many domains, such as bioinformatics, cheminformatics, medicine, astronomy, social science, music, and digital preservation” (Wikipedia). It is composed of several tools, among which:

  • Taverna Workbench: desktop application for graphically creating, editing and running workflows
  • Taverna Command Line: runs workflows from the command prompt, e.g. for automated execution
  • Taverna Server: remote workflow execution service that can be set up on a dedicated server

Taverna is also modular: many plugins are available. Taverna supports myExperiment, and thus allows re-using or sharing workflows in a few clicks. Existing workflows are a great source of inspiration for developing your own, either by embedding them directly as sub-workflows or by simply using them as starting points for your own designs.

Pegasus

Pegasus runs on various environments including personal computers, campus clusters, grids, and clouds. It is quite flexible, but more difficult to learn than Taverna. No graphical design tool is available.

Pegasus helps you construct workflows in abstract terms without worrying about the details of the underlying execution environment or the particulars of the low-level specifications required by the middleware (Condor, Globus, or Amazon EC2).

Pegasus is used in many scientific domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others.

Pegasus keeps track of what has been done (provenance) including the locations of data used and produced, and which software was used with which parameters.
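As an illustration of what such a provenance record might contain, here is a minimal sketch in plain Python. It does not reproduce the format or API used by Pegasus: the function names, field names, and the example step are assumptions chosen for illustration. It records which files a step used and produced (with checksums), and which software and parameters were involved.

```python
import hashlib
import json
import os
import platform
import tempfile
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(step, inputs, outputs, parameters, software):
    """Describe one step: data used and produced, software and parameters."""
    return {
        "step": step,
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "software": software,      # hypothetical: tool name and version
        "parameters": parameters,  # hypothetical: settings used for this run
        "inputs": {path: sha256_of(path) for path in inputs},
        "outputs": {path: sha256_of(path) for path in outputs},
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.csv")
        result = os.path.join(tmp, "result.csv")
        with open(raw, "w") as handle:
            handle.write("a,b\n1,2\n")
        with open(result, "w") as handle:
            handle.write("sum\n3\n")
        record = provenance_record(
            "sum_columns", [raw], [result],
            parameters={"columns": ["a", "b"]},
            software={"name": "example-tool", "version": "0.1"},
        )
        print(json.dumps(record, indent=2))
```

Storing one such record per step, next to the data, lets someone later check that a result file still matches the inputs and settings that produced it.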

Pegasus has a number of features that contribute to its usability and effectiveness:

  • portability and reuse
  • performance and scalability
  • provenance and data management
  • reliability and error recovery