Swiss Data Science Center

This page lists the Swiss Data Science Center projects available to EPFL students.

The SDSC is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from it. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in select domains, with offices in Lausanne and Zurich.

datascience.ch



It may be possible to convert a thesis project into a semester project or extend a semester project to be suitable for a thesis project. If any of the present or past projects interests you, please feel free to contact us. We are always looking forward to meeting motivated and talented students who want to work on exciting projects.

These projects are closed for applications

Master Projects – Autumn 2018


Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop to data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. To this end, the platform automatically records the data science workflows, and the relationships between research artefacts (code, data, results), in a knowledge representation. Scientists can query this knowledge representation using clauses that may include relationship expressions, such as "find all research projects and results derived from a data set or a class of data sets".

This internship is about developing a Proof of Concept (PoC) of a unified query engine capable of answering such queries when the knowledge representation is decentralized. In the proposed scenario, a query must be decomposed into subqueries that are executed on multiple database servers, possibly hosted in different administrative domains governed by independent access rights.
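As an illustration of the decomposition described above, the following is a minimal sketch and not the platform's actual design. It assumes that each server can answer one simple subquery ("which artefacts are directly derived from X?") and that access rights are a per-server set of authorized users. The names `LineageStore` and `federated_descendants` are hypothetical.

```python
from collections import deque


class LineageStore:
    """Hypothetical stand-in for one database server in one administrative domain."""

    def __init__(self, name, edges, readers):
        self.name = name
        self.edges = edges            # {source artefact: [directly derived artefacts]}
        self.readers = set(readers)   # users allowed to query this store

    def derived_from(self, user, artefact):
        """Subquery: direct descendants of `artefact` that `user` may see."""
        if user not in self.readers:
            return []                 # access denied: this store contributes nothing
        return list(self.edges.get(artefact, []))


def federated_descendants(stores, user, root):
    """Breadth-first traversal that sends one subquery per store per artefact."""
    seen, queue = set(), deque([root])
    while queue:
        current = queue.popleft()
        for store in stores:
            for child in store.derived_from(user, current):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return seen


# Two illustrative domains with different access rights.
lausanne = LineageStore(
    "lausanne",
    {"dataset-A": ["model-1"], "model-1": ["figure-2"]},
    readers={"alice", "bob"},
)
zurich = LineageStore(
    "zurich",
    {"model-1": ["report-3"], "dataset-A": ["table-4"]},
    readers={"alice"},
)
stores = [lausanne, zurich]
```

With these stores, `federated_descendants(stores, "alice", "dataset-A")` collects artefacts from both domains, while the same query issued by `"bob"` only sees what the Lausanne store exposes. A real engine would also need to push filters down to each server and avoid one round trip per artefact.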

Goals/Benefits:

– Practical experience in developing complex large scale software systems
– Becoming familiar with application containerization in a cloud-based environment
– Becoming familiar with state-of-the-art big data solutions, such as graph databases
– Working in an interactive and interdisciplinary research environment

Prerequisites:

– Intermediate level experience in using Linux
– Beginner experience with application containerization in a cloud-based environment
– Good Python or Scala programming skills
– Good software engineering skills

Contact: Eric Bouillet [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop to data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. To this end, the platform provides methods to express, share and run data science workflows contributed by the data scientists in the cloud. Workflows are currently formulated in the SDSC collaborative data science platform as Directed Acyclic Graphs (DAGs) using the Common Workflow Language (CWL).

This internship is about designing a declarative workflow language similar to GNU Make, and developing a Proof of Concept (PoC) to run the workflows in a distributed application container orchestration environment such as Kubernetes.
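To make the idea of a Make-like declarative workflow concrete, here is a minimal sketch under stated assumptions: rules map a target to its prerequisites and a command, as in a Makefile, and the execution order is obtained from a topological sort of the resulting DAG. The rule set, the file names and the `plan` function are hypothetical. Dispatching each command to a container orchestrator is deliberately left out.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: target -> (prerequisites, command), as in a Makefile.
RULES = {
    "clean.csv": (["raw.csv"], "python clean.py raw.csv clean.csv"),
    "model.pkl": (["clean.csv"], "python train.py clean.csv model.pkl"),
    "report.html": (["model.pkl", "clean.csv"], "python report.py model.pkl clean.csv"),
}


def plan(rules, goal):
    """Return (target, command) pairs needed to build `goal`, prerequisites first."""
    needed, stack = {}, [goal]
    while stack:
        target = stack.pop()
        if target in needed or target not in rules:
            continue  # already collected, or a source file that has no rule
        deps, _command = rules[target]
        # Only prerequisites that themselves have a rule become graph nodes.
        needed[target] = [dep for dep in deps if dep in rules]
        stack.extend(deps)
    # TopologicalSorter raises graphlib.CycleError if the rules are not a DAG.
    order = TopologicalSorter(needed).static_order()
    return [(target, rules[target][1]) for target in order]


for target, command in plan(RULES, "report.html"):
    print(f"{target}: {command}")
```

In an orchestrated setting, each command in the plan could become one containerized job, and targets with no mutual dependency could run in parallel.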

Goals/Benefits:

Practical experience in developing complex large scale software systems
Becoming familiar with state-of-the-art application containerization and orchestration technologies such as Docker and Kubernetes.
Becoming familiar with cloud-based application development.
Working in an interactive and interdisciplinary research environment.

Prerequisites:

Intermediate level experience in using Linux
Beginner level experience with application containerization and orchestration
Good Python or Scala programming skills
Good software engineering skills

Contact: Thiebaut Johann-Michael Raymond [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop to data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. Using this platform, users can access data and run data analytics in a cloud-based computing environment managed by the platform.

This internship is about designing, implementing and testing a proof of concept of an Attribute-Based Access Control (ABAC) system to authorize access to the resource entities managed by the platform. The candidate will first demonstrate a policy decision point that grants access rights to users based on policies expressed in the form of Boolean rules that combine attributes from the user, the accessed resource and the environment. Next, the candidate will design an ABAC solution capable of operating in a federated environment, where resources are distributed across multiple administrative domains, each protected by its own policy decision point with its own access policies.
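The policy decision point described above can be sketched as follows. This is a minimal illustration, not the platform's implementation. The attribute names (`projects`, `owner`, `hour`, `domain`), the example rules and the federated lookup by resource domain are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    user: dict         # attributes of the requesting user
    resource: dict     # attributes of the accessed resource
    environment: dict  # attributes of the context (time, network, ...)


Rule = Callable[[Request], bool]


# Hypothetical atomic rules over user, resource and environment attributes.
def is_owner(req: Request) -> bool:
    return req.user["id"] == req.resource["owner"]


def same_project(req: Request) -> bool:
    return req.resource["project"] in req.user["projects"]


def office_hours(req: Request) -> bool:
    return 8 <= req.environment["hour"] < 18


# Boolean combinators used to build a policy out of atomic rules.
def all_of(*rules: Rule) -> Rule:
    return lambda req: all(rule(req) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    return lambda req: any(rule(req) for rule in rules)


# Example policy: owners always, project members only during office hours.
POLICY = any_of(is_owner, all_of(same_project, office_hours))


def decide(policy: Rule, req: Request) -> str:
    """Policy decision point: evaluate the policy for one request."""
    return "Permit" if policy(req) else "Deny"


def decide_federated(policies_by_domain: dict, req: Request) -> str:
    """Route the request to the policy of the resource's domain; deny if unknown."""
    policy = policies_by_domain.get(req.resource["domain"])
    return decide(policy, req) if policy is not None else "Deny"
```

Denying by default when no policy is registered for a domain is a design choice of this sketch. A federated deployment would also have to decide which domain is trusted to assert which user attributes.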

Goals/Benefits:

Practical experience in developing complex large scale software systems
Becoming familiar with state-of-the-art application containerization and orchestration technologies such as Docker and Kubernetes.
Becoming familiar with cloud-based application development.
Becoming familiar with state-of-the-art access control paradigms.
Working in an interactive and interdisciplinary research environment.

Prerequisites:

Intermediate level experience in using Linux
Beginner level experience with application containerization and orchestration
Good Python or Scala programming skills
Good software engineering skills

Contact: Sandra Savchenko-de Jong [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The tremendous advancement in machine learning algorithms over the last few decades has accelerated the adoption of neural networks and deep learning architectures in many applications such as image classification, natural language processing, and human action recognition [1]. These recent methods have led to impressive performance that even comes close to that of humans on certain recognition or classification tasks. However, these systems are often poorly understood and generally developed without performance guarantees. Even worse, their results are often hard for application domain experts to interpret: deep neural network algorithms are mainly based on non-linear functions, which map raw data, such as image pixels, to feature representations that are hard to interpret in terms of a priori domain knowledge. Thus, although the popular neural network architectures are highly successful in terms of performance, their lack of transparency is a significant impediment to their adoption as advanced data science techniques in sensitive applications such as medical diagnosis.

The goal of this project is to attempt to interpret deep architectures by studying the structure of their inner layer representations and, based on this structure, to find coherent explanations for their classification decisions. To that end, we plan to use tools from graph theory and graph signal processing [3]. The obtained results will be compared with classical feature visualization techniques [2]. The proposed algorithm will be tested on classical computer vision datasets such as ImageNet, as well as on medical cancer images.
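One simple way to connect inner layer representations with graph signal processing is sketched below. It is an assumed illustration, not the project's method: build a k-nearest-neighbour graph over the feature vectors of a layer, treat the class labels as a signal on that graph, and measure the signal's smoothness with the Laplacian quadratic form, which for an unweighted graph is the sum of (x_i - x_j)^2 over all edges. The toy features and the function names are hypothetical.

```python
import math


def knn_graph(features, k):
    """Undirected, unweighted k-nearest-neighbour graph as a set of edges (i, j), i < j."""
    n = len(features)
    edges = set()
    for i in range(n):
        distances = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        for _distance, j in distances[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def laplacian_quadratic_form(edges, signal):
    """x^T L x for the combinatorial Laplacian: sum over edges of (x_i - x_j)^2."""
    return sum((signal[i] - signal[j]) ** 2 for i, j in edges)


# Toy "layer activations" for four samples forming two well-separated groups.
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
edges = knn_graph(features, k=1)

aligned_labels = [0, 0, 1, 1]    # labels agree with the feature geometry
scrambled_labels = [0, 1, 0, 1]  # labels cut across the feature geometry
print(laplacian_quadratic_form(edges, aligned_labels))
print(laplacian_quadratic_form(edges, scrambled_labels))
```

A low value means that samples close to each other in the layer's feature space tend to share a label, which is one possible indicator that the layer has organized the data by class.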

Goals/Benefits:

Research experience in the emerging topic of interpretability/explainability of deep nets. If successful, the project will lead to a scientific publication.
Practical experience with state-of-the-art deep learning architectures.
Exposure to advanced optimization techniques.

Prerequisites:

Experience with deep learning frameworks such as PyTorch or TensorFlow.
Good knowledge of Python.
Knowledge of discrete optimization is a plus.
Motivation to work on a challenging research topic.

References:

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015

[2] https://distill.pub/2018/building-blocks/

[3] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.

[4] Nicolas Papernot and Patrick McDaniel, “Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning”, arXiv:1803.04765, 2018.

Contact: Dorina Thanou [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Generative Adversarial Networks (GANs) are notoriously difficult to train because the min-max nature of the problem is inherently unstable. Training therefore requires plenty of tweaks. In this project, a feedback mechanism will be tried to regulate GAN training. When the discriminator is significantly stronger than the generator, the learning rate and/or the number of training iterations of the discriminator will be controlled based on the generator loss, and vice versa. Such feedback will introduce a self-regulating property to GAN training.
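The feedback idea can be sketched as a small controller that damps the learning rate of whichever network is currently ahead. This is a minimal sketch under assumptions, not the project's algorithm: "stronger" is taken to mean "has a lower loss than its opponent", and the damping law, the `gain` parameter and the clipping bounds are illustrative choices.

```python
def damped_learning_rate(base_lr, own_loss, opponent_loss,
                         gain=1.0, min_scale=0.1, max_scale=1.0):
    """Reduce a network's learning rate when it outpaces its opponent.

    The imbalance is the amount by which the opponent's loss exceeds this
    network's own loss (assumed here to indicate that this network is too
    strong). No damping is applied when the network is behind.
    """
    imbalance = max(opponent_loss - own_loss, 0.0)
    scale = 1.0 / (1.0 + gain * imbalance)
    scale = min(max(scale, min_scale), max_scale)  # keep the scale in a safe range
    return base_lr * scale


# Discriminator far ahead of the generator: its learning rate is reduced.
d_lr = damped_learning_rate(1e-3, own_loss=0.5, opponent_loss=3.5)

# "Vice versa": the same rule applied to the generator, which is behind here,
# leaves its learning rate untouched.
g_lr = damped_learning_rate(1e-3, own_loss=3.5, opponent_loss=0.5)
print(d_lr, g_lr)
```

In a training loop, the returned values would be written into the optimizers before each step. The same signal could instead control how many update steps each network takes per iteration.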

Goals/benefits:

Address a wide-spread problem in GAN training
Improve deep learning skills
Opportunity to publish a scientific paper

Prerequisites:

Knowledge of deep learning and GANs
Coding in Python using PyTorch and/or TensorFlow
Interest in solving practical problems

Contact: Radhakrishna Achanta [email protected]


Semester Projects – Autumn 2018


Laboratory: Swiss Data Science Center

Type: Semester Project

Description:

In personalized oncology, understanding biological cell interactions from tumor images is still an active field of research. Today, the analysis of histological sections (i.e., the anatomy of cells and tissues) requires expert pathologists to define and understand complex patterns in the cells. Typically, the analysis consists of studying rough numerical changes in the number of positive signals per cell, which leaves many other important aspects of tissue structure alterations hidden or very difficult to quantify.

The goal of this project is to study explainable and interpretable machine learning models for personalized oncology. In particular, we plan to analyze tumor histology and study the clinical response of patients in order to properly adapt personalized treatments. Some important properties that we will consider are cell distribution, cell localization (size of clusters, polarity/anisotropy of cells within clusters), biomarker localization within cells, as well as connections between several images of the same tumor (corresponding to different markers or different cross-sections). The machine learning algorithm should shed light on which of the above properties influence the clinical outcome. Different interpretable tree-based machine learning algorithms, such as gradient boosting, will be studied.

Goals/Benefits:

Opportunity to be involved in a high-impact, interdisciplinary project.
Experience in working with real medical data.

Prerequisites:

Good knowledge of machine learning.
Good knowledge of Python.
Willingness to take initiative and try advanced machine learning techniques for structured data.

Contact: Dorina Thanou [email protected]

Laboratory: Swiss Data Science Center

Type: Semester Project

Description:

The tremendous advancement in machine learning algorithms over the last few decades has accelerated the adoption of neural networks and deep learning architectures in many applications such as image classification, natural language processing, and human action recognition [1]. These recent methods have led to impressive performance that even comes close to that of humans on certain recognition or classification tasks. However, successful training of deep networks requires many thousands of annotated training samples.

In this project, the goal is to segment melanoma cancer images of mice that have been exposed to immunotherapy, and in particular to detect cells when very little training data is available. The project consists of two parts: (a) increasing the label set, and (b) segmenting the images with state-of-the-art convolutional neural networks.
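The project description does not say how the label set would be increased. One common option, assumed here purely for illustration, is geometric data augmentation: every annotated image and its segmentation mask are flipped and rotated together, so that each labelled example yields eight consistent training pairs. The sketch below uses plain nested lists in place of real image arrays.

```python
def rot90(grid):
    """Rotate a 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def hflip(grid):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in grid]


def augment(image, mask):
    """Yield the 8 flip/rotation variants of an (image, mask) pair.

    The same transformation is applied to the image and to its mask, so the
    labels stay aligned with the pixels they annotate.
    """
    for flipped in (False, True):
        img, msk = (hflip(image), hflip(mask)) if flipped else (image, mask)
        for _ in range(4):
            yield img, msk
            img, msk = rot90(img), rot90(msk)


# Toy 2x2 "image" whose pixel with value 2 is the only labelled cell.
image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
variants = list(augment(image, mask))
print(len(variants))
```

Whether flips and rotations are appropriate depends on the imaging setup, and other strategies such as semi-automatic annotation would fit the same goal.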

Goals/Benefits:

Opportunity to be involved in an interdisciplinary project.
Experience in working with real medical data.

Prerequisites:

Familiarity with deep learning frameworks.
Good knowledge of Python.
Good knowledge of image processing.

References:

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015

[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv:1505.04597, 2015.

Contact: Dorina Thanou [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The class activation maps (CAM) obtained by global average pooling [1] can be used to spatially localize objects of a class in an image [2]. In this project, faces will be localized in images using this technique. To do so, images will first be classified as containing a face or not (a binary classification problem) using deep learning. Then, using the class activation maps of the last convolutional layer, faces will be localized. The interesting challenge is to do so irrespective of the scale and orientation of the face(s). An existing database can be used for training, or a new one created if needed.
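For reference, the class activation map of [2] is the weighted sum of the last convolutional feature maps, M_c(x, y) = sum_k w_k^c f_k(x, y), where w_k^c is the weight connecting the globally average-pooled channel k to class c. The following is a minimal sketch of that computation on plain nested lists. The toy feature maps and weights, and the localization step that thresholds the map at a fraction of its peak, are assumptions made for the example.

```python
def class_activation_map(feature_maps, class_weights):
    """M_c(x, y) = sum_k w_k^c * f_k(x, y) over the K feature map channels."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def bounding_box(cam, threshold_ratio=0.5):
    """Box (x_min, y_min, x_max, y_max) around cells above a fraction of the peak.

    Assumes the peak activation is positive.
    """
    peak = max(max(row) for row in cam)
    hot = [
        (x, y)
        for y, row in enumerate(cam)
        for x, value in enumerate(row)
        if value >= threshold_ratio * peak
    ]
    xs, ys = [x for x, _ in hot], [y for _, y in hot]
    return min(xs), min(ys), max(xs), max(ys)


# Two toy 3x3 feature maps and hypothetical weights for the "face" class.
feature_maps = [
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],  # channel that fires on the face region
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],  # channel that fires on background
]
face_weights = [2.0, -1.0]

cam = class_activation_map(feature_maps, face_weights)
print(bounding_box(cam))
```

In practice the map has the spatial resolution of the last convolutional layer and is upsampled to the input image size before the box is extracted.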

Goals/benefits:

Solving a real-world problem using deep learning
Improve deep learning skills
Opportunity to publish a scientific paper

Prerequisites:

Knowledge of deep learning
Coding in Python using PyTorch and/or TensorFlow
Interest in solving practical problems

References:

[1] M. Lin, Q. Chen, and S. Yan. Network in network. ICLR, 2013.
[2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR, 2016.

Contact: Radhakrishna Achanta [email protected]


Master Projects Industry – 2018


Internship Number: 21828

Type: Master Internship, Master project (diploma) Length 4-6 months

Hiring time: Period 2: July – February (2018-2019)

Company information
Company (top): Bühler AG

Address: EPFL Innovation Park, Bldg. I

City: 1015 Lausanne, Switzerland

Work description
Description and objectives:

Bühler AG (www.buhlergroup.com), based in Uzwil, Switzerland, is a technology company building plants, equipment and related services for processing food and manufacturing advanced materials. The organisation holds leading market positions in processes that transform different raw materials, such as grains, cocoa, and coffee, into flour, animal feeds, pasta, and chocolate. In addition, Bühler AG also produces die casting equipment and functional surface solutions. Bühler AG participates in numerous innovation initiatives; we are currently collaborating with the Swiss Data Science Center and are a Diamond Sponsor of the Mass Challenge.

With all of the IoT data arising from our processing equipment, there is an opportunity for data science. Bühler AG has an innovation satellite at the EPFL Innovation Park where we are establishing a data science hub. We currently have four full-time data scientists and several interns. During the internship, you will be working under a full-time data scientist in a junior role. You will have an opportunity to work with time-series data from an industrial process and with unsupervised machine learning techniques, the end goal being the deployment of an anomaly detection algorithm to a real-time dashboard for our customers.
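As a rough illustration of unsupervised anomaly detection on a time series, the sketch below flags readings that deviate strongly from the preceding window using a rolling z-score. It is one simple baseline chosen for the example, not the method used at Bühler AG. The window size, the threshold and the sample readings are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return the indices whose value deviates strongly from the preceding window."""
    history = deque(maxlen=window)  # the most recent `window` readings
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            # Skip flat windows, where the z-score is undefined.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical sensor readings with a single spike.
readings = [10, 11, 10, 9, 10, 11, 30, 10, 9, 11]
print(rolling_zscore_anomalies(readings))
```

Because the function only looks at past readings, it can run incrementally on a live stream, which suits a real-time dashboard. A production detector would also need to handle seasonality, sensor drift and the effect of past anomalies on the window statistics.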

Required skills:

We seek one innovative future data scientist, e.g. with a background in mathematics, statistics, physics, engineering or computer science, to develop this data analytics solution within an internship of 6 months. We believe that the key skills would be:
* Enthusiasm for analysing data
* Good command of Python and its machine learning libraries
* An open communication style to interact with process experts and clients
* Coursework in machine learning is an advantage (particularly sequential/time-series)

Languages: English (Intermediate)

Location: EPFL Innovation Park

Monthly salary: 2500

Masters & Domains of activity
Related masters: Mathematical sciences, Physics, Computational science and engineering (Modeling, Algorithms, and HPC), Electrical & Electronic Engineering, Mechanical Engineering, Materials science and engineering, Microengineering, Computer science, Communication systems, Data science

[email protected]