Swiss Data Science Center

This page lists the Swiss Data Science Center projects available to EPFL students. The SDSC is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from it. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in select domains, with offices in Lausanne and Zurich.
Visit our website



Projects – Autumn 2020

It may be possible to convert a thesis project into a semester project or extend a semester project to be suitable for a thesis project. If any of the present or past projects interests you, please feel free to contact us. We are always looking forward to meeting motivated and talented students who want to work on exciting projects.

Laboratory:
Swiss Data Science Center

Type:
Master Thesis/Semester Project

Description:
[ML domain of interest] Causality and interpretability for machine learning 

[Method of interest – generating counterfactuals] For machine learning to be deployed in practice, especially in sensitive domains (such as medicine or engineering), we must build models that we can trust. One aspect that increases the trustworthiness of an ML model is its interpretability. We can provide interpretability through ‘explanations’ of the model that correctly capture how (and why) the model makes its predictions, instead of simply reflecting correlations. In this project we will explore the idea of using counterfactuals for this purpose.

The interpretability community recently started using “counterfactual explanations” to explore how input variables can be modified to change a model’s output without making explicit causal assumptions [1].

For example, given an image classified as ‘dog’, we will try to artificially change that image, as little as possible, so that the predicted label flips to ‘cat’. The concepts/factors that lead to this change in prediction can be considered explanations.

Recently, several methods have been proposed [1, 2, 4; for examples see also 3] to generate counterfactuals by leveraging generative models such as conditional Variational Autoencoders or Generative Adversarial Networks. The goal of this master/semester project is to implement and compare these approaches and evaluate their suitability for applications with real-world unstructured datasets.
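To make the idea concrete, below is a minimal sketch (in PyTorch, which is a plus for this project) of a gradient-based counterfactual search in the spirit of [1]. It assumes a trained, differentiable classifier; the function name and hyperparameters are illustrative, and the generative-model approaches of [2, 4] would instead perturb a learned latent code rather than raw inputs.

```python
import torch
import torch.nn.functional as F

def counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.01):
    """Search for x' close to x that `model` classifies as `target_class`.

    model: trained classifier returning logits; x: input with batch dimension.
    """
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x_cf)
        # Trade off flipping the predicted label against staying close to x.
        loss = F.cross_entropy(logits, target) + lam * (x_cf - x).abs().sum()
        loss.backward()
        opt.step()
    return x_cf.detach()
```

The difference between the counterfactual and the original input then highlights which features drove the label flip (e.g., from ‘dog’ to ‘cat’).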

[Application of interest]

The project will initially work with an attributed dataset [5].

Goals/Benefits:

  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Deep learning 
  • Python (intermediate skills); PyTorch is a plus
  • Interest in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

References:

  1. Wachter, Sandra, Brent Mittelstadt, and Chris Russell. “Counterfactual explanations without opening the black box: Automated decisions and the GDPR.” Harv. JL & Tech. 31 (2017): 841.
  2. Goyal, Yash, Uri Shalit, and Been Kim. “Explaining Classifiers with Causal Concept Effect (CaCE).” arXiv preprint arXiv:1907.07165 (2019).
  3. Chapter 6, https://christophm.github.io/interpretable-ml-book/counterfactual.html
  4. Denton, Emily, et al. “Detecting bias with generative counterfactual face attribute augmentation.” arXiv preprint arXiv:1906.06439 (2019).
  5. Yang, Mengjiao, and Been Kim. “Benchmarking Attribution Methods with Relative Feature Importance.” arXiv preprint arXiv:1907.09701 (2019).

Contact:
[email protected]

Laboratory:
Swiss Data Science Center

Type:
Master Thesis/Semester Project

Description:
[ML domain of interest] Causality and interpretability for deep models

[Method of interest – causal feature importance] 

Complex models such as deep neural networks are difficult, and in most cases impossible, to interpret. In response, recent works propose different approaches to address this shortcoming; however, these are mostly based on empirical results and require modifications/adaptations of the models to make them more interpretable.

In [1] the authors propose CXPlain, a causally inspired method for determining feature importance that is decoupled from the model of interest. This is done by training a separate model with a causal objective, whose goal is to learn to estimate to what degree certain inputs cause the outputs of another machine-learning model. Additionally, a resampling procedure provides uncertainty estimates for the selected features. This is a promising direction for the trustworthiness of deep models and for their adoption in sensitive domains.
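As a rough illustration of that causal objective, the sketch below (a toy version, not the authors’ implementation) obtains importance targets by masking each feature and measuring the resulting increase in the fixed target model’s error; CXPlain computes these per sample, whereas for brevity this version averages over a dataset.

```python
import numpy as np

def masking_importances(predict, X, y, loss):
    """predict: fixed target model; X: (n, d) inputs; y: targets;
    loss: callable mapping (y_true, y_pred) to a scalar error."""
    base = loss(y, predict(X))
    scores = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        X_masked = X.copy()
        X_masked[:, i] = X[:, i].mean()      # mask feature i with its mean
        scores[i] = loss(y, predict(X_masked)) - base
    scores = np.clip(scores, 0.0, None)      # only error increases count
    return scores / (scores.sum() + 1e-12)   # normalized importance targets
```

A supervised explanation model (e.g., a small neural network) is then fit to map inputs to such per-feature targets, and bootstrap resampling of its training set yields the uncertainty estimates mentioned above.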

[Application of interest]

The end goal of the project is to help select causal features for the application of (deep) models in fluid dynamics, without impacting the predictive performance of the model itself.

References:

  1. Schwab, Patrick, and Walter Karlen. “CXPlain: Causal explanations for model interpretation under uncertainty.” Advances in Neural Information Processing Systems. 2019.

Contact:
[email protected]

Description:
The Swiss Data Science Center has developed a smartphone app to collect GPS data. This app is mainly used for educational and research purposes, letting students build their own projects on real-time raw data flowing from the app straight to our servers.
The motivation for this work is to understand the whole pipeline of data processing – from the sensing device to the server and eventually to a dashboard – while taking into account the limits and constraints posed by privacy regulations. Applications on raw mobility data are wide and varied, and they require advanced data processing techniques to be understandable and useful for the end user (e.g., travelers, transport planners, data scientists).
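As a concrete taste of that pipeline, here is a minimal sketch of one typical early processing step (cf. [1]): computing distances between consecutive GPS fixes and splitting the stream into trips at long time gaps. The column names are assumptions about the app’s data format.

```python
import numpy as np
import pandas as pd

def segment_trips(df, gap_minutes=15):
    """df: one row per GPS fix, with columns
    ['user_id', 'timestamp' (datetime), 'lat', 'lon']."""
    df = df.sort_values(['user_id', 'timestamp']).copy()
    lat, lon = np.radians(df['lat']), np.radians(df['lon'])
    dlat, dlon = lat.diff(), lon.diff()
    # Haversine distance (metres) between consecutive fixes.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat.shift()) * np.cos(lat) * np.sin(dlon / 2) ** 2
    df['dist_m'] = 2 * 6371000 * np.arcsin(np.sqrt(a))
    # Start a new trip after a long time gap or when the user changes.
    gap = df['timestamp'].diff() > pd.Timedelta(minutes=gap_minutes)
    new_user = df['user_id'] != df['user_id'].shift()
    df['trip_id'] = (gap | new_user).cumsum()
    return df
```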

Goals/benefits:

  • Working with machine learning libraries in Python
  • Understand constraints of privacy in data science
  • Research-oriented exploration of mobility trajectories/patterns
  • Learn to work and communicate with other experts (developers, privacy experts,…)
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interest in interdisciplinary applications
  • Interest in privacy

Deliverables:

  • Well-documented code
  • Written report and oral presentation

References:

  1. Schüssler, Nadine, and Kay W. Axhausen. “Processing GPS raw data without additional information.” ETH Zurich, January 2009. https://doi.org/10.3929/ethz-a-005652342
  2. Bucher, Dominik, Francesca Mangili, Francesca Cellina, Claudio Bonesana, David Jonietz, and Martin Raubal. “From location tracking to personalized eco-feedback: A framework for geographic information collection, processing and visualization to promote sustainable mobility behaviors.” Travel Behaviour and Society, January 2019. https://doi.org/10.1016/j.tbs.2018.09.005
  3. You, Linlin, Fang Zhao, Lynette Cheah, Kyungsoo Jeong, Pericles Christopher Zegras, and Moshe Ben-Akiva. “A Generic Future Mobility Sensing System for Travel Data Collection, Management, Fusion, and Visualization.” IEEE Transactions on Intelligent Transportation Systems, 2019, 1–12. https://doi.org/10.1109/TITS.2019.2938828

Contact: 
Marc-Edouard Schultheiss: [email protected]

Laboratory:
Swiss Data Science Center

Type:
Master Project

Description:
Variational autoencoders [1,2] are unsupervised deep learning techniques that learn latent representations of the input data that are of lower dimensionality, thus capturing the most relevant features in the data. The latent low-dimensional representations can then be used directly for understanding, but also as input to other machine learning algorithms, such as clustering or extreme event detection. Here, we will mostly focus on understanding the latent representations using a climate dataset. The goal will be to disentangle the representations, such that each feature captures one driver of the climate dynamics, e.g., the climate change trend, the seasonal cycle, the daily cycle, etc. Disentangling these signals is not always easy, and one way of quantifying this will be by looking at the frequency spectrum of the signals. We will work with daily measurements of temperature, either from a climate model or from observations.
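For illustration, below is a minimal VAE sketch [1] in PyTorch, assuming windows of daily temperature values as flat input vectors; all dimensions and names are placeholders, and weighting the KL term with beta > 1 is one common way (beta-VAE) to encourage disentangled latents.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=365, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    rec = ((x - x_hat) ** 2).sum()                           # reconstruction
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()  # KL to N(0, I)
    return rec + beta * kl
```

The frequency content of each latent trajectory can then be inspected (e.g., with scipy.signal.periodogram) to check whether a feature isolates the seasonal or daily cycle.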

Goals/benefits:

  • Working with machine learning and deep learning libraries in Python (pandas, scikit-learn, PyTorch)
  • Becoming familiar with the analysis of time series (power spectra)
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Machine learning and deep learning (advanced or intermediate skills)
  • Python (advanced skills)
  • Interest in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

References:

  1. D. Kingma, M. Welling, “Auto-encoding variational Bayes”, 2013
  2. I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, A. Lerchner, “Towards a definition of disentangled representations”, 2018  

Contact:
Eniko Székely [email protected]
Natasha Tagasovska [email protected]

Description: 
Dynamical systems such as the climate are highly nonlinear, and despite the fact that the observations are high-dimensional, most of the dynamics is captured by a small number of physically meaningful patterns. In this project we will apply unsupervised dimension reduction techniques for feature extraction, and more specifically, the Nonlinear Laplacian Spectral Analysis (NLSA) [1] approach to extract features from potential vorticity at the level of the stratosphere. NLSA uses the information on the past trajectory of the data and thus allows us to extract representative temporal and spatial patterns. We will compare the results with linear techniques such as Principal Component Analysis.
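To fix ideas, the sketch below combines the two core ingredients of NLSA [1] – a time-lagged (delay) embedding followed by a graph-Laplacian embedding [3] – and sets them next to PCA. Here scikit-learn’s SpectralEmbedding stands in for the NLSA eigenfunction computation and the data shapes are placeholders, so this is an illustration rather than a full NLSA implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding

def delay_embed(X, q):
    """Stack q consecutive snapshots: (n, d) -> (n - q + 1, q * d)."""
    return np.hstack([X[i:len(X) - q + 1 + i] for i in range(q)])

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # placeholder for potential vorticity snapshots
Xq = delay_embed(X, q=30)            # encodes past-trajectory information
pca_modes = PCA(n_components=5).fit_transform(Xq)
nlsa_like = SpectralEmbedding(n_components=5, n_neighbors=50).fit_transform(Xq)
```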

Goals/benefits:

  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interested in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

References:

  1. D. Giannakis and A.J. Majda. Nonlinear Laplacian Spectral Analysis: Capturing intermittent and low-frequency spatiotemporal patterns in high-dimensional data. Statistical Analysis and Data Mining, 2012
  2. E. Székely, D. Giannakis, A.J. Majda. Extraction and predictability of coherent intraseasonal signals in infrared brightness temperature data. Climate Dynamics, 2016
  3. M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. NeurIPS, 2001

Contact: 
Eniko Szekely: [email protected]
Raphaël de Fondeville: [email protected]

Laboratories:

  • Swiss Data Science Center
  • Dana-Farber Cancer Institute
  • Harvard T.H. Chan School of Public Health

Type:
Master Thesis Project

Description:
With the decreasing cost of DNA sequencing, germline testing for inherited genetic susceptibility has become widely used. Multi-gene panels now routinely test 25 to 125 genes. Evidence on the types of cancer associated with susceptibility genes, and on the magnitude of the increased risk from these mutations, is emerging rapidly. However, no comprehensive databases exist for clinicians and researchers to access this information. Furthermore, the number of cancer-gene associations is large, information is dispersed over a vast number of published studies, the quality of the studies is uneven, and the data presented are seldom directly applicable to precision prevention decisions, which require absolute risk. We created a proof-of-principle clinical decision support tool, ask2me (https://ask2me.org/), which gives clinicians access to patient-specific, actionable, absolute risk estimates.

This project would bring the proof-of-principle work to fruition by developing workflow pipelines and a web app. It would involve the development of the informatics infrastructure needed to perform large-scale annotation, sharing, integration, and analysis of public-domain data from published studies, as well as the development of web apps allowing clinicians and researchers to access the information. We expect this work to have a significant clinical and scientific impact by supporting personalized prevention decisions for individuals who test positive for genetic mutations. It will also have a positive impact on the fields of genetics and epidemiology by supporting the interpretation of results and the prioritization of new studies.
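For a feel of the web-app side, below is a minimal Flask sketch of the kind of endpoint the project could expose: it returns a risk estimate for a (gene, cancer, age) query. The route, query fields, and lookup_risk helper are hypothetical and are not the existing ask2me implementation.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def lookup_risk(gene, cancer, age):
    """Hypothetical placeholder for the curated risk-estimate database."""
    return {"gene": gene, "cancer": cancer, "age": age, "absolute_risk": None}

@app.route("/risk")
def risk():
    # e.g. GET /risk?gene=BRCA1&cancer=breast&age=40
    gene = request.args.get("gene")
    cancer = request.args.get("cancer")
    age = request.args.get("age", type=int)
    return jsonify(lookup_risk(gene, cancer, age))

if __name__ == "__main__":
    app.run(debug=True)
```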

Goals/benefits:

  • Improve on an existing clinical decision support tool (tool is already used clinically and has the potential to become more widely used)
  • Advance research on an interdisciplinary problem
  • Possibility of publishing a research paper, if desired
  • Work closely with a team of researchers across multiple institutions with diverse expertise (clinicians, statisticians, epidemiologists, data scientists)

Prerequisites:

  • R and Python
  • Web frameworks such as Django or Flask
  • Statistics knowledge can be beneficial but not required

Deliverables:

  • Well-documented, clean code for a web app
  • Written report and oral presentation

References:

  1. Braun, Danielle, et al. “A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations.” Journal of genetic counseling 27.5 (2018): 1187-1199.
  2. Bao, Yujia, et al. “Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes.” arXiv preprint arXiv:1904.12617 (2019).

Contact:
Danielle Braun ([email protected])
Christine Choirat ([email protected])

Laboratories:

  • Boston University School of Public Health
  • Swiss Data Science Center

Type:
Master Thesis Project

Description:
Establishing a clear link between ambient air pollution in one location (or country) and ambient air pollution in neighboring locations (or countries) is challenging due to long-range pollution transport: air pollution moves through time and space, and it can impact ambient pollution and health at distant locations through complex physical, chemical, and atmospheric processes. Modern data science and statistical methods have historically been underused in assessing the local impact of neighboring air pollution, in part because accommodating long-range pollution transport has relied on complex physical/chemical models that are deterministic and computationally expensive.

This project builds on a recently developed set of computationally scalable tools that re-purpose an air-parcel trajectory modeling technique from atmospheric science (called HYSPLIT) to model population exposure to ambient air pollution in neighboring locations (countries). We simulate ‘massive’ numbers of air mass trajectories arriving at a given location via HYSPLIT and evaluate the times/distances that each trajectory spends over and travels across a neighboring location. We then integrate these measures with modern statistical methods to evaluate the impact of air pollution from neighboring locations/countries. We expect a substantial advance in the use of data science to address the various consequences of ambient air pollution, offering a new quantitative perspective that improves upon the field’s historical reliance on deterministic physical/chemical air quality model outputs.
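To illustrate one of the steps above, the sketch below counts how long a simulated trajectory (hourly positions) spends over a neighboring country’s polygon. geopandas/shapely stand in for the equivalent PostGIS operations, and the file names and column names are assumptions.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: country polygons and one HYSPLIT trajectory.
countries = gpd.read_file("countries.shp")
traj = pd.read_csv("trajectory.csv")       # columns: lat, lon (one row per hour)
points = gpd.GeoDataFrame(
    traj,
    geometry=gpd.points_from_xy(traj["lon"], traj["lat"]),
    crs=countries.crs,
)
neighbor = countries[countries["name"] == "Neighborland"]  # hypothetical name
inside = gpd.sjoin(points, neighbor, how="inner")          # points over the polygon
hours_over_neighbor = len(inside)                          # one point per hour
```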

Goals/benefits:

  • Advance research on an air pollution problem
  • Publishing a research paper in a high impact journal
  • Work closely with a team of researchers across multiple institutions with diverse expertise (statisticians, epidemiologists, data scientists)

Prerequisites:

  • PostGIS
  • R (not strongly required)
  • Statistics knowledge can be beneficial but not required

Deliverables:

  • Well-documented, clean outputs (e.g., Excel files)

References:

  1. Kim, C., Daniels, M. J., Hogan, J. W., Choirat, C., Zigler, C. M., “Bayesian Methods for Multiple Mediators: Relating Principal Stratification and Causal Mediation in the Analysis of Power Plant Emission Controls”, awarded ASA 2017 Biometrics section travel award, Annals of Applied Statistics, In press.
  2. Kim, C., Zigler, C. M., Daniels, M. J., Choirat, C., Roy, J. A., “Bayesian Longitudinal Causal Inference in the Analysis of the Public Health Impact of Pollutant Emissions.” arXiv preprint arXiv:1901.00908

Contact:
Chanmin Kim ([email protected])
Christine Choirat ([email protected])