Swiss Data Science Center

This page lists the Swiss Data Science Center (SDSC) projects available to EPFL students. The SDSC is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from them. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in select domains, with offices in Lausanne and Zurich. For more information, see datascience.ch.



Projects – Spring 2020

These projects are closed for applications.

Description:

The Swiss Data Science Center has developed a smartphone app to collect GPS data. This app is mainly used for educational and research purposes, letting students build their own projects on real-time raw data flowing from the app straight to our servers.
The motivation for this work is to understand the whole pipeline of data processing – from the sensing device to the server and eventually to a dashboard – while taking into account the limits and constraints posed by privacy regulations. Applications of raw mobility data are wide and varied, and require advanced data processing techniques to make the data understandable and useful for end users (e.g. travelers, transport planners, data scientists).
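
As a small, self-contained illustration of one step in such a pipeline, the sketch below segments a stream of raw GPS fixes into "stay points" using simple distance and time thresholds. The input layout, thresholds, and the algorithm itself are illustrative assumptions, not the app's actual schema or processing.

```python
# Minimal sketch: detect "stay points" in a stream of raw GPS fixes using
# distance/time thresholds. Input layout and thresholds are illustrative
# assumptions, not the SDSC app's actual schema.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(fixes, dist_thresh_m=200.0, time_thresh_s=20 * 60):
    """fixes: time-sorted list of (timestamp_s, lat, lon) tuples.
    Returns a list of (start_ts, end_ts, mean_lat, mean_lon) stay points."""
    points, i, n = [], 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the window while fixes stay within the distance threshold.
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) < dist_thresh_m:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= time_thresh_s:
            cluster = fixes[i:j]
            points.append((cluster[0][0], cluster[-1][0],
                           sum(p[1] for p in cluster) / len(cluster),
                           sum(p[2] for p in cluster) / len(cluster)))
            i = j  # continue after the detected stay
        else:
            i += 1
    return points
```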

Goals/benefits:
  • Working with machine learning libraries in Python
  • Understand constraints of privacy in data science
  • Research-oriented exploration of mobility trajectories/patterns
  • Learn to work and communicate with other experts (developers, privacy experts,…)
  • Possibility to publish a research paper
Prerequisites:
  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interest in interdisciplinary applications
  • Interest in Privacy
Deliverables:
  • Well-documented code
  • Written report and oral presentation

[1] Schüssler, Nadine; Axhausen, Kay W. Processing GPS raw data without additional information. ETH Zurich, January 2009. https://doi.org/10.3929/ethz-a-005652342
[2] Bucher, Dominik; Mangili, Francesca; Cellina, Francesca; Bonesana, Claudio; Jonietz, David; Raubal, Martin. From location tracking to personalized eco-feedback: A framework for geographic information collection, processing and visualization to promote sustainable mobility behaviors. Travel Behaviour and Society, January 2019. https://doi.org/10.1016/j.tbs.2018.09.005
[3] You, Linlin; Zhao, Fang; Cheah, Lynette; Jeong, Kyungsoo; Zegras, Pericles Christopher; Ben-Akiva, Moshe. A Generic Future Mobility Sensing System for Travel Data Collection, Management, Fusion, and Visualization. IEEE Transactions on Intelligent Transportation Systems, 2019, 1–12. https://doi.org/10.1109/TITS.2019.2938828

Contact: Marc-Edouard Schultheiss: [email protected]

Description: Dynamical systems such as the climate are highly nonlinear, and although the observations are high-dimensional, most of the dynamics is captured by a small number of physically meaningful patterns. In this project we will apply unsupervised dimension reduction techniques for feature extraction, more specifically the Nonlinear Laplacian Spectral Analysis (NLSA) [1] approach, to extract features from potential vorticity at the level of the stratosphere. NLSA uses information on the past trajectory of the data and thus allows us to extract representative temporal and spatial patterns. We will compare the results with linear techniques such as Principal Component Analysis.
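
To make the approach concrete, here is a minimal sketch, on synthetic data, of the core ingredients of NLSA: a time-lagged embedding of the observations followed by Laplacian eigenmaps, with PCA on the same lagged data as the linear baseline. The lag length, neighbourhood size, and toy signal are illustrative assumptions and do not reproduce the full NLSA procedure of [1].

```python
# Minimal sketch of the NLSA idea: time-lagged embedding followed by
# Laplacian eigenmaps, compared against PCA. Synthetic data; the lag q and
# embedding settings are illustrative, not the full NLSA procedure of [1].
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
t = np.linspace(0, 40 * np.pi, 2000)
# High-dimensional observations driven by a low-dimensional oscillation.
modes = rng.normal(size=(2, 50))
X = np.column_stack([np.sin(t), np.cos(3 * t)]) @ modes
X += 0.1 * rng.normal(size=X.shape)

q = 20  # number of lags in the time-lagged (Takens) embedding
Xlag = np.hstack([X[i:len(X) - q + i + 1] for i in range(q)])

# Nonlinear features: Laplacian eigenmaps on the lagged snapshots.
nlsa_like = SpectralEmbedding(n_components=2, n_neighbors=30).fit_transform(Xlag)

# Linear baseline: PCA on the same lagged snapshots.
pca = PCA(n_components=2).fit_transform(Xlag)
print(nlsa_like.shape, pca.shape)  # (1981, 2) (1981, 2)
```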

Goals/benefits:

  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interest in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

[1] D. Giannakis and A.J. Majda. Nonlinear Laplacian Spectral Analysis: Capturing intermittent and low-frequency spatiotemporal patterns in high-dimensional data, Statistical Learning and Data Analysis, 2012

[2] E. Székely, D. Giannakis, A.J. Majda. Extraction and predictability of coherent intraseasonal signals in infrared brightness temperature data, Climate Dynamics, 2016

[3] M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, NeurIPS, 2001

Contact: Eniko Szekely: [email protected] and Raphaël de Fondeville: [email protected]

Description: The goal of this project is to use machine learning and signal processing techniques to extract and understand patterns in temperature data. These patterns will be further used to interpolate temperatures at a given location using the information contained in a neighborhood graph. The graph is to be built either in geographical space or in data space, e.g., relying on additional information such as altitude. The data are three-dimensional (latitude × longitude × time) samples of temperature (and/or precipitation) from climate models and real observations.
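
A minimal sketch of the interpolation idea, on synthetic station data: build a k-nearest-neighbour graph in a feature space combining coordinates and altitude, then recover unobserved temperatures by harmonic interpolation on the graph (in the spirit of the semi-supervised scheme of [2]). All data, thresholds, and graph choices below are illustrative assumptions.

```python
# Minimal sketch: interpolate temperatures at unobserved stations by
# harmonic smoothing on a k-NN graph built in (lat, lon, altitude) space.
# Synthetic data; all graph-construction choices are illustrative.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
n = 300
coords = np.column_stack([rng.uniform(45, 48, n),   # latitude
                          rng.uniform(6, 10, n),    # longitude
                          rng.uniform(0, 3, n)])    # altitude (km)
temp = 15 - 6.5 * coords[:, 2] + 0.5 * rng.normal(size=n)  # toy lapse-rate signal

# Symmetric k-NN adjacency and combinatorial Laplacian L = D - W.
W = kneighbors_graph(coords, n_neighbors=8, mode="connectivity").toarray()
W = np.maximum(W, W.T)
L = np.diag(W.sum(axis=1)) - W

observed = rng.random(n) < 0.7  # 70% of stations observed
o, u = np.where(observed)[0], np.where(~observed)[0]
# Harmonic interpolation: minimize x^T L x with observed values fixed,
# i.e. solve L_uu x_u = -L_uo x_o (assumes each graph component has
# at least one observed node, so L_uu is invertible).
x_u = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, o)] @ temp[o])
print("mean absolute error:", np.abs(x_u - temp[u]).mean())
```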

Goals/benefits:

  • Working with machine learning and signal processing techniques
  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Working with real-world data
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interest in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering, NeurIPS, 2001

[2] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf. Learning with local and global consistency, NeurIPS, 2003

[3] D. Shuman, S. Narang, P. Frossard, A. Ortega, P. Vandergheynst. The emerging field of signal processing on graphs, IEEE Signal Processing Magazine, 2013

Contact: Eniko Szekely: [email protected] and Dorina Thanou: [email protected]

Description: Embedding techniques represent graph data captured by pairwise distances in a low-dimensional space. Here we will explore known embedding techniques such as t-distributed Stochastic Neighbour Embedding (t-SNE) [1] to embed data from climate models, and compare the results with techniques such as Principal Component Analysis (PCA) or Multidimensional Scaling (MDS). The motivation for this work stems from recent research in climate science showing that climate models are not independent, and that weighting schemes giving equal weight to all models are suboptimal for future predictions [2].
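
A minimal sketch of the planned comparison, using a synthetic pairwise-distance matrix in place of real inter-model distances; the toy "model families", perplexity, and other settings are illustrative assumptions.

```python
# Minimal sketch: embed a (synthetic) matrix of pairwise distances between
# climate models with t-SNE and MDS, with PCA on the raw features as the
# linear baseline. Real inter-model distances would replace D.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

rng = np.random.default_rng(0)
# Toy "model outputs": 30 models, each a flattened field of 500 values,
# drawn around 3 cluster centres to mimic families of related models.
centres = rng.normal(size=(3, 500))
X = np.vstack([c + 0.3 * rng.normal(size=(10, 500)) for c in centres])
D = squareform(pdist(X))  # pairwise distance matrix between models

tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=10).fit_transform(D)
mds = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)
pca = PCA(n_components=2).fit_transform(X)  # PCA uses raw features, not D
print(tsne.shape, mds.shape, pca.shape)
```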

Goals/benefits:

  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interest in interdisciplinary applications

Deliverables:

  • Well-documented code
  • Written report and oral presentation

[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 2008.

[2] V. Eyring et al. Taking climate model evaluation to the next level. Nature Climate Change, 2019

Contact: Eniko Szekely: [email protected]

Laboratory: Swiss Data Science Center

Type: Semester project

Description:

The study of sports records has a long history in data science, especially with the use of extreme value theory (e.g. Coles, 2001), a field of statistics describing the behaviour of rare extreme events. The question of the existence of a lower bound on marathon finishing times has been tackled by previous studies (e.g. Blest, 1996), but these results were limited both by the availability of data and by the lack of flexible modelling techniques. We propose to exploit currently available databases combined with recent modelling methodologies (Spearing et al., 2019) in order to (re-)assess the existence of a lower bound on the fastest marathon time, estimate the likelihood of a sub-two-hour marathon in a registered event, and compute the expected waiting time before observing a sub-two-hour marathon in an official race.
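
The project proposes R; purely as an illustration of the idea, the Python sketch below fits a generalized Pareto distribution to the fastest (negated) times and reads off the implied finite lower endpoint when the shape parameter is negative. The synthetic times and threshold choice are assumptions for demonstration only.

```python
# Minimal sketch: estimate a lower endpoint for marathon finishing times by
# fitting a generalized Pareto distribution (GPD) to the fastest times.
# Synthetic data; the true lower bound here is 7000 s by construction.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
# Toy "elite finishing times" in seconds with a true lower bound of 7000 s.
times = 7000 + 1800 * rng.beta(2.0, 5.0, size=5000)

# Negate times so the fastest performances become the upper tail.
z = -times
u = np.quantile(z, 0.95)       # threshold: the 5% fastest times
exceed = z[z > u] - u

# Fit the GPD to the threshold exceedances (location fixed at 0).
xi, _, sigma = genpareto.fit(exceed, floc=0)
if xi < 0:
    lower_bound = -(u - sigma / xi)  # finite endpoint when shape xi < 0
    print(f"xi={xi:.3f}, estimated lower bound: {lower_bound:.0f} s")
else:
    print(f"xi={xi:.3f}: the fit implies no finite lower bound")
```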

Goals/benefits:

– Getting first experience with Extreme Value Theory: learn how to extrapolate beyond the observed levels of intensity.

– Gaining experience with a real data set and learning how to exploit it efficiently.

– Possibility to write a paper.

Prerequisites:

– Basics of statistics and likelihood inference (ideally some notion of point processes).

– Willingness to learn R.

– Some notion of web scraping.

Deliverables:

– Clean and documented code

– Report and blog post (or paper)

– Oral presentation

References:

Blest, D. C. (1996). Lower Bounds for Athletic Performance. Journal of the Royal Statistical Society. Series D (The Statistician), 45(2):243–253.

Coles, S. G. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer, London.

Spearing, H., Tawn, J., Irons, D., Paulden, T., and Bennett, G. (2019). Ranking, and Other Properties, of Elite Swimmers Using Extreme Value Theory. arXiv:1910.10070, pages 1–25.

Contact: Raphaël de Fondeville, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Multiplex networks (formally, multigraphs) have gained popularity for representing complex multidimensional real-world systems. In particular, they generalize the simplest type of network, a graph, to settings where nodes, edges and edge weights represent different types of entities and relationships. Many real complex systems have community structure: the nodes of the corresponding network may be grouped into internally densely connected clusters according to some criterion. Large trade networks are characterized by many interacting dimensions which undergo structural changes over time. We consider an example of a trade network with a tripartite structure. The main goal of the project is to develop community detection methods for clustering a dynamic tripartite trade network, in order to investigate the intrinsic structure of the network and its evolution.
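
As a simplified starting point (not the tripartite dynamic method to be developed), the sketch below flattens a toy multigraph of company/country/product relations into a weighted graph and applies standard modularity-based community detection; all data and choices are illustrative.

```python
# Minimal sketch: collapse a multigraph (multiple product edges between the
# same pair of nodes) into a weighted graph and run modularity-based
# community detection. Toy data; tripartite dynamic methods are the actual
# subject of the project (cf. Murata, 2010).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy multiplex trade data: (node_a, node_b, product_layer) triples.
edges = [("firmA", "CH", "steel"), ("firmA", "CH", "wheat"),
         ("firmB", "CH", "steel"), ("firmB", "DE", "steel"),
         ("firmC", "DE", "wheat"), ("firmC", "DE", "steel"),
         ("firmD", "FR", "wine"), ("firmE", "FR", "wine"),
         ("firmD", "FR", "cheese"), ("firmE", "FR", "cheese")]

M = nx.MultiGraph()
M.add_edges_from((a, b, {"product": p}) for a, b, p in edges)

# Flatten the layers: edge weight = number of parallel product edges.
G = nx.Graph()
for a, b in M.edges():
    w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
    G.add_edge(a, b, weight=w)

communities = greedy_modularity_communities(G, weight="weight")
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")
```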

Goals/benefits:

– Working with machine learning techniques, in particular, community detection techniques

– Working with high-dimensional real-world data

– Possibility to publish a research paper

Prerequisites:

– Knowledge of machine learning

– Python

– Interested in real-world applications

Deliverables:

– Well-documented code

– Written report and oral presentation

References:

Murata, T. (2010, April). Detecting communities from tripartite networks. In Proceedings of the 19th International Conference on World Wide Web (pp. 1159-1160). ACM.

Zhuang, D., Chang, M. J., & Li, M. (2019). DynaMo: Dynamic Community Detection by Incrementally Maximizing Modularity. IEEE Transactions on Knowledge and Data Engineering.

Contact: Ekaterina Krymova, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The stochastic block model (SBM) is a widely used probabilistic model for a random graph with clusters. Inference under the SBM can provide a description of the mechanisms of data generation, the data structure, and the interrelations within clusters. We consider a large multidimensional trade network whose vertices are companies and countries, and whose edges are transactions between a company and a country. Each dimension represents a particular product; that is, there may be multiple edges between each company–country pair, corresponding to different products. In this project we aim to perform inference for a stochastic block model on these complex tensor trade data and to analyse the inner structure of the data. Modelling the complex interrelations between companies, product groups and countries will help to understand fundamental characteristics of the economy and, further, to quantify the consequences of adverse economic shocks.
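
A minimal sketch of SBM recovery on synthetic data: generate a single-layer graph from a planted-partition SBM and recover the blocks with spectral clustering, a standard technique in this literature (see Abbe, 2017). The sizes, probabilities, and single-layer setting are illustrative simplifications of the multilayer trade data.

```python
# Minimal sketch: generate a graph from a stochastic block model and
# recover the planted blocks with spectral clustering. Sizes and edge
# probabilities are illustrative; the project's data are multilayer.
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

sizes = [60, 60, 60]
probs = [[0.25, 0.03, 0.03],
         [0.03, 0.25, 0.03],
         [0.03, 0.03, 0.25]]
G = nx.stochastic_block_model(sizes, probs, seed=0)
A = nx.to_numpy_array(G)  # adjacency matrix used as the affinity

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(A)
truth = np.repeat([0, 1, 2], 60)
# Agreement up to label permutation, via the adjusted Rand index.
print("ARI:", adjusted_rand_score(truth, labels))
```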

Goals/benefits:

– Working with machine learning techniques, in particular, community detection techniques

– Working with high-dimensional real-world data

– Possibility to publish a research paper

Prerequisites:

– Knowledge of machine learning

– Python  

– Interested in real-world applications

Deliverables:

– Well-documented code

– Written report and oral presentation

References:

Abbe, Emmanuel. “Community detection and stochastic block models: recent developments.” The Journal of Machine Learning Research 18.1 (2017): 6446-6531.

De Bacco, C., Power, E. A., Larremore, D. B., & Moore, C. (2017). Community detection, link prediction, and layer interdependence in multilayer networks. Physical Review E, 95(4), 042317.

Contact: Ekaterina Krymova, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

In economics, time series observations often arrive in matrix form. For example, one may observe, every year, a matrix of the values of products exported by a set of companies to different countries. Although it may seem straightforward to vectorize the observed matrices and apply standard time series analysis methods, doing so loses the information hidden in the rows and columns of the original matrices. The goal of this project is to develop a matrix autoregression (MAR) method for modelling trade processes. The conditions on the entries of the MAR matrices, which arise from the properties of the data at hand, suggest the use of regularized techniques for MAR estimation. Further analysis of the structure of the coefficient matrices will help to gain insight into the interrelations between products and countries and their influence on trade activity.
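
To fix ideas, here is a minimal sketch of a first-order MAR model X_t = A X_{t-1} B' + E_t, following Chen et al. (2018), fitted by unregularized alternating least squares on synthetic data; the dimensions, iteration count, and absence of regularization are illustrative simplifications.

```python
# Minimal sketch: estimate a matrix autoregression X_t = A X_{t-1} B' + E_t
# by alternating least squares (cf. Chen et al., 2018). Synthetic data;
# no regularization, unlike the methods the project targets.
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 5, 4, 400
A_true = 0.5 * rng.normal(size=(m, m)) / np.sqrt(m)
B_true = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)
X = [rng.normal(size=(m, n))]
for _ in range(T - 1):
    X.append(A_true @ X[-1] @ B_true.T + 0.1 * rng.normal(size=(m, n)))

A, B = np.eye(m), np.eye(n)
for _ in range(50):  # alternate closed-form updates for A, then B
    M = [x @ B.T for x in X[:-1]]        # given B: X_t ≈ A M_t
    A = (sum(y @ mt.T for y, mt in zip(X[1:], M))
         @ np.linalg.inv(sum(mt @ mt.T for mt in M)))
    N = [(A @ x).T for x in X[:-1]]      # given A: X_t' ≈ B N_t
    B = (sum(y.T @ nt.T for y, nt in zip(X[1:], N))
         @ np.linalg.inv(sum(nt @ nt.T for nt in N)))

# A and B are identified only up to a reciprocal scaling, so compare the
# Kronecker product, which is scale-invariant.
print("error:", np.linalg.norm(np.kron(B, A) - np.kron(B_true, A_true)))
```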

Goals/benefits:

– Working with machine learning techniques, time series analysis

– Working with real-world data

– Possibility to publish a research paper

Prerequisites:

– Knowledge of machine learning

– Python  

– Interested in real-world applications

Deliverables:

– Well-documented code

– Written report and oral presentation

References:

Chen, R., Xiao, H., & Yang, D. (2018). Autoregressive Models for Matrix-Valued Time Series. arXiv preprint arXiv:1812.08916.

Bunch, J. R., & Rose, D. J. (Eds.). (2014). Sparse matrix computations. Academic Press.

Contact: Ekaterina Krymova, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Project summary:

A common approach to describing and representing time series is to use a stochastic generative model such as a Linear Dynamical System (LDS), where the dynamics of the observed sequence are assumed to be governed by the evolution of some latent variable. Such models have been successfully applied to several problems in computer vision, such as the synthesis and classification of dynamic textures, action recognition, and segmentation. Capturing the dynamics of complex time series such as video sequences using linear models is particularly challenging due to the presence of several nuisance factors not necessarily related to the system's dynamics. It is then of practical importance for some applications to consider non-linear dimensionality reduction and system identification techniques. In this project, we address fundamental problems related to time series modeling such as clustering, classification, and (non-linear) system identification. Our approach will exploit convolutional neural networks, in particular auto-encoders, as a non-linear dimensionality reduction mechanism to extract latent representations that can be modeled with an LDS formalism.
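
A minimal sketch of the proposed pipeline on toy data: train a small convolutional auto-encoder for dimensionality reduction, then fit linear latent dynamics z_{t+1} ≈ A z_t by least squares. The architecture, training budget, and single-matrix LDS fit are illustrative assumptions.

```python
# Minimal sketch: convolutional auto-encoder as nonlinear dimensionality
# reduction, followed by a least-squares fit of linear latent dynamics.
# Toy random "video"; sizes and training budget are illustrative.
import torch
import torch.nn as nn

T, C, H, W, d = 64, 1, 32, 32, 8  # frames, channels, height, width, latent dim
video = torch.randn(T, C, H, W)   # toy "video" sequence

encoder = nn.Sequential(
    nn.Conv2d(C, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
    nn.Flatten(), nn.Linear(32 * 8 * 8, d))
decoder = nn.Sequential(
    nn.Linear(d, 32 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (32, 8, 8)),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
    nn.ConvTranspose2d(16, C, 4, stride=2, padding=1))              # 16 -> 32

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)
for _ in range(50):  # reconstruction training loop (shortened for the sketch)
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(video)), video)
    loss.backward()
    opt.step()

# Fit latent linear dynamics z_{t+1} ≈ A z_t by least squares.
with torch.no_grad():
    z = encoder(video)                            # (T, d) latent trajectory
A = torch.linalg.lstsq(z[:-1], z[1:]).solution.T  # z[1:] ≈ (A @ z[:-1].T).T
print("latent transition matrix:", A.shape)
```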

Goals/benefits:

Work on a real and challenging problem of time-series modeling for applications such as activity recognition, anomaly detection, or forecasting.

Develop a toolbox for non-linear system identification based on CNNs.

Opportunity to publish a research article.

Prerequisites:

Knowledge of signal processing and machine learning. Familiarity with time-series modeling and convolutional neural networks is a plus.

Coding in Python using PyTorch.

References:

[1] L Zappella, B Béjar, G Hager, and R Vidal. Surgical gesture classification from video and kinematic data. Medical Image Analysis, 17(7):732 – 745, 2013.

[2] B. Afsari and R. Vidal. Distances on Spaces of High-Dimensional Linear Stochastic Processes: A Survey. In Geometric Theory of Information, Signals and Communication Technology, pp. 219–242, Springer-Verlag, 2014.

Contact: Guillaume Obozinski, [email protected] & Benjamín Béjar Haro, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Project summary:

Determining the wiring pattern of neurons is a necessary step towards understanding how information is encoded and processed in the brain. When attempting to infer the brain’s connectivity with modern techniques, one has to deal with noisy and low temporal resolution neural activity recordings such as those coming from calcium imaging. This makes it difficult to estimate the firing instants of individual neurons, especially when firing events are closely spaced. In this project, we aim to develop high-resolution methods for neural firing activity inference from low-resolution electrical activity recordings obtained via calcium imaging.
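
As a simplified illustration of the deconvolution problem (in the spirit of [2]), the sketch below recovers a nonnegative spike train from a toy fluorescence trace generated by an AR(1) calcium kernel, using plain nonnegative least squares; the decay constant, noise level, and solver are illustrative choices.

```python
# Minimal sketch: nonnegative deconvolution of a calcium trace with an
# AR(1) kernel, in the spirit of Vogelstein et al. (2010) [2]. Toy data;
# decay constant, noise level, and the plain NNLS solver are illustrative.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
T, gamma = 300, 0.95            # trace length, calcium decay per frame
spikes = (rng.random(T) < 0.03).astype(float)

# Calcium kernel h[k] = gamma**k, so y = H @ spikes + noise with H a
# lower-triangular Toeplitz matrix (c_t = gamma * c_{t-1} + s_t).
H = np.tril(gamma ** (np.subtract.outer(np.arange(T), np.arange(T))))
y = H @ spikes + 0.1 * rng.normal(size=T)

s_hat, _ = nnls(H, y)           # nonnegative spike estimate
print("true spikes:", int(spikes.sum()),
      "| estimated spike mass:", round(s_hat.sum(), 2))
```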

Goals/benefits:

Work on a real problem that is at the forefront of neuroscience research

Develop a toolbox for spike inference estimation

Opportunity to publish a research article

Prerequisites:

Knowledge of signal processing and machine learning. Familiarity with sparse signal recovery methods is a plus.

Coding in Python using PyTorch.

Interest in solving practical problems.

References:

[1] C Stosiek, O Garaschuk, K Holthoff, and A Konnerth. In vivo two-photon calcium imaging of neuronal networks. Proceedings of the National Academy of Sciences, 100(12):7319–7324, 2003.

[2] JT Vogelstein, AM Packer, and TA Machado. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology, 104(6):3691–3704, 2010.

[3] B Béjar Haro and M Vetterli. Sampling continuous-time sparse signals: A frequency-domain perspective. IEEE Transactions on Signal Processing, 66(6):1410–1424, March 2018.

Contact: Guillaume Obozinski, [email protected] & Benjamín Béjar Haro, [email protected]

Laboratories: Swiss Data Science Center, Dana-Farber Cancer Institute & Harvard T.H. Chan School of Public Health


Type: Master Thesis Project


Description: With the decreasing cost of DNA sequencing, germline testing for inherited genetic susceptibility has become widely used. Multi-gene panels now routinely test 25 to 125 genes. Evidence on the types of cancer associated with susceptibility genes, and on the magnitude of the increased risk from these mutations, is emerging rapidly. However, no comprehensive databases exist for clinicians and researchers to access this information. Furthermore, the number of cancer–gene associations is large, the information is dispersed over a vast number of published studies, the quality of the studies is uneven, and the data presented are seldom directly applicable to precision prevention decisions, which require absolute risk. We created a proof-of-principle clinical decision support tool, ask2me (https://ask2me.org/), which gives clinicians access to patient-specific, actionable, absolute risk estimates. This project would bring this proof-of-principle work to fruition by developing workflow pipelines and a web app. It would involve the development of the informatics infrastructure needed to perform large-scale annotation, sharing, integration and analysis of public-domain data from published studies, as well as the development of web apps allowing clinicians and researchers to access the information. We expect that this work will have a significant clinical and scientific impact by supporting personalized prevention decisions for individuals who test positive for genetic mutations. It will also have a positive impact on the fields of genetics and epidemiology by supporting the interpretation of results and the prioritization of new studies.
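
To give a flavour of the web-app side of the project, here is a minimal Flask sketch of a risk-lookup endpoint. The route, payload format, lookup table, and the numbers in it are hypothetical illustrations, not the actual ask2me API or its curated estimates.

```python
# Minimal sketch of the kind of web-app endpoint the project involves: a
# Flask route returning a risk estimate for a gene/cancer query. The route,
# lookup table, and values are hypothetical, not the actual ask2me API.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the curated gene-cancer risk database.
RISK_TABLE = {("BRCA1", "breast"): {"relative_risk": 11.4, "source": "demo"}}

@app.route("/risk")
def risk():
    gene = request.args.get("gene", "").upper()
    cancer = request.args.get("cancer", "").lower()
    entry = RISK_TABLE.get((gene, cancer))
    if entry is None:
        return jsonify(error="no curated estimate for this query"), 404
    return jsonify(gene=gene, cancer=cancer, **entry)

if __name__ == "__main__":
    app.run(debug=True)  # e.g. GET /risk?gene=BRCA1&cancer=breast
```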


Goals/benefits:


– Improve on an existing clinical decision support tool (the tool is already used clinically and has the potential to become more widely used)

– Advance research on an interdisciplinary problem

– Possibility of publishing a research paper, if desired

– Work closely with a team of researchers across multiple institutions with diverse expertise (clinicians, statisticians, epidemiologists, data scientists)


Prerequisites:


– R and Python

– Web frameworks such as Django or Flask

– Statistics knowledge is beneficial but not required


Deliverables:


– Well-documented, clean code for a web app

– Written report and oral presentation


References:


Braun, Danielle, et al. “A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations.” Journal of Genetic Counseling 27.5 (2018): 1187-1199.


Bao, Yujia, et al. “Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes.” arXiv preprint arXiv:1904.12617 (2019).


Contact:


Danielle Braun ([email protected]), Christine Choirat ([email protected])

Laboratories: Boston University School of Public Health, Swiss Data Science Center

Type: Master Thesis Project

Description: Establishing a clear link between ambient air pollution in a given location (or country) and ambient air pollution in neighboring locations (or countries) is challenging due to long-range pollution transport: air pollution moves through time and space, and it can affect ambient pollution and health at distant locations through complex physical, chemical, and atmospheric processes. Modern data science and statistical methods have historically been underused in the local impact assessment of neighboring air pollution, in part because accommodating long-range pollution transport has relied on complex physical-chemical models that are deterministic and computationally expensive. This project involves integration with a recently developed set of computationally scalable tools that repurpose an air-parcel trajectory modeling technique from atmospheric science (HYSPLIT) to model population exposure to ambient air pollution in neighboring locations (countries). We simulate massive numbers of air-mass trajectories arriving at a given location via HYSPLIT and evaluate the times and distances that each trajectory spends or covers while crossing a neighboring location. We then integrate these measures with modern statistical methods to evaluate the impact of air pollution from neighboring locations/countries. We expect a substantial advance in the use of data science to address the various consequences of ambient air pollution, offering a new quantitative perspective to improve upon the field's historical reliance on deterministic physical/chemical air quality model outputs.
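
As a toy illustration of the exposure measures described above, the sketch below computes how long a single HYSPLIT-style trajectory spends over a region, using shapely; the polygon, points, and hourly spacing are invented for the example (at scale, this is the kind of operation the PostGIS prerequisite targets).

```python
# Minimal sketch: given HYSPLIT-style trajectory points, compute the time
# an air mass spends inside a (toy) region polygon. All coordinates and the
# hourly spacing are illustrative assumptions.
from shapely.geometry import Point, Polygon

# Toy rectangular region in (lon, lat) coordinates.
region = Polygon([(6.0, 45.8), (10.5, 45.8), (10.5, 47.8), (6.0, 47.8)])

# (hour_index, lon, lat) points along one back-trajectory, hourly spacing.
trajectory = [(0, 8.5, 47.4), (1, 8.0, 47.0), (2, 7.2, 46.5),
              (3, 5.9, 46.1), (4, 4.8, 45.7)]

hours_inside = sum(1 for _, lon, lat in trajectory
                   if region.contains(Point(lon, lat)))
print(f"time spent over the region: {hours_inside} h of {len(trajectory)} h")
```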

Goals/benefits:

Advance research on an air pollution problem

Publish a research paper in a high-impact journal

Work closely with a team of researchers across multiple institutions with diverse expertise (statisticians, epidemiologists, data scientists)

Prerequisites:

PostGIS

R (not strongly required)

Statistics knowledge is beneficial but not required

Deliverables:

Well-documented, clean outputs (e.g., Excel files)

References:

Kim, C., Daniels, M. J., Hogan, J. W., Choirat, C., Zigler, C. M. Bayesian Methods for Multiple Mediators: Relating Principal Stratification and Causal Mediation in the Analysis of Power Plant Emission Controls. Annals of Applied Statistics, in press. (Awarded the ASA 2017 Biometrics Section travel award.)

Kim, C., Zigler, C. M., Daniels, M. J., Choirat, C., Roy, J. A. Bayesian Longitudinal Causal Inference in the Analysis of the Public Health Impact of Pollutant Emissions. arXiv preprint arXiv:1901.00908.

Contact:

Chanmin Kim ([email protected])

Christine Choirat ([email protected])