Swiss Data Science Center

This page lists the Swiss Data Science Center projects available to EPFL students. The SDSC is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from it. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in select domains, with offices in Lausanne and Zurich. datascience.ch



Projects – Spring 2020

It may be possible to convert a thesis project into a semester project or extend a semester project to be suitable for a thesis project. If any of the present or past projects interests you, please feel free to contact us. We are always looking forward to meeting motivated and talented students who want to work on exciting projects.

 

Laboratory: Swiss Data Science Center

 

Type: Master Thesis Project

 

Project summary:

 

A common approach to describing and representing time series is to use a stochastic generative model such as a Linear Dynamical System (LDS), where the dynamics of the observed sequence are assumed to be governed by the evolution of some latent variable. Such models have been successfully applied to several problems in computer vision, such as synthesis and classification of dynamic textures, action recognition, and segmentation. Capturing the dynamics of complex time series such as video sequences using linear models is particularly challenging due to the presence of several nuisance factors not necessarily related to the system’s dynamics. For some applications, it is therefore of practical importance to consider non-linear dimensionality reduction and system identification techniques. In this project, we address fundamental problems related to time series modeling such as clustering, classification, and (non-linear) system identification. Our approach will exploit convolutional neural networks such as auto-encoders as a non-linear dimensionality reduction mechanism to extract latent representations that can be modeled within the LDS formalism.
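As a rough illustration of the LDS part of this idea, the sketch below fits a linear transition matrix to a sequence of latent codes by least squares. It is only a minimal sketch under stated assumptions: the latent codes are simulated here rather than produced by an auto-encoder, and the function names `simulate_latent` and `fit_transition` are hypothetical, not part of any project code.

```python
import numpy as np

def simulate_latent(A, T, noise_std=0.01, seed=0):
    """Simulate latent states z[t+1] = A z[t] + noise (stand-in for encoder outputs)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    z = rng.standard_normal(n)
    Z = np.empty((T, n))
    for t in range(T):
        Z[t] = z
        z = A @ z + noise_std * rng.standard_normal(n)
    return Z

def fit_transition(Z):
    """Least-squares estimate of A from consecutive latent states.

    In row form the model reads Z[1:] ~= Z[:-1] @ A.T, so we solve for A.T.
    """
    A_T, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
    return A_T.T

if __name__ == "__main__":
    theta = 0.3
    # Assumed ground truth: a slowly decaying rotation in a 2-D latent space.
    A_true = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
    Z = simulate_latent(A_true, T=500)
    print(fit_transition(Z))
```

In the setting described above, `Z` would instead hold the codes produced by a trained encoder, one row per time step.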

 

Goals/benefits:

 

Work on a real and challenging problem of time-series modeling for applications such as activity recognition, anomaly detection, or forecasting.

Develop a toolbox for non-linear system identification based on CNNs.

Opportunity to publish a research article.

 

Prerequisites:

 

Knowledge of signal processing and machine learning. Familiarity with time-series modeling and convolutional neural networks is a plus.

Coding in Python using PyTorch.

 

References:

 

[1] L Zappella, B Béjar, G Hager, and R Vidal. Surgical gesture classification from video and kinematic data. Medical Image Analysis, 17(7):732–745, 2013.

[2] B Afsari and R Vidal. Distances on Spaces of High-Dimensional Linear Stochastic Processes: A Survey. In Geometric Theory of Information, Signals and Communication Technology, pp. 219–242, Springer-Verlag, 2014.

 

Contact : Guillaume Obozinski, [email protected]

Laboratory: Swiss Data Science Center

 

Type: Master Thesis Project

 

Project summary:

 

Determining the wiring pattern of neurons is a necessary step towards understanding how information is encoded and processed in the brain. When attempting to infer the brain’s connectivity with modern techniques, one has to deal with noisy neural activity recordings of low temporal resolution, such as those coming from calcium imaging. This makes it difficult to estimate the firing instants of individual neurons, especially when firing events are closely spaced. In this project, we aim to develop high-resolution methods for inferring neural firing activity from low-resolution activity recordings obtained via calcium imaging.
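To make the inference problem concrete, the sketch below uses a first-order autoregressive model of the calcium trace, c[t] = gamma * c[t-1] + s[t], which is a common simplification and an assumption here rather than the project's method. It inverts the model naively and clips negative values. The decay `gamma` and the function names are hypothetical, and a practical method would have to cope with noise far better than this.

```python
import numpy as np

def simulate_calcium(spikes, gamma=0.95, noise_std=0.0, seed=0):
    """Generate a fluorescence-like trace from a spike train (assumed AR(1) decay)."""
    rng = np.random.default_rng(seed)
    c = np.zeros(len(spikes))
    prev = 0.0
    for t, s in enumerate(spikes):
        prev = gamma * prev + s   # exponential decay plus new spike
        c[t] = prev
    return c + noise_std * rng.standard_normal(len(spikes))

def naive_deconvolve(y, gamma=0.95):
    """Invert the AR(1) model sample by sample and clip negatives to zero."""
    s = np.empty_like(y)
    s[0] = y[0]
    s[1:] = y[1:] - gamma * y[:-1]
    return np.clip(s, 0.0, None)

if __name__ == "__main__":
    spikes = np.zeros(50)
    spikes[[5, 6, 30]] = 1.0      # two closely spaced events and one isolated event
    trace = simulate_calcium(spikes)
    print(np.nonzero(naive_deconvolve(trace) > 0.5)[0])
```

With noise added, the naive inverse amplifies it at every sample, which is one reason more careful, sparsity-aware estimators are of interest.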

 

Goals/benefits:

 

Work on a real problem that is at the forefront of neuroscience research

Develop a toolbox for spike inference

Opportunity to publish a research article

 

Prerequisites:

 

Knowledge of signal processing and machine learning. Familiarity with sparse signal recovery methods is a plus.

Coding in Python using PyTorch.

Interest in solving practical problems.

 

References:

 

[1] C Stosiek, O Garaschuk, K Holthoff, and A Konnerth. In vivo two-photon calcium imaging of neuronal networks. Proceedings of the National Academy of Sciences, 100(12):7319–7324, 2003.

[2] JT Vogelstein, AM Packer, and TA Machado. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology, 104(6):3691–3704, 2010.

[3] B Béjar Haro and M Vetterli. Sampling continuous-time sparse signals: A frequency-domain perspective. IEEE Transactions on Signal Processing, 66(6):1410–1424, March 2018.

 

Contact : Guillaume Obozinski, [email protected]

Laboratories: Swiss Data Science Center, Dana-Farber Cancer Institute & Harvard T.H. Chan School of Public Health

Type: Master Thesis Project

Description: With the decreasing cost of DNA sequencing, germline testing for inherited genetic susceptibility has become widely used. Multi-gene panels now routinely test 25 to 125 genes. Evidence on the types of cancer associated with susceptibility genes, and on the magnitude of increased risk from these mutations, is emerging rapidly. However, no comprehensive databases exist for both clinicians and researchers to access this information. Furthermore, the number of cancer-gene associations is large, information is dispersed over a vast number of published studies, the quality of the studies is uneven, and the data presented are seldom directly applicable to precision prevention decisions, which require absolute risk. We created a proof-of-principle clinical decision support tool, ask2me (https://ask2me.org/), which gives clinicians access to patient-specific, actionable, absolute risk estimates. This project would perform research to bring this proof-of-principle work to fruition by developing workflow pipelines and a web app. The project would involve the development of the informatics infrastructure needed to perform large-scale annotation, sharing, integration, and analysis of public domain data from published studies, as well as the development of web apps allowing clinicians and researchers to access the information. We expect that this work will have a significant clinical and scientific impact by supporting personalized prevention decisions for individuals who test positive for genetic mutations. This work will also have a positive impact on the fields of genetics and epidemiology by supporting the interpretation of results and prioritization of new studies.

Goals/benefits:

– Improve on an existing clinical decision support tool (tool is already used clinically and has the potential to become more widely used)

– Advance research on an interdisciplinary problem

– Possibility of publishing a research paper, if desired

– Work closely with a team of researchers across multiple institutions with diverse expertise (clinicians, statisticians, epidemiologists, data scientists)

Prerequisites:

– R and Python

– Web frameworks such as Django or Flask

– Statistics knowledge is beneficial but not required

Deliverables:

– Well-documented, clean code for a web app

– Written report and oral presentation

References:

Braun, Danielle, et al. “A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations.” Journal of genetic counseling 27.5 (2018): 1187-1199.

Bao, Yujia, et al. “Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes.” arXiv preprint arXiv:1904.12617 (2019).

Contact:

Danielle Braun ([email protected]), Christine Choirat ([email protected])

Laboratories: Boston University School of Public Health, Swiss Data Science Center

Type: Master Thesis Project

Description: Establishing a clear link between ambient air pollution in one location (or country) and ambient air pollution in neighboring locations (or countries) is challenging because of long-range pollution transport: air pollution moves through time and space and can affect ambient pollution and health at distant locations through complex physical, chemical, and atmospheric processes. Modern data science and statistical methods have historically been underused in assessing the local impact of neighboring air pollution, in part because they cannot accommodate long-range pollution transport, which has typically been modeled with complex physical-chemical models that are deterministic and computationally expensive. This project would involve integration with a recently developed set of computationally scalable tools that re-purpose an air parcel trajectory modeling technique from atmospheric science (called HYSPLIT) to model population exposure to ambient air pollution in neighboring locations (countries). We simulate a ‘massive’ number of air mass trajectories arriving at a given location via HYSPLIT and evaluate the time each trajectory spends, and the distance it travels, while crossing a neighboring location. Then, we integrate these measures with modern statistical methods to evaluate the impact of air pollution in neighboring locations/countries. We expect a substantial advance in the use of data science to address various consequences of ambient air pollution, offering a new quantitative perspective to improve upon the field’s historical reliance on deterministic physical/chemical air quality model outputs.
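The residence-time measure described above can be illustrated with a toy computation. The sketch below is an assumption-laden simplification: a trajectory is a list of (longitude, latitude) points sampled at a fixed time step, and the neighboring location is approximated by a rectangular bounding box. The names `BoundingBox` and `hours_inside` are hypothetical and do not reflect HYSPLIT's output format or the project's tools, which would more plausibly use real region geometries (for example via PostGIS).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    """Rectangular stand-in for a neighboring region (illustrative only)."""
    lon_min: float
    lon_max: float
    lat_min: float
    lat_max: float

    def contains(self, lon: float, lat: float) -> bool:
        return (self.lon_min <= lon <= self.lon_max
                and self.lat_min <= lat <= self.lat_max)

def hours_inside(trajectory, box, step_hours=1.0):
    """Time a trajectory spends inside the box, counting one step per sampled point."""
    return step_hours * sum(1 for lon, lat in trajectory if box.contains(lon, lat))

if __name__ == "__main__":
    # Made-up hourly positions of one air mass, for illustration only.
    trajectory = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
    region = BoundingBox(lon_min=0.5, lon_max=2.5, lat_min=0.5, lat_max=2.5)
    print(hours_inside(trajectory, region))
```

Summing this quantity over many simulated trajectories gives one simple per-region exposure measure that could then feed into a statistical model.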

Goals/benefits:

Advance research on an air pollution problem

Opportunity to publish a research paper in a high-impact journal

Work closely with a team of researchers across multiple institutions with diverse expertise (statisticians, epidemiologists, data scientists)

Prerequisites:

PostGIS

R (not strongly required)

Statistics knowledge is beneficial but not required

Deliverables:

Well-documented, clean outputs (e.g., Excel files)

References:

Kim, C., Daniels, M. J., Hogan, J. W., Choirat, C., Zigler, C. M. Bayesian Methods for Multiple Mediators: Relating Principal Stratification and Causal Mediation in the Analysis of Power Plant Emission Controls, awarded ASA 2017 Biometrics section travel award, Annals of Applied Statistics, In press.

Kim, C., Zigler, C. M., Daniels, M. J., Choirat, C., Roy, J. A. Bayesian Longitudinal Causal Inference in the Analysis of the Public Health Impact of Pollutant Emissions. arXiv preprint arXiv:1901.00908.

Contact:

Chanmin Kim ([email protected])

Christine Choirat ([email protected])