Swiss Data Science Center

Projects – Autumn 2019


These projects are closed for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Precipitation patterns are changing at the regional level, with shifts in the mean amount, in the extreme events we observe, and in other properties of the distribution. We are interested in understanding how the different moments of the probability distribution are changing over time. To do this, we will use unsupervised nonlinear dimension reduction techniques for feature extraction, such as Laplacian eigenmaps. We would also like to predict future values of the Laplacian eigenvectors.

The data are three-dimensional (latitude × longitude × time) samples of precipitation from climate models. For example, one question we are interested in answering is: Will there be more extreme precipitation events in a certain region?
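
As a concrete starting point, the sketch below shows how such an embedding could be computed with off-the-shelf tools. It is illustrative only: the synthetic array stands in for the actual climate-model output, and all parameter values (grid size, number of neighbors, number of components) are placeholders.

```python
# Minimal sketch: Laplacian eigenmaps on gridded precipitation data.
# The synthetic array below is a placeholder for climate-model output.
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
precip = rng.gamma(2.0, 1.0, size=(500, 20, 30))  # (time, lat, lon) placeholder

n_time = precip.shape[0]
X = precip.reshape(n_time, -1)  # one flattened spatial field per time step

# Laplacian eigenmaps: eigenvectors of the graph Laplacian of a
# k-nearest-neighbor graph built between time samples
emb = SpectralEmbedding(n_components=5, affinity="nearest_neighbors",
                        n_neighbors=30)
eigvecs = emb.fit_transform(X)  # (n_time, 5); candidate targets for forecasting
```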

Goals/benefits:

– Working with machine learning techniques and time series analysis

– Working with machine learning libraries in Python (pandas, scikit-learn)

– Working with real-world data

– Advancing research on an interdisciplinary problem

– Possibility to publish a research paper

Prerequisites:

– Linear algebra

– Machine learning (intermediate skills)

– Python (intermediate skills)

– Interested in interdisciplinary applications

References:

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[2] R.R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, 2006

Contact: Eniko Szekely, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center (SDSC, datascience.ch) is a joint national center of EPFL and ETH Zurich, whose mission is to accelerate the adoption of data science and machine learning techniques broadly within the academic disciplines of the ETH Domain and the Swiss academic community at large. Although there is no shortage of tools and methods in data science, the acquisition, access, and management of data remain among the greatest challenges in the transition to the digital age. In order to leverage restricted data (as opposed to open data) in a privacy-conscious manner, the SDSC is developing the Swiss Data Custodian: a novel secure multi-party computation environment that enables collaboration between non-trusting parties, making the best use of the data while preserving the sovereignty of its rightful owners. The Custodian provides privacy, control, trust and transparency by design.

The SDSC is seeking enthusiastic and proactive students with proven experience in one of the following domains:

– Security

– Data privacy

– Decentralized systems

– Large-scale distributed platforms, services and applications

– Multi-party computation

This project involves participating in the design and implementation of the open-source Reference Architecture of the Custodian, as well as working on a use case involving urban mobility data and analytics.

Goals/benefits:

– Practical experience in developing complex large-scale software systems

– Becoming familiar with state-of-the-art security, privacy, and encryption technologies

– Becoming familiar with state-of-the-art access control paradigms

– Working in an interactive and interdisciplinary research environment

Prerequisites:

– Solid experience using Linux

– Beginner-level experience with application containerization and orchestration

– Good programming skills in a high-level programming language such as Go, C, or C++

– Good software engineering skills

Contact: Eric Bouillet, [email protected] & Marc-Edouard Schultheiss, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Unsupervised autoencoders [1,2] are used to reduce the dimensionality of data in the absence of labels. The low-dimensional representation can be used for feature extraction, but also as input to other machine learning algorithms. Here, we will use the low-dimensional representations to estimate kernel similarities in the reduced space and compute the eigenvectors of an associated graph Laplacian [3]. The input to the autoencoder will be sequences of images capturing the spatiotemporal structure of the data.

We will apply this method to the analysis of satellite images of cloud cover (26 years of data). The eigenfunctions of the Laplacian computed in the original high-dimensional space capture physically meaningful patterns intrinsic to the atmosphere, such as the annual cycle, El Niño, or the diurnal cycle [4]. Our goal is to compare the graph Laplacian technique applied directly in the original space with the autoencoder-based method outlined here, and to verify whether the dimension reduction performed by the autoencoder improves the quality of the extracted signals.
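
As an illustration of the pipeline, here is a minimal sketch: a small fully connected autoencoder is trained on synthetic stand-in data, kernel similarities are then estimated in the latent space, and the eigenvectors of the associated graph Laplacian are computed. The architecture, kernel bandwidth, and all sizes are placeholders, not the project's final design.

```python
# Minimal sketch: autoencoder -> latent kernel -> graph Laplacian eigenvectors.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 64)  # placeholder for flattened image sequences

class AE(nn.Module):
    def __init__(self, d_in=64, d_lat=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 32), nn.ReLU(), nn.Linear(32, d_in))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):                      # reconstruction training
    recon, _ = model(X)
    loss = nn.functional.mse_loss(recon, X)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    Z = model.enc(X).numpy()                 # low-dimensional representations

d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / np.median(d2))              # Gaussian kernel in reduced space
L = np.diag(W.sum(1)) - W                    # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)         # first columns = smoothest signals
```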

Goals/benefits:

– Working with machine learning and deep learning libraries in Python (pandas, scikit-learn, PyTorch)

– Becoming familiar with the analysis of time series (power spectra, auto-correlation)

– Working with real-world satellite observations

– Advancing research on an interdisciplinary problem

– Possibility to publish a research paper

Prerequisites:

– Machine learning and deep learning (advanced or intermediate skills)

– Python (advanced skills)

– Interested in interdisciplinary applications

References:

[1] Autoencoders tutorial: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

[2] G.E. Hinton and R.R. Salakhutdinov, “Reducing the dimensionality of data using neural networks”, Science, 2006

[3] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[4] E. Szekely, D. Giannakis, A.J. Majda, “Extraction and predictability of coherent intraseasonal signals in infrared brightness temperature data”, Climate Dynamics, 2016

Contact: Eniko Szekely, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The decreasing cost of molecular profiling has provided the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. In addition to biomedical data, preventive and social medicine practitioners increasingly use environmental data, such as location or pollution.

However, the release and usage of these intrinsically highly sensitive data pose new threats to privacy.

The goal of this project is to design an evaluation framework to systematize the analysis of inference attacks that exploit biomedical data, such as the genome, but also environmental data collected by research institutes and hospitals. In this endeavor, you will make use of probabilistic graphical models or other machine-learning models and test your models with real datasets provided by the IUMSP (Institut Universitaire de Médecine Sociale et Préventive) at CHUV. Time permitting, you will also develop defense mechanisms to reduce the impact of these inference attacks while keeping high levels of utility for medical researchers.
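
To make the notion of an inference attack concrete, below is a minimal, fully synthetic sketch of an attribute-inference attack; a naive Bayes classifier stands in for a richer probabilistic graphical model, and all variables are illustrative placeholders.

```python
# Minimal sketch of an attribute-inference attack on synthetic records.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
hidden = rng.integers(0, 2, n)  # sensitive attribute the adversary infers
# released binary attributes correlate with the hidden one
released = (rng.random((n, 10)) < (0.3 + 0.4 * hidden[:, None])).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(released, hidden, random_state=0)
attacker = BernoulliNB().fit(X_tr, y_tr)      # the adversary's model
print("inference accuracy:", attacker.score(X_te, y_te))  # >0.5 means leakage
```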

Goals/Benefits:

– Becoming familiar with probabilistic/machine-learning models

– Access to real-life health-related datasets

– Gaining experience in fields of growing importance

– Working in an interdisciplinary research environment

Prerequisites:

– Good Python and/or Matlab skills

– Good background in probabilities and machine learning

– Being interested in working in a multidisciplinary environment

Contact: Mathias Humbert, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

In order to enable the reproducibility of scientific studies, the Swiss Data Science Center (SDSC) is developing a flexible and scalable platform called Renga, which automates data provenance recording, maintenance and traceability in the form of a knowledge graph. Because studies performed with Renga may involve sensitive datasets, one of the key challenges is to provide all the aforementioned features while guaranteeing a high level of privacy.

The goal of this project is to evaluate the feasibility and risk of various types of inference attacks against the SDSC’s knowledge graph. In particular, you will investigate whether metadata exposed through Renga can leak sensitive information and, if so, develop countermeasures to mitigate this risk. You will also study how the outputs of machine-learning models can expose membership in the dataset used to train these models, and provide defense mechanisms to reduce the impact of this attack.
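
As a concrete illustration of the membership inference problem, here is a minimal sketch of a simplified, confidence-thresholding variant (the shadow-model attack of Shokri et al., referenced below, is more elaborate); the data, model, and threshold are synthetic placeholders.

```python
# Minimal sketch of a confidence-based membership inference attack.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, y_in = X[:500], y[:500]          # members: the target's training set
X_out = X[500:1000]                    # non-members, drawn from the same data

target = RandomForestClassifier(random_state=0).fit(X_in, y_in)

def confidence(model, X):
    # the attacker only observes the model's confidence in its top prediction
    return model.predict_proba(X).max(axis=1)

thr = 0.9                              # members tend to receive higher confidence
guess_in = confidence(target, X_in) > thr
guess_out = confidence(target, X_out) > thr
acc = (guess_in.sum() + (~guess_out).sum()) / 1000
print("membership inference accuracy:", acc)
```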

Goals/Benefits:

– Acquiring knowledge on machine learning and privacy

– Practical experience with a real-world data science platform

– Gaining experience in fields of growing importance

Prerequisites:

– Good background in machine learning and/or security and privacy

– Good programming skills

More information:

Renku platform: https://datascience.ch/renku-platform/

Shokri et al., Membership Inference Attacks Against Machine Learning Models, IEEE S&P’17, https://arxiv.org/pdf/1610.05820.pdf

Contact: Mathias Humbert, [email protected]

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The goal of this project is to perform supervised classification with a very large number of image classes, on the order of 10’000 to 20’000. Such classification cannot be accomplished using conventional deep networks, which typically classify up to 1000 classes. In order to achieve high scalability, the idea is to generate a short descriptor for each input image using a contrastive or triplet loss. Simultaneously, a descriptor serving as a cluster center is learned for each of the classes. The final classification is achieved by finding the cluster center closest to the descriptor of an input image. A large labeled database will be provided for this project.
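
One possible instantiation of this idea is sketched below: short descriptors and per-class center descriptors are learned jointly with a temperature-scaled similarity objective (a simple contrastive formulation), and classification is by nearest center. The sizes, architecture, and loss details are illustrative placeholders, heavily scaled down from the actual problem.

```python
# Minimal sketch: joint learning of descriptors and class centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, d_desc = 100, 64                  # scaled down from 10'000-20'000
images = torch.randn(512, 3 * 32 * 32)       # placeholder flattened images
labels = torch.randint(0, n_classes, (512,))

encoder = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Linear(256, d_desc))
centers = nn.Parameter(torch.randn(n_classes, d_desc))  # learned cluster centers
opt = torch.optim.Adam(list(encoder.parameters()) + [centers], lr=1e-3)

for step in range(100):
    z = F.normalize(encoder(images), dim=1)
    c = F.normalize(centers, dim=1)
    # pull each descriptor towards its class center, push away from the others
    logits = z @ c.t() / 0.1                 # temperature-scaled cosine similarity
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                        # classify by the closest center
    z = F.normalize(encoder(images), dim=1)
    pred = (z @ F.normalize(centers, dim=1).t()).argmax(1)
```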

Goals/benefits:

– Create and train a network, such as DenseNet, for 32×32 and 64×64 images.

– Scale up progressively from CIFAR-100 to ILSVRC-1000 to 20’000 classes.

Prerequisites:

– Coding in Python

– Knowledge of deep learning and PyTorch/TensorFlow libraries

– Interest in solving real-world problems

Deliverables:

– Well-documented, clean code

– Written report and oral presentation

Contact: Dorina Thanou, [email protected], Radhakrishna Achanta, [email protected] & Sofiane Sarni, [email protected]

Laboratory: Swiss Data Science Center

Description:

The data considered in this project are very high-resolution hyperspectral images obtained by analytical transmission electron microscopy in the laboratory of Cécile Hébert. The spectra belong to two families of signals: X-ray signals (EDXS) and electron energy-loss signals (EELS).

The sample is a rectangular slice of material composed of different phases, each phase having a unique composition which can be characterized via its spectrum.

The data consist of hyperspectral images obtained in the horizontal plane below the slice by sending a vertical electron beam through the material. The signal obtained at each pixel can be thought of as a very noisy and quantized version of the average spectrum of the material along the vertical axis at that location in the horizontal plane. The main data analysis problem consists in estimating as precisely as possible the different phases of the material and their corresponding spectra, in order, among other aims, to estimate the rare elements present in each of the phases.

A natural family of unsupervised machine learning approaches for identifying the different phases and their spectral signatures is that of matrix factorization models, which are extensions of principal component analysis, independent component analysis, non-negative matrix factorization and the like.

The objective of this project is to work on a particular dictionary learning formulation of the problem with structural constraints and regularizations (a simplex constraint and a Laplacian regularization), and to produce an efficient implementation using block-proximal methods. The idea is to cast the problem so that the dictionary elements are the ideal spectra corresponding to each phase and the decomposition coefficients are exactly the proportions of each phase present at each pixel of the sample.
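
The sketch below illustrates this formulation on synthetic data: alternating projected gradient steps (a simple instance of a block-proximal scheme) update the proportions under the simplex constraint and the spectra under nonnegativity. The Laplacian regularization and any tailored noise model are omitted for brevity; all sizes and step choices are placeholders, not the project's final algorithm.

```python
# Minimal sketch: dictionary learning with a per-pixel simplex constraint,
# solved by alternating projected (proximal) gradient steps. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_channels, n_phases = 500, 200, 3
D_true = rng.random((n_phases, n_channels))          # ideal phase spectra
A_true = rng.dirichlet(np.ones(n_phases), n_pixels)  # true phase proportions
Y = A_true @ D_true + 0.01 * rng.standard_normal((n_pixels, n_channels))

def project_simplex(a):
    # Euclidean projection of each row onto the probability simplex
    u = np.sort(a, axis=1)[:, ::-1]
    css = np.cumsum(u, axis=1) - 1
    k = np.arange(1, a.shape[1] + 1)
    rho = (u - css / k > 0).sum(axis=1)
    tau = css[np.arange(len(a)), rho - 1] / rho
    return np.maximum(a - tau[:, None], 0)

D = rng.random((n_phases, n_channels))               # spectra (dictionary)
A = np.full((n_pixels, n_phases), 1.0 / n_phases)    # proportions (coefficients)
for it in range(200):
    # block 1: proportions -- gradient step + simplex projection (prox)
    step_A = 1.0 / np.linalg.norm(D @ D.T, 2)
    A = project_simplex(A - step_A * (A @ D - Y) @ D.T)
    # block 2: spectra -- gradient step + nonnegativity projection
    step_D = 1.0 / np.linalg.norm(A.T @ A, 2)
    D = np.maximum(D - step_D * A.T @ (A @ D - Y), 0)
```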

The challenges that lie ahead, beyond the design of an efficient algorithm, are: to choose or design a loss function that correctly models the noise in the physical system; to separate residues or particles that do not belong to any of the main phases; and to automatically select the right number of phases.

Several extensions of the problem are possible. Among others, while the EDXS spectrum of a given pixel is the linear superposition of the spectra of the individual elements, this is no longer true for EELS. In particular, the EELS spectrum is formed as a convolution of the different spectra, which can perhaps be leveraged to build a more complete model.

Goals/Benefits:

– Experience in using machine learning techniques to model data in physics

– Learning how to incorporate expert knowledge in specialized ML formulations

– Gaining proficiency in the hands-on use of optimization algorithms

Prerequisites:

– Machine learning course at the master level

– Optimization algorithms (in particular proximal algorithms)

– Proficiency in Python

Advisors:

The student will be working under the guidance of Guillaume Obozinski, Deputy Chief Data Scientist at the Swiss Data Science Center, Prof. Cécile Hébert, Director of the Electron Spectrometry and Microscopy Laboratory (LSME), and Hui Chen, PhD student at the LSME.

Contact: Guillaume Obozinski, [email protected]

Laboratory: Swiss Data Science Center

Description:

The goal of the project is to use machine learning and signal processing techniques to extract and understand patterns in temperature data. These patterns will be further used to interpolate temperatures at a given location using the information contained in the neighborhood graph. The graph is to be built either in the geographical space or in the data space, e.g., relying on additional information such as altitude. The data are three-dimensional (latitude × longitude × time) samples of temperature (and possibly moisture) from climate models and real observations.
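
As an illustration of the kind of graph-based interpolation envisaged, the sketch below performs harmonic interpolation on a k-nearest-neighbor graph built in an augmented (latitude, longitude, altitude) space. The data and every parameter are synthetic placeholders.

```python
# Minimal sketch: harmonic interpolation of temperature on a kNN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
n = 300
lat, lon = rng.uniform(45, 48, n), rng.uniform(5, 11, n)
alt = rng.uniform(0, 3, n)                        # altitude in km
temp = 20 - 6.5 * alt + rng.normal(0, 0.5, n)     # toy lapse-rate signal

X = np.c_[lat, lon, alt]                          # graph built in the data space
W = kneighbors_graph(X, n_neighbors=8, mode="connectivity").toarray()
W = np.maximum(W, W.T)                            # symmetrize adjacency
L = np.diag(W.sum(1)) - W                         # graph Laplacian

obs = rng.random(n) < 0.7                         # 70% observed, rest unknown
t = temp.copy()
# harmonic interpolation: unknowns solve L_uu t_u = -L_uo t_o
t[~obs] = np.linalg.solve(L[np.ix_(~obs, ~obs)], -L[np.ix_(~obs, obs)] @ temp[obs])
print("mean absolute error:", np.abs(t[~obs] - temp[~obs]).mean())
```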

Goals/Benefits:

– Working with machine learning and signal processing techniques

– Working with machine learning libraries in Python (pandas, scikit-learn)

– Working with real-world data

– Advancing research on an interdisciplinary problem

– Possibility to publish a research paper

Prerequisites:

– Linear algebra

– Machine learning (intermediate skills)

– Python (intermediate skills)

– Interested in interdisciplinary applications

References:

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[2] R.R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, 2006

Contact: Eniko Szekely, [email protected] and Dorina Thanou, [email protected]

Laboratory: Swiss Data Science Center

Description:

The goal of this project is to perform a series of experiments related to the “mixup” idea for data augmentation, in order to contribute to a potential research publication. The idea of mixup was first presented in the following paper:

“mixup: Beyond Empirical Risk Minimization”, Zhang et al., ICLR 2018
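
For reference, the core of mixup is only a few lines: each training batch is replaced by random convex combinations of input pairs, and the loss becomes the matching combination of the two labels’ losses. The sketch below follows the paper’s formulation; the function name and the mixing parameter alpha = 0.2 are illustrative choices.

```python
# Minimal sketch of mixup (Zhang et al., ICLR 2018).
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Return mixed inputs, the two label sets, and the mixing weight lam."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

# inside a training loop, the loss becomes a convex combination:
#   x_mix, y_a, y_b, lam = mixup_batch(x, y)
#   out = model(x_mix)
#   loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)
```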

Goals/Benefits:

– Knowledge of deep learning and data augmentation

– Being a co-author in a research publication

Prerequisites:

– Coding in Python using PyTorch (preferred)

– Interested in research

Contact: Guillaume Obozinski, [email protected], Mathias Humbert, [email protected] and Radhakrishna Achanta, [email protected]

Laboratory: Swiss Data Science Center

Description:

The goal of this project is to detect perspective lines and rectangles in images of buildings, and to use these detections to potentially improve facade parsing. The project will proceed in two steps. In the first step, all experiments will be done on artificial images for the detection of perspective lines and rectangles. In the second step, the method will be applied to real images.
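
For the first step, even a classical baseline can detect candidate line segments on artificial images, as in the sketch below (OpenCV-based; the synthetic quadrilateral and all parameters are illustrative, and a learned detector would eventually replace or complement this).

```python
# Minimal sketch: detect line segments of a synthetic perspective rectangle.
import numpy as np
import cv2

img = np.zeros((256, 256), dtype=np.uint8)
# a rectangle seen under perspective becomes a general quadrilateral
quad = np.array([[60, 50], [200, 70], [190, 210], [50, 180]], dtype=np.int32)
cv2.polylines(img, [quad], True, 255, 2)

edges = cv2.Canny(img, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                        minLineLength=40, maxLineGap=5)
print(f"detected {0 if lines is None else len(lines)} line segments")
```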

Goals/Benefits:

– Knowledge of deep learning for semantic segmentation and region proposal

– Help civil engineers analyse building structure automatically

Prerequisites:

– Coding in Python using PyTorch (preferred) and/or TensorFlow

– Interested in solving practical problems

Contact: Guillaume Obozinski, [email protected] and Radhakrishna Achanta, [email protected]

Laboratory: Swiss Data Science Center

Description:

In the aftermath of an earthquake, buildings suffer damage such as cracks, depending on the severity and type of the event. The goal of this project is to detect these cracks and rank them according to their degree of severity. In order to improve the results, we may add a facade parsing task to the crack detection task.
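
As a starting point, the sketch below defines a deliberately tiny U-Net-style network for pixel-wise crack segmentation. It is a scaled-down illustration only; a full U-Net or Mask R-CNN, as in the references below, would be the actual candidates.

```python
# Minimal sketch: a tiny U-Net-style network for crack segmentation.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = block(3, 16)
        self.pool = nn.MaxPool2d(2)
        self.mid = block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.head = nn.Sequential(block(32, 16), nn.Conv2d(16, 1, 1))
    def forward(self, x):
        d = self.down(x)                        # kept for the skip connection
        m = self.mid(self.pool(d))
        u = self.up(m)
        return self.head(torch.cat([u, d], 1))  # per-pixel crack logits

net = TinyUNet()
x = torch.randn(2, 3, 128, 128)                 # placeholder facade crops
mask_logits = net(x)                            # shape (2, 1, 128, 128)
# train with nn.BCEWithLogitsLoss() against binary crack masks
```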

Goals/Benefits:

– Prepare data and labels (if needed, since some labeled data is already available)

– Train a deep network to detect cracks

Prerequisites:

– Coding in Python

– Knowledge of deep learning and PyTorch/TensorFlow libraries

– Interest in solving real-world problems

Deliverables:

– Well-documented, clean code

– Written report and oral presentation

References:

– “U-Net: Convolutional Networks for Biomedical Image Segmentation”, Ronneberger, Fischer, Brox (2015)

– “Mask R-CNN”, He et al. (2017)

Contact: Radhakrishna Achanta, [email protected]


Laboratories: Swiss Data Science Center, Dana-Farber Cancer Institute & Harvard T.H. Chan School of Public Health

Type: Master Thesis Project

Description: With the decreasing cost of DNA sequencing, germline testing for inherited genetic susceptibility has become widely used. Multi-gene panels now routinely test 25 to 125 genes. Evidence on the types of cancer associated with susceptibility genes, and on the magnitude of the increased risk from these mutations, is emerging rapidly. However, no comprehensive databases exist for both clinicians and researchers to access this information. Furthermore, the number of cancer-gene associations is large, the information is dispersed over a vast number of published studies, the quality of the studies is uneven, and the data presented are seldom directly applicable to precision prevention decisions, which require absolute risk. We created a proof-of-principle clinical decision support tool, ask2me (https://ask2me.org/), which gives clinicians access to patient-specific, actionable, absolute risk estimates. This project would bring this proof-of-principle work to fruition by developing workflow pipelines and a web app. It would involve the development of the informatics infrastructure needed to perform large-scale annotation, sharing, integration and analysis of public-domain data from published studies, as well as the development of web apps allowing clinicians and researchers to access the information. We expect that this work will have a significant clinical and scientific impact by supporting personalized prevention decisions for individuals who test positive for genetic mutations. It will also have a positive impact on the fields of genetics and epidemiology by supporting the interpretation of results and the prioritization of new studies.
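
To give a flavor of the web-app component, here is a minimal sketch of a risk-serving endpoint in Flask; the route, payload fields, and the single hard-coded entry are hypothetical placeholders and do not reflect the actual ask2me schema or data.

```python
# Minimal sketch of a risk-estimate endpoint (Flask; all names hypothetical).
from flask import Flask, jsonify, request

app = Flask(__name__)

# placeholder: in practice this would be backed by curated, annotated
# estimates extracted from the published literature
RISK_TABLE = {("GENE_X", "cancer_y"): 0.42}  # hypothetical entry

@app.route("/risk", methods=["POST"])
def risk():
    q = request.get_json()
    key = (q["gene"], q["cancer"])
    if key not in RISK_TABLE:
        return jsonify(error="no curated estimate available"), 404
    return jsonify(gene=q["gene"], cancer=q["cancer"],
                   absolute_risk=RISK_TABLE[key])

if __name__ == "__main__":
    app.run(debug=True)
```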

Goals/benefits:

– Improve on an existing clinical decision support tool (tool is already used clinically and has the potential to become more widely used)

– Advance research on an interdisciplinary problem

– Possibility of publishing a research paper, if desired

– Work closely with a team of researchers across multiple institutions with diverse expertise (clinicians, statisticians, epidemiologists, data scientists)

Prerequisites:

– R and Python

– Web frameworks such as Django or Flask

– Statistics knowledge is beneficial but not required

Deliverables:

– Well-documented, clean code for a web app

– Written report and oral presentation

References:

Braun, Danielle, et al., “A clinical decision support tool to predict cancer risk for commonly tested cancer-related germline mutations”, Journal of Genetic Counseling 27.5 (2018): 1187-1199.

Bao, Yujia, et al. “Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes.” arXiv preprint arXiv:1904.12617 (2019).

Contact:

Danielle Braun ([email protected]), Christine Choirat ([email protected])


Laboratories: Boston University School of Public Health, Swiss Data Science Center

Type: Master Thesis Project

Description: Establishing a clear link between ambient air pollution in a certain location (or country) and ambient air pollution in neighboring locations (or countries) is challenging due to long-range pollution transport: air pollution moves through time and space and can potentially affect ambient pollution and health at distant locations through complex physical, chemical, and atmospheric processes. Modern data science and statistical methods have historically been underused in the local impact assessment of neighboring air pollution, in part due to their inability to accommodate long-range pollution transport without relying on complex physical-chemical models, which are deterministic and computationally expensive. This project involves working with a recently developed set of computationally scalable tools that re-purpose an air-parcel trajectory modeling technique from atmospheric science (called HYSPLIT) to model population exposure to ambient air pollution from neighboring locations (countries). We simulate “massive” numbers of air mass trajectories arriving at a given location via HYSPLIT and evaluate the times/distances that each air mass trajectory spends/travels when it passes over a neighboring location. Then, we integrate these measures with modern statistical methods to evaluate the impact of air pollution from neighboring locations/countries. We expect a substantial advance in the use of data science to address the various consequences of ambient air pollution, offering a new quantitative perspective that improves upon the field’s historical reliance on deterministic physical/chemical air-quality model outputs.
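
To illustrate the trajectory post-processing step, the sketch below counts the hours one synthetic back-trajectory spends over a toy polygon; a real analysis would use actual HYSPLIT output and real country boundaries (e.g., via PostGIS or geopandas), and all values here are placeholders.

```python
# Minimal sketch: time a trajectory spends over a neighboring region.
import numpy as np
from shapely.geometry import Point, Polygon

rng = np.random.default_rng(0)
# toy back-trajectory: 72 hourly (lon, lat) positions drifting westwards
lons = 8.0 - 0.2 * np.arange(72) + rng.normal(0, 0.05, 72)
lats = 47.0 + rng.normal(0, 0.1, 72)

neighbor = Polygon([(2, 46), (6, 46), (6, 49), (2, 49)])  # toy "country"

hours_inside = sum(neighbor.contains(Point(lon, lat))
                   for lon, lat in zip(lons, lats))
print(f"trajectory spends {hours_inside} of 72 hours over the neighbor")
```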

Goals/benefits:

– Advancing research on an air pollution problem

– Publishing a research paper in a high-impact journal

– Working closely with a team of researchers across multiple institutions with diverse expertise (statisticians, epidemiologists, data scientists)

Prerequisites:

– PostGIS

– R (not strictly required)

– Statistics knowledge is beneficial but not required

Deliverables:

– Well-documented, clean outputs (e.g., Excel files)

References:

Kim, C., Daniels, M. J., Hogan, J. W., Choirat, C., Zigler, C. M., “Bayesian Methods for Multiple Mediators: Relating Principal Stratification and Causal Mediation in the Analysis of Power Plant Emission Controls”, Annals of Applied Statistics, in press (ASA 2017 Biometrics Section travel award).

Kim, C., Zigler, C. M., Daniels, M. J., Choirat, C., Roy, J. A., “Bayesian Longitudinal Causal Inference in the Analysis of the Public Health Impact of Pollutant Emissions”, arXiv preprint arXiv:1901.00908.

Contact:

Chanmin Kim ([email protected])

Christine Choirat ([email protected])