The following projects are available for Master and Bachelor students. They are performed in close collaboration with an experienced member of the lab. Apply for a project by sending an email to the contact mentioned for the project.
You may also suggest new projects, ideally close enough to our ongoing, or previously completed projects. In that case, you will have to convince Anne-Marie Kermarrec that it is worthwhile, of reasonable scope, and that someone in the lab can mentor you!
Scalable and Fast Visualizations of the Behaviour of Decentralized ML Algorithms
The implementation of decentralized machine learning algorithms adds a spatial dimension where the performance of individual nodes varies. Good interactive visualizations, implemented with libraries such as D3.js, can help quickly identify local behaviour in specific examples that could be missed by theoretical approaches or statistical analysis of execution traces. The goal of this project is to design and implement tools to visualize the behaviour of dynamic large scale networks . The tools therefore need to support a large number of nodes and edges, filtering of information, multiple concurrent visualizations of the same data, and ideally interactive modification.
The project consists in designing and implementing one or multiple new visualizations, or improving the scalability of existing ones, with documentation explaining the design and how to extend/improve it.
Trust vs Obfuscate: Comparing the Scalability Potential of Privacy-Preserving Summary Aggregators
As sensing devices become widespread, privacy concerns also increase. Not knowing who can get access to private user data and with which intent can negatively seal the fate of computer systems in terms of adoption and adherence to laws.
Private and secure distributed computing is either done by data manipulation that prevents individual identification or by relying on trusted entities to perform computations. Trusted entities can be one’s own device, data centers or other peers. Those, in turn, may employ cryptographic primitives that mathematically ensure data integrity and confidentiality assuming bug-free implementations and deployments.
Federated Learning is a model that relies on no central entity that stores raw user data. Each edge node locally updates a shared model based on its own private data and produces a summary of the changes that these data induce to the model. A collection of such summaries are then aggregated to produce the next version of the shared model.
However effective in eliminating the centralized aggregator, the summaries may reveal users’ personal traits or preferences. To prevent that, this project aims at evaluating cryptographic approaches that would guarantee that summaries aggregation happen in a secure and private environment.
Trusted execution environments (such as Intel SGX) provide cryptographic primitives built in hardware that provide secure enclaves for processing data, but are limited in terms of memory usage. Homomorphic encryption solutions, on the other hand, perform computation on ciphertexts and produce encrypted results, but they involve computationally complex operations.
The goal of this project is to evaluate these approaches, considering their distinct threat models and practical limitations.
The project consists in designing and implementing a federated machine learning summary aggregator both with Intel SGX using its software development kit and another with homomorphic encryption libraries (e.g., https://github.com/shaih/HElib) and evaluate how they perform in terms of scalability.
Experimental benchmark of Large-Scale Hash-based kNN Graph Construction Algorithms
Contact: Anne-Marie Kermarrec.
k-Nearest-Neighbour (kNN) graphs play a crucial role in a large range of Big Data and AI applications, from recommenders to manifold learning. Constructing a full kNN graph remains, however, a computing-intensive task, for which many approximate heuristics have been proposed. The minHash algorithm coupled with Locality Sensitive Hashing is one such particularly successful method for user x item datasets using Jaccard’s coefficient as similarity measure. Several improvements have been proposed to the minHash algorithm in recent years, yet most of these proposals have only been analysed formally, often under assumptions that are not necessarily met in practice, or have only been used in the context of the (related but different) kNN search problem.
The goal of this internship is to experimentally benchmark these existing methods on concrete large-scale datasets, with the objective to chart the design space of optimized user x item hashing techniques for the construction of kNN graphs, and if time permits propose new approaches informed from experiment-driven insights.
Aumüller, Martin, Christiani Tobias, Pagh Rasmus, Vesterli Michael. “PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors”. ESA 2019: 10:1-10:16
Christiani, Tobias, Rasmus Pagh, and Johan Sivertsen. “Scalable and robust set similarity join.” 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
Dahlgaard, Søren, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. “Fast similarity sketching.” 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017
Guerraoui, Rachid, Kermarrec, Anne-Marie, Ruas Olivier, and Taïani François “Smaller, Faster & Lighter KNN Graph Constructions.” WWW 2020: 1060-1070
Ping, Li, and König, Arnd Christian. “Theory and applications of b-bit minwise hashing.” Communications of the ACM 54.8 (2011): 101-109.
Stanojevic, Rade, Mohamed Nabeel, and Ting Yu. “Distributed cardinality estimation of set operations with differential privacy.” 2017 IEEE Symposium on Privacy-Aware Computing (PAC). IEEE, 2017.