Student Projects

The following projects are available for Master and Bachelor students. They are performed in close collaboration with an experienced member of the lab. Apply for a project by sending an email to the contact mentioned for the project.

You may also suggest new projects, ideally close enough to our ongoing, or previously completed projects. In that case, you will have to convince Anne-Marie Kermarrec that it is worthwhile, of reasonable scope, and that someone in the lab can mentor you!

Summary

Experimental benchmark of Large-Scale Hash-based kNN Graph Construction Algorithms

Experimental benchmark of Large-Scale Hash-based kNN Graph Construction Algorithms

Contact: Anne-Marie Kermarrec.

Context

k-Nearest-Neighbour (kNN) graphs play a crucial role in a large range of Big Data and AI applications, from recommenders to manifold learning. Constructing a full kNN graph remains, however, a computing-intensive task, for which many approximate heuristics have been proposed. The minHash algorithm coupled with Locality Sensitive Hashing is one such particularly successful method for user x item datasets using Jaccard’s coefficient as similarity measure. Several improvements have been proposed to the minHash algorithm in recent years, yet most of these proposals have only been analysed formally, often under assumptions that are not necessarily met in practice, or have only been used in the context of the (related but different) kNN search problem.

Objective

The goal of this internship is to experimentally benchmark these existing methods on concrete large-scale datasets, with the objective to chart the design space of optimized user x item hashing techniques for the construction of kNN graphs, and if time permits propose new approaches informed from experiment-driven insights.

References 

Aumüller, Martin,  Christiani Tobias, Pagh Rasmus, Vesterli Michael. “PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors”. ESA 2019: 10:1-10:16

Christiani, Tobias, Rasmus Pagh, and Johan Sivertsen. “Scalable and robust set similarity join.” 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.

Dahlgaard, Søren, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. “Fast similarity sketching.” 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017

Guerraoui, Rachid, Kermarrec, Anne-Marie,  Ruas Olivier, and  Taïani François “Smaller, Faster & Lighter KNN Graph Constructions.” WWW 2020: 1060-1070

Ping, Li, and König, Arnd Christian. “Theory and applications of b-bit minwise hashing.” Communications of the ACM 54.8 (2011): 101-109.

Stanojevic, Rade, Mohamed Nabeel, and Ting Yu. “Distributed cardinality estimation of set operations with differential privacy.” 2017 IEEE Symposium on Privacy-Aware Computing (PAC). IEEE, 2017.