This course is over. Here is a list of the cool projects developed by the students in this course. All project ideas were proposed by students.
Projects with a public repo:
- Algorithmic Trading Strategies
by Jelena Antić, Paulina Grnarova, Filip Hrisafov (Team Leader), Vidor Kanalas, Miloš Stojanović, and Alexios Voulimeneas
(https://github.com/bigdata-trading/algo-trading-strategies; final report; presentation)
- Assisted Music Promotion Tool
by Hajar Abbadi, Amine Benabdeljalil, Hind Benbihi, Mikael Castellani, Arthur Giroux (Team Leader), Valentin Matter, Dana Naous, and Abdessalam Ouaazki
(https://github.com/arthurgiroux/assisted-promotion-tool; final report; presentation)
- Bitcoin pricing prediction and trading simulation through time series and sentiment analysis
by Jonathan Cheseaux (Team Leader), Ilia Kebets, Fabien Schmitt, Igor Vokatch, and Marzell Camenzind
(https://github.com/cheseaux/BitcoinTradingSystem; final report; http://wiki.epfl.ch/bitcoin/)
- CodecWatch: A video encoding benchmark platform
by Guillaume Martres, Axel Angel, and Luca La Spada
- Crosstalk / TweetAggregator
by Kevin Serrano, Gianni Scarnera, Clement Moutet, Timo Babst, Pierre Gouedard, Adrien Ghosn, Joris Beau, Mathieu Demarne (Team Leader), Cedric Bastin, and Lewis Brown
(https://github.com/TweetAggregator/TweetAggregator; final report)
- DevMine – Evaluating developer skills and potential based on their open-source contributions
by Robin Hahling (Team Leader), Kevin Gillieron, Laurent Weingart, Hoai Xuan Luong, Frederik Galle, Daniel Espino Timón, and Clément Nicolas Doucet
- Humanitas – Predicting commodity prices in India/Indonesia through Time Series- and Social Media Analysis
by Gabriel Grill, Ching-Chia Wang, Joseph Boyd, Stefan Mihaila, Duy Nguyen, Anton Ovchinnikov, Alexander Busser, Julien Graisse, and Fabian Brix (Team Leader)
(https://github.com/fab-4-dev/humanitas; final report, EPFL press release)
by Amit Gupta (Team Leader), Anca-Elena Alexandrescu, Renata Khasanova, Nauman Shahid, Marc Bourqui, and Mahsa Taziki
- R.A.I.D.F.S. – Randomized Aggregation Independent Distributed File System, A P2P Distributed File System with an API for Map-Reduce Integration
by Jérémy Gotteland (Team Leader), Valerian Pittet, David Froelicher, Alban Marguet, Sven Reber, and Pascal Cudré
(https://github.com/MGrin/p2p-mapreduce; final report; presentation; manual)
- Time-aware Foursquare Venues Recommender
by Alexandru Ardelean, Julia Chatain, Emma Hesseborn Fagerholm, Ivan Gavrilovic (Team Leader), Bernard Maccari, Matteo Pagliardini, Boris Perovic, Tiziano Signo, and Jakub Swiatkowski
- Submetrics (formerly TV Shows Recommender)
by Claire Musso, Florian Simond, Grigory Rozhdestvenskiy, Khalil Hajji, Nassim Drissi El Kamili, Nils Bouchardon, Simon-Pierre Génot, and Raphaël von Aarburg (Team Leader)
(www.submetrics.org; https://github.com/xEcEz/TVSS; final report; EPFL press release)
- Unmasking Markovian Malware on Twitter
by Maxime Augier and Gowthami Ramasamy
(https://github.com/maugier/cs422; final report)
- Weather Dashboard
by Aubry Cholleton (Team Leader), Jonathan Duss, Anders Hennum, Alexis Kessel, Quentin Mazars-Simon, Cedric Rolland, Orianne Rollier, David Sandoz, and Amato Van Geyt
(https://github.com/weatherTeam/weatherDashboard; final report; presentation)
Projects without a public repo:
- PAST: Processing and Storage of Time series
- Random Trip
This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and prepares students to build such systems as well as to use them effectively to address analytics and data science challenges.
- Map-reduce/Hadoop, GFS/HDFS, Bigtable/HBase; Spark.
- SQL and relational algebra. Expressing advanced problems as queries. Data-parallel programming. Circuit complexity and its interpretation in data-parallel programming. Monad algebra. NESL, DryadLINQ, PigLatin. Data-flow parallelism vs. message passing. The bulk-synchronous parallel programming model: Pregel.
- Data locality. Memory hierarchies. New hardware. Sequential versus random access to secondary storage. Query operators – join, selection, projection, sorting. Join and sorting algorithms.
- Query optimization. Index selection. Physical database design. Database tuning.
- Parallel & distributed databases: Scaling, partitioning, replication, Bloom joins. Massively parallel joins. Theta-joins on map-reduce, handling skew; online map-reduce.
- Concurrency control (CC): transactions. SQL isolation levels. Anomalies. Serializability. 2-phase locking. Optimistic CC. Multiversion CC. Snapshot isolation. Distributed transactions. 2-phase commit.
- Eventual consistency. The CAP theorem. NoSQL systems. NewSQL systems.
- OLAP, data cubes. The data warehousing workflow, ETL. Data mining: Frequent itemsets (the a-priori algorithm), association rules. Clustering. Decision tree construction.
- Basics of big data machine learning.
- Realtime analytics: Data stream processing: DSMS and CEP systems. CQL. Window semantics and window joins. Load shedding. Sampling and approximating aggregates (no joins). Querying histograms. Maintaining histograms of streams. Synopses. Haar wavelets. Incremental and online query processing: incremental view maintenance: materialized views, delta processing; online aggregation – sampling, ripple joins, error bounding.
Required prior knowledge
- A basic course on database systems (e.g. covering parts III, IV, and V of Ramakrishnan and Gehrke on storage and indexing, query processing, and concurrency control).
7 credits: 3 (lectures) + 2 (exercises) + 2 (project). This course is taught in English.
Getting a grade
This course uses in-course grading. Attendance of and active participation in the plenaries is mandatory. Attending the exercises is optional, but the TAs spend a lot of time there, so please ask your questions in the exercise sessions rather than requesting separate appointments. If you cannot attend the project meetings, you have to arrange this with your team.
Generally speaking, you must attend the plenaries, since quizzes, the final, and classroom tasks take place there. If you miss a class and bring a certificate from a doctor showing that you were sick, we will compute your grade as if this day did not exist. Overall, you can obtain a hundred points in the course. If you, say, miss a class with a quiz (2pt) and a classroom task (~2pt), we take the score you obtain in the rest of the course and re-normalize it by multiplying it by 100/(100-2-2).
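As a minimal sketch of the re-normalization described above, the following Python snippet computes the adjusted score. The function name `renormalized_score` and the sample score of 72 points are assumptions made for illustration; only the 100-point total and the factor 100/(100-2-2) come from the policy.

```python
def renormalized_score(points_earned, excused_points, total_points=100.0):
    """Scale a score as if the excused items had never existed.

    points_earned  -- points obtained in the rest of the course
    excused_points -- point values of the items missed with a valid excuse
    total_points   -- maximum obtainable points in the course (100 here)
    """
    remaining = total_points - sum(excused_points)
    if remaining <= 0:
        raise ValueError("excused points must be less than the total")
    # Multiply by total / (total - excused), e.g. 100 / (100 - 2 - 2).
    return points_earned * total_points / remaining


# Hypothetical student: 72 points earned, one quiz (2pt) and one
# classroom task (2pt) missed with a doctor's certificate.
print(renormalized_score(72, [2, 2]))  # 72 * 100 / 96 = 75.0
```

With nothing excused, the factor is 100/100 and the score is unchanged.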
If you miss a class because you present a paper at a research conference, we may treat this like a case of sickness.
However, job interviews and internships are not acceptable reasons for missing classes. You have to schedule these around the course.
The exception is the final exam. If you miss the final because of sickness (the only acceptable reason), you will have to repeat it on another date, possibly orally.
Academic integrity and group work
Quizzes, the final, homework, and, unless stated otherwise, project work are to be done individually. Collaboration on these will be considered cheating.
We thank Microsoft for a Microsoft Azure teaching grant.