The EPFL Graph project is an initiative of the CEDE and the CHILI Lab, created with the goal of opening up academic content to the EPFL community. The Vice Presidency for Academic Affairs considers that EPFL Graph is of strategic importance to the institution, and is actively supporting its evolution from a research project into a fully-fledged IT service for the whole EPFL community.
The main product of the EPFL Graph project is the web application Graph Search.
The concept of knowledge graphs (or knowledge networks) has become increasingly popular in the digital technology world, as companies began dealing with an increase in volume and complexity of information. Companies such as LinkedIn, who manage enormous amounts of data, adopted knowledge graphs as a way of representing information that favours “entities over keywords”—that is, representing information as real-world entities and their relationships to one another.
LinkedIn, for example, uses knowledge graphs to represent people, companies, jobs, and skills (what they call entities) and how they relate to one another. They named it the Economic Graph. Other companies such as Google and Facebook have also adopted knowledge graphs as a primary way of organising information, naming them the Knowledge Graph and the Social Graph, respectively.
At EPFL, institutional data is still primarily stored in traditional relational databases. The EPFL Graph project draws inspiration from the use of knowledge graphs by digital technology companies, with the aim of reorganising institutional data as interconnected entities such as courses, lectures, concepts, labs and publications.
With the EPFL Knowledge Graph in place, we can derive insights and recommendations by leveraging graph theory, machine learning and natural language processing algorithms, thus enabling the EPFL community to make better-informed, data-driven decisions.
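To make the idea of "interconnected entities" concrete, here is a toy illustration of institutional data represented as a knowledge graph of (subject, relation, object) triples. The entity and relation names are invented for the example and do not reflect the actual EPFL Graph schema.

```python
# Toy knowledge graph: institutional data as (subject, relation, object)
# triples. All entity and relation names are hypothetical.
triples = [
    ("CS-101", "teaches_concept", "algorithms"),
    ("CS-101", "taught_by", "Prof. Example"),
    ("Paper-42", "mentions_concept", "algorithms"),
    ("Paper-42", "written_by", "Prof. Example"),
]

def neighbours(entity, triples):
    """Return all entities directly connected to `entity`, with the relation."""
    out = []
    for s, r, o in triples:
        if s == entity:
            out.append((r, o))
        elif o == entity:
            out.append((r, s))
    return out

# Everything directly linked to the concept "algorithms":
linked = neighbours("algorithms", triples)
```

Querying the concept "algorithms" immediately surfaces both the course that teaches it and the paper that mentions it, which is exactly the kind of cross-source traversal a relational silo makes difficult.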
Main functionalities of EPFL Graph
Algorithmic data federation. The primary innovation of the project is the algorithmic federation of institutional data. At EPFL, institutional data is hosted and managed in silos by multiple teams in different departments. While this isn’t an anomaly in itself, the problem is that these data sources are not interoperable, and thus cannot be cross-analysed without repetitive and time-consuming manual labour to interconnect multiple datasets. In contrast, the EPFL Graph data pipeline is capable of establishing these connections algorithmically and centralising multiple sources of institutional data into one unified database.
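The core mechanics of this federation step can be sketched as follows: records from separate silos are merged into unified entities whenever they share a normalised key. This is a minimal illustration only; the source names, fields and records are hypothetical, and the real pipeline uses far richer matching logic.

```python
# Minimal sketch of algorithmic federation: merge records from several
# hypothetical sources into one entity per normalised key.

def normalise(key: str) -> str:
    """Lower-case and strip a key so near-duplicates match."""
    return key.strip().lower()

def federate(sources: dict, key_field: str) -> dict:
    """Merge records from several sources into one entity per key."""
    entities = {}
    for source_name, records in sources.items():
        for record in records:
            key = normalise(record[key_field])
            entity = entities.setdefault(key, {key_field: key})
            # Namespace each field by its source to avoid collisions.
            for field, value in record.items():
                if field != key_field:
                    entity[f"{source_name}.{field}"] = value
    return entities

# Two silos describing overlapping people (illustrative data only).
hr = [{"email": "Ada@epfl.ch", "unit": "CEDE"}]
courses = [{"email": "ada@epfl.ch ", "teaches": "CS-101"}]

merged = federate({"hr": hr, "courses": courses}, "email")
# merged["ada@epfl.ch"] now combines fields from both sources.
```

The point of the sketch is that the join happens algorithmically, once, in the pipeline, rather than by hand in a spreadsheet every time two datasets need to be cross-analysed.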
Systematic scaling. In the course of developing the EPFL Graph data pipeline, the procedure for adding new data sources was standardised and documented. As a result, the number of data sources ingested within the scope of EPFL Graph can be scaled up systematically, by following a well-documented set of steps.
Readily available live data. Another problem often encountered by data analysts at EPFL (as in any project involving institutional data) is that data must be collected through cumbersome means, such as email attachments, CSV exports or Excel spreadsheets. Moreover, exported data is by definition static and quickly becomes outdated, so this cumbersome process must be repeated every time updated data is needed. The EPFL Graph data pipeline, in contrast, provides readily available live data that is refreshed every week and can be accessed either directly in the database or through the Graph Search web interface.
Horizontal semantic search. In the EPFL Graph paradigm, data sources are structured in the form of a graph(1). This allows data to be queried and analysed across different sources (i.e., horizontally). In addition, the EPFL Graph data pipeline uses Natural Language Processing (NLP) algorithms to semantically interconnect different sources of text, such as course descriptions, lecture slides and publication abstracts. This allows users to search for people and academic content through concept search. For example, a user can search for “quantum computing” on Graph Search and instantly get a panorama of what is happening at EPFL in that area, such as courses teaching quantum computing, people working on the subject (or other closely related subjects) and publications available on the subject.
(1) Note that a graph, in the context of this project, refers to the mathematical domain of graph theory, which studies the structure and properties of networks.
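The semantic matching behind concept search can be illustrated with a toy example: each text is represented as a vector, and relatedness is measured by cosine similarity. The three-dimensional vectors below are invented for illustration; in practice, NLP models produce high-dimensional embeddings of real course descriptions, slides and abstracts.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings for three pieces of academic content.
documents = {
    "Course: Quantum Computation": [0.9, 0.1, 0.0],
    "Lecture: Linear Algebra":     [0.4, 0.8, 0.1],
    "Paper: Soil Mechanics":       [0.0, 0.1, 0.9],
}

def concept_search(query_vec, docs, top_k=2):
    """Rank documents by semantic similarity to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]),
                    reverse=True)
    return ranked[:top_k]

# A query embedding close to "quantum computing".
results = concept_search([0.85, 0.2, 0.05], documents)
```

Because the ranking is computed over content from every source in the graph at once, a single query like "quantum computing" can surface courses, lectures and publications horizontally, without the user knowing which silo each item came from.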
Recommendations using Machine Learning (ML). In addition to the interconnected data provided by the knowledge graph, the EPFL Graph data pipeline uses machine learning to generate data-driven recommendations to the EPFL community. For example, students studying for a course can get recommendations for video lectures from other courses (which also teach the same concepts), and researchers can get recommendations related to other researchers at EPFL working on similar topics.
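One simple way such recommendations can work is content-based filtering: courses that share concepts with the student's current course are ranked by overlap. The sketch below uses Jaccard similarity over hypothetical concept sets; the production system's models and data are more sophisticated.

```python
# Content-based recommendation sketch: rank other courses by how many
# concepts they share with the student's course (Jaccard similarity).
# Course names and concept sets are hypothetical.

def recommend_courses(current_course, catalogue):
    """Return other courses sorted by concept overlap with `current_course`."""
    target = catalogue[current_course]
    scores = {}
    for course, concepts in catalogue.items():
        if course == current_course:
            continue
        overlap = len(target & concepts)
        union = len(target | concepts)
        scores[course] = overlap / union if union else 0.0
    return sorted(scores, key=scores.get, reverse=True)

catalogue = {
    "Quantum Information":      {"qubits", "entanglement", "linear algebra"},
    "Advanced Linear Algebra":  {"linear algebra", "eigenvalues"},
    "Organic Chemistry":        {"reactions", "synthesis"},
}

ranked = recommend_courses("Quantum Information", catalogue)
```

A student in "Quantum Information" would see "Advanced Linear Algebra" ranked first, because the two courses share a concept; the same overlap idea extends naturally to matching researchers who publish on similar topics.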
EPFL Graph is a tool for the entire community of EPFL. The goal is to make it useful to as many people as possible, including students, researchers, professors and upper management. Over the next year, we will collect feedback from the community, and focus on leveraging machine learning to provide useful data-driven recommendations to EPFL users in the domains of education, research and technology transfer.
In our development roadmap, we are considering features such as:
Education. ML-powered lecture recommendations for students, and “learning paths” that guide students through an optimal sequence of material for learning a selected concept. Also, content recommendations that help educate students in the scientific foundations, technology and economics of climate science, as well as the statistical foundations of Artificial Intelligence;
Research. Identify opportunities for cross-disciplinary research, and provide matchmaking functionalities for researchers in areas such as Tech-4-Climate, Tech-4-Health (e.g., neurotechnology, computational neurosciences, personalised medicine, AI for public health), and AI-4-Science (e.g., quantum sciences, structural and synthetic biology, theoretical informatics, experimental sciences);
Tech-transfer. Track and evaluate the venture capital and startup funding environment, e.g., by visualising and forecasting investment trends. Provide matchmaking functionalities that connect EPFL labs (and researchers who wish to create a startup company) to potential investors.
History of the project
EPFL Graph didn’t start as a top-down project. It was developed organically as a result of a research collaboration between the Center for Digital Education (under the leadership of Patrick Jermann) and the CHILI Lab of Prof. Pierre Dillenbourg—in particular, while studying how people approach online learning, and how the use of MOOCs affects the performance of EPFL students.
To carry out this research, we needed to set up our own data infrastructure to collect data from multiple sources (MOOCs, IS-Academia, LDAP, Infoscience, etc.). We quickly realised that a lack of data interoperability prevented us from cross-analysing data from different EPFL departments.
We approached this problem with an algorithmic solution that culminated in the development of our own data pipeline, capable of ingesting and interconnecting multiple sources of institutional data, automatically, on a periodic basis.
Eventually, we opened up our unified database for other people at EPFL to use, and created a demo website that allows users to query the database using an intelligent search bar.
As people at EPFL began using this demo and finding it useful in their work, general interest in the project grew, and EPFL upper management designated EPFL Graph a project of strategic importance to the institution.