Focused Crawler for User Profiles Aggregation

Project Details

Laboratory: LSIR
Project type: Semester / Master
Status: Completed

Description:

Web users generate content on many different platforms (e.g., social networks, blogs, comment sections). The companies managing such platforms do not give easy access to this user-generated content, because it represents a significant competitive advantage for them (think of targeted ads, recommendations, and so on). Additionally, each person creates content under multiple Web identities (e.g., by signing in with a Facebook or Twitter account).

The goal of this project is to develop a focused crawler that can identify and aggregate all the online identities of a single person. The final product will be a public API that, given a single profile as input (e.g., the URL of a Twitter profile), returns a list of URLs of all the matching public profiles.
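
As a rough illustration of what such an API could look like, here is a minimal sketch in Python; the function name, the ProfileMatch fields, and the usage shown are assumptions for illustration, not a fixed design:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ProfileMatch:
        """One public profile believed to belong to the same person."""
        url: str            # e.g. "https://github.com/some_user"
        platform: str       # e.g. "github"
        confidence: float   # matching score in [0, 1]

    def aggregate_profiles(seed_profile_url: str) -> List[ProfileMatch]:
        """Given a single public profile URL (e.g. a Twitter profile),
        return all matching public profiles found by the focused crawler.

        Interface sketch only: the crawling and matching logic would
        live behind this call.
        """
        raise NotImplementedError

    # Hypothetical usage:
    # for m in aggregate_profiles("https://twitter.com/some_user"):
    #     print(m.platform, m.url, m.confidence)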

I envision two main components in the crawler (see the sketch after the list):

  1. A focused crawling component, built on top of pre-existing projects and libraries.
  2. A matching component, for which some novel algorithmic work is required.
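
Purely as an illustration of how these two components could fit together, the following Python sketch defines one possible interface for each and the glue between them; the class and method names, and the confidence threshold, are assumptions, not part of the project specification:

    from typing import Iterable, List, Protocol

    class FocusedCrawler(Protocol):
        """Crawling component: yields candidate profile URLs reachable
        from a seed profile (e.g. by following outbound links), built
        on top of existing crawling libraries."""
        def candidate_profiles(self, seed_url: str) -> Iterable[str]: ...

    class ProfileMatcher(Protocol):
        """Matching component: scores how likely a candidate profile is
        to belong to the same person as the seed. This is where the
        novel algorithmic work would go (e.g. comparing names, bios,
        cross-links between profiles)."""
        def score(self, seed_url: str, candidate_url: str) -> float: ...

    def aggregate(seed_url: str,
                  crawler: FocusedCrawler,
                  matcher: ProfileMatcher,
                  threshold: float = 0.8) -> List[str]:
        """Glue code: keep the candidates whose matching score passes
        a (hypothetical) confidence threshold."""
        return [c for c in crawler.candidate_profiles(seed_url)
                if matcher.score(seed_url, c) >= threshold]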

Depending on your preference, the project can focus mostly on the crawling part or on the user-profile matching; I will fill in the gaps as needed.

Benefits:

  • I’ll be directly working with you (yes, writing and reviewing code!), rather than just supervising.
  • All the code will be (and stay) open-source.
  • Opportunity to publish the work and/or extend it as a master thesis/optional master project.

Prerequisites:

  • Firm grasp of concepts such as object-oriented programming, functional programming, and code testing.
  • Excellent fluency in Scala or Python. Alternatively, a solid foundation in Java plus at least one recent dynamic language (e.g., Ruby, Python, JavaScript).

Desirable Experience:

  • Contributions to an open-source project.
  • Familiarity with git and/or GitHub.
  • Knowledge of the HTTP protocol and, more generally, Web standards.

Contact: Michele Catasta