Extracting data from template-based websites - LSIR - EPFL

Project Details

Extracting data from template-based websites

Laboratory : LSIR

Master

Completed

Description:

Many websites are dynamically generated from a database and a template. Examples include e-commerce sites such as Amazon, classified-ad sites such as craigslist.org, and flight schedules such as swiss.com.

This project consists of implementing an algorithm that takes a set of template-generated pages from a single website, automatically learns the template, and uses it to extract the underlying data from the pages. The starting point is the publication "Extracting Structured Data from Web Pages" by Arasu and Garcia-Molina (Stanford University).
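To make the idea of "learning the template" concrete, here is a minimal Python sketch. It is not the algorithm from the cited publication; it is a much simpler heuristic that assumes the template is the sequence of tokens shared, in order, by every sample page, and that everything between template tokens is data. The function names (tokenize, learn_template, extract) and the sample pages are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher

# A token is either an HTML tag or a run of non-space text (simplifying assumption).
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")

def tokenize(html):
    """Split a page into tag tokens and whitespace-separated text tokens."""
    return TOKEN.findall(html)

def common_subsequence(a, b):
    """Tokens shared by a and b, in order (not necessarily the longest such sequence)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared

def learn_template(pages):
    """Fold the pages together, keeping only tokens common to all of them."""
    token_lists = [tokenize(page) for page in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_subsequence(template, tokens)
    return template

def extract(template, page):
    """Return the non-empty text runs found between consecutive template tokens."""
    fields, current, i = [], [], 0
    for token in tokenize(page):
        if i < len(template) and token == template[i]:
            fields.append(" ".join(current))  # close the gap before this template token
            current = []
            i += 1
        else:
            current.append(token)  # token is not part of the template: treat as data
    fields.append(" ".join(current))
    return [field for field in fields if field]

# Hypothetical sample pages generated from one template.
pages = [
    "<html><h1>Dune</h1><p>Price: 10 CHF</p></html>",
    "<html><h1>Emma</h1><p>Price: 25 CHF</p></html>",
    "<html><h1>Ulysses</h1><p>Price: 7 CHF</p></html>",
]
template = learn_template(pages)
records = [extract(template, page) for page in pages]
print(records)
```

On these three pages the learned template keeps the tags plus the constant words "Price:" and "CHF", and each record holds the title and the price. A heuristic this naive breaks down when a data value happens to equal a template token or when pages contain optional or repeated sections, which is the kind of case a real template-induction algorithm needs to handle.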

Tasks:

Implement the algorithm proposed in the cited publication
Run the algorithm on a given set of websites and analyze its success rate
Propose improvements
Implement a crawler suited for the task
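The description does not define "success rate". One possible way to measure it, assuming that a hand-labelled list of expected values is available for each sample page, is to compute precision, recall and F1 over the extracted values of a site. The sketch below is written in Python; the function name site_score and the sample data are hypothetical.

```python
from collections import Counter

def site_score(pairs):
    """Precision, recall and F1 for one site.

    pairs holds one (extracted, expected) tuple per page, each a list of
    field values. Values are compared page by page as multisets.
    """
    hits = n_extracted = n_expected = 0
    for extracted, expected in pairs:
        # Counter intersection counts values present in both lists.
        hits += sum((Counter(extracted) & Counter(expected)).values())
        n_extracted += len(extracted)
        n_expected += len(expected)
    precision = hits / n_extracted if n_extracted else 0.0
    recall = hits / n_expected if n_expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical results for two pages of one site: the second page misses a value.
pairs = [
    (["Dune", "10"], ["Dune", "10"]),
    (["Emma"], ["Emma", "25"]),
]
precision, recall, f1 = site_score(pairs)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Counting hits per page rather than pooling all values of a site avoids crediting a value that was extracted from the wrong page.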

Requirements:

Expertise in Java or Python
Previous work on unsupervised learning methods

This project will be jointly supervised by David Portabella (http://db4all.com/) and Zoltan Miklos.

Site:

Contact:	Zoltan Miklos