Extracting data from template-based websites

Project Details

Laboratory: LSIR
Type: Master
Status: Completed

Description:

Many websites are dynamically generated from a database and a template. Examples include e-commerce sites such as Amazon, classified-ad sites such as Craigslist, and flight schedules such as those on swiss.com.

This project consists of implementing an algorithm that takes a set of template-generated pages from a single website, automatically learns the template, and extracts the data embedded in the pages. The starting point is the publication “Extracting Structured Data from Web Pages” by Arasu and Garcia-Molina (Stanford).
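
To make the idea of learning a template from several pages concrete, here is a minimal Python sketch. It is not the algorithm from the cited publication; it only borrows the intuition that tokens occurring the same number of times in every page are likely to belong to the template. The function names (`tokenize`, `occurrence_classes`, `extract`) and the two sample pages are hypothetical, and the sketch assumes that template tokens occur exactly once per page and ignores their order and nesting.

```python
import re
from collections import defaultdict

# A token is either an HTML tag or a whitespace-delimited word.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(page):
    """Split a page into HTML tags and words."""
    return TOKEN_RE.findall(page)


def occurrence_classes(pages):
    """Group tokens by their occurrence vector (count in each page)."""
    token_lists = [tokenize(p) for p in pages]
    vocabulary = {tok for toks in token_lists for tok in toks}
    classes = defaultdict(set)
    for tok in vocabulary:
        vector = tuple(toks.count(tok) for toks in token_lists)
        classes[vector].add(tok)
    return classes


def extract(pages):
    """Treat tokens that occur exactly once in every page as template
    (a simplifying assumption) and return the text between them."""
    classes = occurrence_classes(pages)
    template = classes.get((1,) * len(pages), set())
    rows = []
    for page in pages:
        fields, current = [], []
        for tok in tokenize(page):
            if tok in template:
                # A template token closes the data field in progress.
                if current:
                    fields.append(" ".join(current))
                    current = []
            else:
                current.append(tok)
        if current:
            fields.append(" ".join(current))
        rows.append(fields)
    return rows


pages = [
    "<html><h1>Red Shoes</h1><p>Price: 40 CHF</p></html>",
    "<html><h1>Blue Hat</h1><p>Price: 15 CHF</p></html>",
]
rows = extract(pages)
print(rows)  # [['Red Shoes', '40'], ['Blue Hat', '15']]
```

On these two pages the tags, `Price:` and `CHF` all share the occurrence vector (1, 1) and are treated as template, so only the product name and the price remain as data. Real pages contain repeated and optional elements, which this sketch does not handle.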

Tasks:

  • Implement the algorithm proposed in the cited publication
  • Run and analyze the success rate for a set of given websites
  • Propose improvements
  • Implement a crawler suited for the task
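
The listing does not define "success rate". One possible reading, sketched below in Python, is field-level precision and recall against hand-labelled values for each page. The function name `field_scores` and the sample data are hypothetical and serve only as illustration.

```python
def field_scores(extracted, expected):
    """Return (precision, recall) computed over per-page sets of field values."""
    true_pos = n_extracted = n_expected = 0
    for got, gold in zip(extracted, expected):
        got_set, gold_set = set(got), set(gold)
        true_pos += len(got_set & gold_set)
        n_extracted += len(got_set)
        n_expected += len(gold_set)
    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / n_expected if n_expected else 0.0
    return precision, recall


# Hypothetical extractor output and hand-labelled reference for two pages.
extracted = [["Red Shoes", "40"], ["Blue Hat", "Price: 15"]]
expected = [["Red Shoes", "40"], ["Blue Hat", "15"]]

precision, recall = field_scores(extracted, expected)
print(precision, recall)  # 0.75 0.75
```

Exact string matching is strict: the field `Price: 15` counts as wrong even though it contains the right value. A more lenient comparison would be a separate design decision.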

Requirements:

  • Expertise in Java or Python
  • Previous work on unsupervised learning methods

This project will be jointly supervised by David Portabella (http://db4all.com/) and Zoltan Miklos.


Contact: Zoltan Miklos