Cleaning up metadata in the Montreux Jazz Digital Project database to ensure data entries yield accurate search results in the Montreux Jazz Festival digital archive.
Many of the original data entries from previous Montreux Jazz databases contained errors or were duplicates. However, content in the Montreux Jazz Festival digital archive can only be found and accessed easily if all data entries are accurate. The first step towards a searchable archive was to input all data into a single Montreux Jazz Digital Project database. The second step, as part of the documentation process, is to ensure that all existing and future data entries are clean and consistent. The main objective of the data cleanup project is to ensure each concert is linked to one correctly spelt artist or band name. A cleanup of all instruments is also required.
Currently there are 43’000 artist names (separated names and duplicates included). After the data cleanup there will be 30’000 artist names. The 13’000 entries removed amount to almost a third of the original total, which highlights the importance of the unified Montreux Jazz database cleanup project.
There are many types of data errors. For example, artist names may be abbreviated, shortened, in capitals, inverted, wrongly spelt and in some cases one name may represent more than one artist. In the example below, only the last entry is correct.
Example: Ph. Collins; P. Collins; Collins, Phil; PHIL COLLINS; Phil Colins; Phelps Collins; Phil Collins
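The error types in this example can be illustrated with a small Python sketch. It is only a rough heuristic, not the project's actual method: the function name `classify`, the labels, and the 0.9 similarity threshold are assumptions chosen for illustration, and the canonical spelling is taken from the last entry of the example.

```python
from difflib import SequenceMatcher

CANONICAL = "Phil Collins"  # the last, correct entry of the example


def classify(entry: str, canonical: str = CANONICAL) -> str:
    """Return a rough label for how `entry` deviates from `canonical`.

    Hypothetical helper: the labels and the threshold are illustrative only.
    """
    if entry == canonical:
        return "correct"
    if "," in entry:
        return "inverted"  # "Last, First" form
    if entry.isupper() and entry.casefold() == canonical.casefold():
        return "capitals"
    if entry.split()[0].endswith("."):
        return "abbreviated"  # first name shortened to "P." or "Ph."
    # Character-level similarity between 0.0 and 1.0, ignoring case.
    ratio = SequenceMatcher(None, entry.casefold(), canonical.casefold()).ratio()
    if ratio >= 0.9:  # assumed threshold for "probably a typo"
        return "misspelt"
    return "needs manual review"  # possibly a different artist


if __name__ == "__main__":
    variants = ["Ph. Collins", "P. Collins", "Collins, Phil", "PHIL COLLINS",
                "Phil Colins", "Phelps Collins", "Phil Collins"]
    for v in variants:
        print(f"{v!r}: {classify(v)}")
```

A label such as "needs manual review" reflects the case where one name may stand for another artist altogether, which a similarity score alone cannot decide.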
Fifteen students are involved in the data cleanup project to validate corrections. The project consists of five processes. The first is automatic: a cleanup algorithm is launched. The remaining four are semi-automatic and require validation by students, meaning that a person checks whether each data entry has been corrected properly and makes changes if necessary. The five-stage data cleanup process takes place as follows:
- Automatic cleanup: includes automatic removal of spaces and dots, merging of double names and de-capitalization of letters that should be in lowercase
- Name split: splitting fused names into separate entities (requires validation)
- Name re-arrangement: splitting first and last names
- Name edit: resolving first name errors and abbreviations (requires validation)
- Name flipping: inverting first and last names (requires validation)
- General cleanup: validation of remaining artist names
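
The automatic cleanup and the name flipping stages listed above could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual algorithm: the function names `auto_clean`, `merge_duplicates` and `propose_flip` are hypothetical, "removal of spaces and dots" is assumed to mean stripping dots and collapsing superfluous whitespace, and "merging of double names" is assumed to mean merging entries that become identical after cleaning.

```python
import re


def auto_clean(name: str) -> str:
    """Automatic stage: strip dots, collapse whitespace, de-capitalize."""
    name = name.replace(".", " ")             # "Ph." -> "Ph "
    name = re.sub(r"\s+", " ", name).strip()  # collapse runs of whitespace
    # Words written fully in capitals become "Capitalized". Acronym-style
    # band names would be altered too, so such cases still need a human check.
    words = [w.capitalize() if w.isupper() and len(w) > 1 else w
             for w in name.split(" ")]
    return " ".join(words)


def merge_duplicates(names):
    """Group raw entries by their cleaned form (assumed meaning of 'merging')."""
    merged = {}
    for raw in names:
        merged.setdefault(auto_clean(raw), []).append(raw)
    return merged


def propose_flip(name: str):
    """Suggest 'First Last' for a 'Last, First' entry, or None.

    The result is only a proposal; a person validates it before it is saved.
    """
    if name.count(",") != 1:
        return None
    last, first = (part.strip() for part in name.split(","))
    return f"{first} {last}" if first and last else None


if __name__ == "__main__":
    print(merge_duplicates(["PHIL COLLINS", "Phil Collins", "Phil  Collins."]))
    print(propose_flip("Collins, Phil"))
```

Keeping the flip as a proposal rather than applying it directly mirrors the semi-automatic design: the algorithm suggests a correction and a student confirms or edits it.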