- Definition and typology
- Components of metadata standards
- Choosing the right standard
Choosing an adequate and comprehensive metadata system and implementing it is a very important task, since it will affect access to the Montreux Jazz Festival collection, its management and the related costs, and its long-term preservation. It is also a very complex task, requiring good knowledge of metadata issues and of the project’s needs: “Metadata is expensive to create and maintain”  (p. 8).
We plan to address this issue in three phases. First, we will produce a state of the art (this document) presenting the general issue, some definitions, what one should take into account for such a project, what exists now, what is at risk, what is important and why, etc. The conclusion of this state of the art will present strategic recommendations for the following steps: what has to be defined, and how.
The second phase will be a needs analysis: what are the objectives of the project in terms of preservation, use and management, who are the users and what are their characteristics, how documents are produced and what information is produced to accompany them, which level of granularity is required, etc. This phase will result in a requirements specification (cahier des charges).
During the third phase we will define the metadata themselves, their structure, their content, their values and their format, and the technical implementation details (tools, guidelines, etc.). The implementation itself will help refine and correct some of the details defined earlier.
Definition and purposes
The most common definition of metadata is “data about data”, which says only that they are not the substance of the knowledge, the content itself, but an abstract construction that allows one to apprehend the content or to act on it without having to deal with the content itself: a kind of surrogate.  compares metadata to labels on cans in the supermarket: without them, one would have to open each can and analyse the content to know what it is, where it comes from, whether it is edible and until when, how much it costs, how to prepare it, etc. Metadata are thus a way to access the content in a “structured”  and more efficient way than if one had to access the content itself, and above all to interact with data (by retrieving, commenting, interpreting, preserving, managing, reusing, etc.). This covers a wide variety of information, mainly about resource discovery and control . To gain a better practical understanding of this very broad problem, and to better distinguish between the numerous metadata standards and their specific scopes, it is useful to draw up a typology of the functions of metadata, of what metadata can do.
Metadata can be used in a very wide variety of contexts and for many purposes. Their first aim (actually, the first use of the term) was to help interpret scientific data, but, under other names, they have been used since the beginning of history to help retrieve and manage collections or stocks; one can think of catalogues, inventories, index tables, subject headings, classifications, etc. In the digital and networked world, metadata are also used by communities (tags, labels, HTML tags, etc.) in order to improve retrieval.
So there are three “historical” purposes of metadata: management, identification and discovery, and interpretation. These three purposes apply very well to the digital world, with its particularities: rapid technological change, which leads to problems of readability and information loss; the need for a technical intermediary between the resource and the user; sensitivity to modification, which leads to problems of authenticity and integrity; and ease of diffusion, which leads to problems of rights and of managing multiple copies. Today, the term metadata is generally used for digital and networked resources, and it covers all of those problems.
There have been numerous attempts to draw up a typology of metadata. Among these we can cite the historical “Making of America II” (MOA2) project , also used by its successor METS , with a categorization in three classes: administrative, structural and descriptive metadata; the five categories of : administrative, descriptive, preservation, technical and use metadata; and the  typology, also in five parts: resource description, information retrieval, management of information, rights management, ownership and authenticity, and interoperability. One should also mention here the OAIS information model , which is the basis of most preservation metadata initiatives, among them PREMIS . For more information on the OAIS model, see the chapter on software.
All of these typologies have their advantages and disadvantages, and they reflect the context in which they were created. For example,  focuses more on e-commerce, so his typology separates out ownership management; the Making of America II project, as it emanates from librarians, does not take into account context information, which is very important in archival projects. Moreover, the boundaries between categories are blurred: some metadata belong both to identification and to preservation, for example;  says that “[adequate description, one kind of metadata] underpins the other applications of metadata” (p. 64), in other words, that a particular element (for example, the author) can serve several purposes: description, retrieval, preservation of the context, etc. Other examples: location information can be considered as administrative or as descriptive metadata; rights information as a specific category, as part of administrative metadata, or even as part of descriptive information; structural metadata as preservation or technical metadata; and so on. More generally, a metadata standard often has several purposes: identification and management, for example, and not only management.
As the aim of this document is not to establish a definitive typology but to present the different kinds of metadata, we will simply use the two main purposes of metadata (discovery and management) to enumerate the different metadata, their importance and what they cover according to different authors. We will then summarize the major elements; the needs analysis will allow the element sets to be defined precisely, and here we just list the most common elements.
Identification and discovery
This category includes ’s “Resource discovery and retrieval” function, ’s “Descriptive” metadata, and MOA2’s descriptive information.  uses two different categories, resource description and information retrieval, which reflect the origin of those metadata (description may be described as objective, while retrieval uses subjective information such as abstracting, indexing, etc.). But he recognizes that both functions serve a single goal, retrieving information, and that “adequate description is an essential prerequisite for resource discovery” (p. 64).
Identification information includes a single unique identifier, such as a shelf mark or a location indication, and information such as title, creator, date, format, description, summary, etc.  distinguishes between intrinsic and applied descriptive metadata.
- Intrinsic metadata include physical characteristics (or technical metadata, such as format, compression, size, frame rate, duration, bit rate, etc.) and labelling information. For the latter, one could also speak of bibliographic metadata, since they constitute the classical set of information recorded in library catalogues (and taken from the title page of a book, for example). In the context of the MJF project, it may be more useful to speak of production metadata, as do : director, cameramen, sound engineer, etc.
- Applied metadata include assigned identifiers and contextual information. We consider contextual information to be more management-related than description-related. But applied metadata also include the unique identifier (at the level of the concert, it can be the serial number defined by Montreux Sounds, and at the level of the song, it can be the ISRC), and an abstract or description of the content, which is of particular importance for image and video, where there is no possibility of “full-text” access. The description of the content thus acts as a representation, substituting for the content itself.
Information retrieval metadata include all the metadata whose purpose is to discover and retrieve information. In fact, resource description can also act as information retrieval metadata (one can search for an author or a date), which is why  groups these two functions together. But there are other metadata that allow information retrieval.  makes the distinction between concept-based and content-based retrieval:
- Concept-based information retrieval: it is based on semantic indexing at a high level of abstraction, through keywords from controlled vocabularies such as thesauri, subject headings, ontologies, etc. (here it is recommended to use data value standards  ), classifications, and free-text descriptions. These elements can be the same as the applied metadata above, as well as other elements that improve retrieval while also being identifiers (date of creation, author, format, etc.). The concept-based elements are mainly created manually.
- Content-based information retrieval (CBIR): it is based on low abstraction level indexing, through descriptors automatically created by algorithms analysing the physical characteristics of images (shape, colour, texture, etc.).
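To make the idea of a low-level descriptor concrete, here is a minimal sketch of one possible content-based descriptor, a quantised colour histogram, compared by histogram intersection. It is only an illustration under stated assumptions: the function names, the number of bins and the sample "frames" are hypothetical, and real CBIR systems combine many descriptors (shape, texture, etc.).

```python
from collections import Counter


def colour_histogram(pixels, bins_per_channel=4):
    """Quantise RGB pixels (0-255 per channel) into a normalised histogram.

    Assumes `pixels` is a non-empty iterable of (R, G, B) tuples.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity between 0 and 1; higher means closer colour distributions."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in h1.keys() | h2.keys())


# Hypothetical frames, given as flat lists of (R, G, B) tuples.
stage_red = [(200, 30, 30)] * 90 + [(20, 20, 20)] * 10
stage_red_2 = [(210, 40, 25)] * 80 + [(10, 10, 10)] * 20
stage_blue = [(30, 30, 200)] * 90 + [(20, 20, 20)] * 10

query = colour_histogram(stage_red)
similar = histogram_intersection(query, colour_histogram(stage_red_2))
different = histogram_intersection(query, colour_histogram(stage_blue))
print(similar > different)
```

The descriptor is computed without any human indexing, which is what distinguishes it from the concept-based elements above; it also says nothing about what the frame means.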
Here it is important to note that audiovisual material has two particular characteristics  (p. 89). The first is temporality: unlike textual documents, audiovisual documents have a duration, and access to their content is inseparable from this duration. The second is that audiovisual material has no minimal semantic unit, nothing like “words”, so “full-image” retrieval is not possible. The impact of these two characteristics on image retrieval is that direct access to image and video is not possible; one needs textual (concept-based) indexing. However,  note that a textual description of audiovisual content is less precise than content-based indexing, for two reasons: the meaning of an audiovisual document is not based on words, so a textual description cannot be fully effective; and the subjectivity of conceptual analysis is greater for audiovisual material than for other material. A mix of concept-based and content-based processing therefore has to be used in order to guarantee efficient retrieval. Methods for generating semantic indexes through automatic analysis are currently being investigated, and are one of the project’s research aims .
’s third class is metadata allowing management of information.  speaks about “control”, and  about “administrative” metadata.  speaks about metadata supporting reuse, management and long-term preservation;  as well as  split this category into several, which  keeps grouped: administrative, preservation, technical and use metadata all belong to this category. Conversely,  considers rights metadata as part of management metadata, while  separates management and rights, since he emphasizes e-commerce.
So management metadata includes:
- Control and management of the resources. This category encompasses intellectual and physical management (identification and location of the various copies). So this category overlaps with identification metadata.
- Rights management (who owns the rights, who can access, under which conditions, etc.).
- Technical information: structure, format, resolution, sampling rate, hardware and software dependencies (for the creation and for the rendering of the material), etc.
- An important part of this category, which can constitute a category by itself, is preservation metadata (see below).
- Structural metadata, used to bind several items within a collection, or several parts of an item, or several versions of the same item.
Preservation metadata have been developed only recently, mainly by the OCLC/RLG working group, then by its successor the PREMIS working group, which has produced the PREMIS data dictionary .
The information needed for preservation covers the administrative and structural categories, “in that [preservation metadata] supports the management of digital objects in an archival setting” . In other words, in a preservation project, management metadata is preservation metadata, so one can consider that a good preservation metadata element set should be sufficient to manage all the aspects of a collection.  also defines preservation metadata as “the information a repository uses to support the digital preservation process [… that is] viability, renderability, understandability, authenticity, and identity. Preservation metadata thus spans a number of the categories typically used to differentiate types of metadata: administrative (including rights and permissions), technical, and structural.” All of this information is contained in the OAIS PDI (Preservation Description Information) and in the Information Content itself (Representation Information) .
Since the goal of preservation repositories is to ensure that “the content of an archived object can be rendered and interpreted, in spite of future changes in access technologies” , preservation metadata of course include technical metadata. The OAIS Representation Information (including semantics, which allows users to interpret the rendered data correctly, and structure, which allows data to be rendered by computers) belongs to this category [note].
Only descriptive metadata (and some metadata particular to a specific genre of document) are generally not included, or not totally included, in preservation element sets. So, given that our project has a preservation goal, we will focus here on preservation metadata instead of administrative and management metadata.
According to , preservation metadata are important for three reasons (p. 6). First, “digital objects are technology-dependent”: in order to overcome this dependency, which is accentuated by technological obsolescence, it is necessary to document the software and hardware environment of a document. Second, “digital objects are mutable”: alteration, degradation, or even planned migration leads to information loss. In a historical context, it is necessary to document the provenance and authenticity of documents, including the changes they have undergone over time. Third, “digital objects are bound by intellectual property rights”: preservation actions can be limited by those rights, and it is important to know them in order to take appropriate action.
Without going into the details of the elements, we can list the main areas relevant to preservation metadata :
- Provenance, or custodial history, from creation to the different changes in ownership and retention policies.
- Authenticity: information validating that the object is what it purports to be, without alteration.
- Preservation activity: actions taken over time to preserve the digital objects, and their consequences.
- Technical environment: hardware, software and operating system required to render and use the content.
- Rights management: rights limiting the preservation actions and the dissemination of content.
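As an illustration of how the authenticity and preservation activity areas can translate into practice, the sketch below computes a checksum for an object and records the outcome of a later fixity check as a small event record. This is a simplified sketch: the field names and the identifier are hypothetical and are not taken from any standard's data dictionary.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the object's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fixity_event(object_id: str, data: bytes, recorded_digest: str) -> dict:
    """Compare the current digest with the recorded one and describe the check.

    The dictionary keys are illustrative field names, not standard elements.
    """
    current = sha256_of(data)
    return {
        "object_id": object_id,
        "event_type": "fixity check",
        "event_time": datetime.now(timezone.utc).isoformat(),
        "algorithm": "SHA-256",
        "outcome": "pass" if current == recorded_digest else "fail",
    }


# Hypothetical object: in practice the bytes would be read from the archived file.
original = b"essence of a recorded concert"
digest_at_ingest = sha256_of(original)

unchanged = fixity_event("example-object-1", original, digest_at_ingest)
altered = fixity_event("example-object-1", original + b"!", digest_at_ingest)
print(unchanged["outcome"], altered["outcome"])
```

Keeping the digest recorded at ingest together with the log of later checks documents that the object has not been altered, and keeps a trace of the action taken.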
Defining metadata doesn’t only mean defining elements; a metadata schema comprises other definitions. Here we present the main components of a metadata standard which have to be defined by an archive.
The main component of a metadata standard is the element set (also called data dictionary, data structure, or metadata format). The element set defines two things: the semantics, or the meaning of the elements themselves (called “semantic units” in ); and the content, or how the elements should be populated: “what and how values should be assigned to elements” .
The semantics is generally defined by the elements’ names, their definitions, some examples, and indications of the repeatability and obligation of each element. In order to disambiguate the meaning of semantic units, some standards make use of XML namespaces, identified by URIs (Uniform Resource Identifiers).  doesn’t recommend this method, because the definition is then outside the repository’s control. One should prefer definitions stored locally (p. 18-19).
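The semantic part of an element set can be pictured as a small, locally stored table of definitions. The following sketch assumes a hypothetical three-element set (the names, definitions and example value are invented for illustration) and checks a record against the obligation and repeatability rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementDefinition:
    name: str
    definition: str
    mandatory: bool   # obligation
    repeatable: bool  # repeatability
    example: str = ""


# Hypothetical local element set, stored under the repository's own control.
ELEMENT_SET = {
    e.name: e
    for e in [
        ElementDefinition("identifier", "Unique identifier of the recording",
                          True, False, "EXAMPLE-0001"),
        ElementDefinition("title", "Title of the song as performed", True, False),
        ElementDefinition("performer", "Person or group performing", False, True),
    ]
}


def check_structure(record: dict[str, list[str]]) -> list[str]:
    """Return a list of problems; an empty list means the record is well formed."""
    problems = []
    for name, values in record.items():
        if name not in ELEMENT_SET:
            problems.append(f"unknown element: {name}")
        elif len(values) > 1 and not ELEMENT_SET[name].repeatable:
            problems.append(f"element not repeatable: {name}")
    for name, definition in ELEMENT_SET.items():
        if definition.mandatory and not record.get(name):
            problems.append(f"missing mandatory element: {name}")
    return problems


print(check_structure({"title": ["A", "B"]}))
```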
The content instructions explain which sources of information to take into account when creating metadata records, how to create them (representation rules, or syntax encoding scheme: case, punctuation, date format, etc.), and what the allowable content values are (vocabulary encoding scheme, such as controlled vocabularies, thesauri, subject heading lists, classifications, unique identifiers, etc.). To illustrate, content instructions for the MJF project would include, among others: standards, or even authority lists, for the artists’ names; a thesaurus for instruments and musical vocabulary (medley, solo, chorus, theme, etc.); indications on the reliable sources of information for exact titles and rights; and a standard for unique identifiers, such as the ISRC (International Standard Recording Code).
Content instructions apply not only to descriptive metadata sets, but also to other types of metadata. These rules can be included in the element set, together with the semantics, or take the form of a manual. For technical metadata about the file format and its rendering, format registries can be used (see chapter on formats), although this method is not recommended, given that the reliability and longevity of such registries are uncertain.
Defining the element set with all those components is the best way to ensure that the metadata records will be consistent, which improves retrieval and facilitates the creation and exchange of metadata as well as their interpretation, since metadata can have several meanings depending on the user. It also ensures the interoperability and renderability of data over the long term.
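Content rules of this kind can be checked automatically. The sketch below assumes three hypothetical rules: the identifier must be a well-formed ISRC (two letters for the country, three alphanumeric characters for the registrant, two digits for the year and five digits for the designation), dates must be written as YYYY-MM-DD, and instruments must come from a controlled list. The list and the sample values are invented for illustration; a real project would load them from its thesaurus.

```python
import re
from datetime import date

ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}\d{2}\d{5}")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
# Placeholder controlled vocabulary, standing in for a real thesaurus.
INSTRUMENTS = {"piano", "double bass", "drums", "saxophone", "trumpet"}


def normalise_isrc(value: str) -> str:
    """Apply a representation rule: no hyphens, upper case."""
    return value.replace("-", "").upper()


def is_valid_date(value: str) -> bool:
    """Check both the written form and that the date exists in the calendar."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_content(record: dict) -> list[str]:
    """Return a list of content problems; empty means all rules are satisfied."""
    problems = []
    if not ISRC_PATTERN.fullmatch(normalise_isrc(record.get("isrc", ""))):
        problems.append("isrc: not a well-formed ISRC")
    if not is_valid_date(record.get("date", "")):
        problems.append("date: expected YYYY-MM-DD")
    for instrument in record.get("instruments", []):
        if instrument not in INSTRUMENTS:
            problems.append(f"instruments: '{instrument}' not in controlled list")
    return problems


# The ISRC below is an invented value used only to exercise the pattern.
print(check_content({"isrc": "ch-z99-08-00001", "date": "2008-07-16",
                     "instruments": ["piano", "kazoo"]}))
```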
Another important component of a metadata model is the elements’ syntax, or the way elements are encoded, mainly through metalanguages (chiefly mark-up languages such as XML or SGML) for machine processing, or simply expressed in text format for human reading. Although encoding is necessary in the implementation phase, not all metadata standards define it; the scheme is then called syntax-independent. For example, the Dublin Core was first designed without any encoding.
In the digital world, XML is often chosen as the preferred metadata syntax, allowing the semantics (element set and content) to be defined through an XML or SGML DTD (Document Type Definition), as is done for HTML, or through XML schemas (for example, PREMIS has an XML schema implementation); XML indeed has the advantage of allowing automated validation.  mentions XML as an unavoidable encoding, as well as : “XML has long been acknowledged as a robust and human-readable format for the archiving of metadata […]. It is non-proprietary […]. Its flexibility ensures that archived metadata in XML should be readily usable in future deliverable mechanisms” (p. 18). But some digital file formats, such as MXF, encode embedded metadata in KLV (Key-Length-Value).
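As a minimal illustration of XML as a metadata syntax, the sketch below serialises a record to XML with Python's standard library and reads it back. The element names are hypothetical. Note that the standard parser only checks that the document is well formed; validating against a DTD or an XML schema would require an additional library, which is not shown here.

```python
import xml.etree.ElementTree as ET


def record_to_xml(record: dict[str, list[str]]) -> str:
    """Serialise a record as <record><name>value</name>...</record>."""
    root = ET.Element("record")
    for name, values in record.items():
        for value in values:
            ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict[str, list[str]]:
    """Parse the XML back; raises ET.ParseError if it is not well formed."""
    root = ET.fromstring(xml_text)
    record: dict[str, list[str]] = {}
    for child in root:
        record.setdefault(child.tag, []).append(child.text or "")
    return record


# Hypothetical record; repeated elements are kept as lists.
sample = {"title": ["Q&A <live>"], "performer": ["First Artist", "Second Artist"]}
encoded = record_to_xml(sample)
print(encoded)
print(xml_to_record(encoded) == sample)
```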
A metadata record (i.e. “an instance of a metadata element set, complying with a given metadata format”  (p. 7)) can be stored in different ways in order to be useful: inside the object (embedded), outside the object and linked to it (database of records), or handled by a third party. The last solution will be neither presented nor discussed here, because it applies to networked libraries.
Several formats, including some in the domain of audiovisual archives, allow some metadata to be stored embedded inside the object, in self-describing wrappers: Adobe XMP or JPEG2000, both based on XML, or MXF, based on the KLV (Key-Length-Value) standard. The advantage of this encapsulation is that it “ensures that metadata is always stored and transported with the record, simplifies the long-term management, and assures that the retrieved record is physically self-explanatory”  (p. 173). So embedding metadata is useful for transport, but also for management (preservation). On the other hand, it is not possible to store all of the metadata elements encapsulated with the content, because the file formats that allow embedding are restrictive: they cannot include the full range of metadata required for management, preservation and, above all, efficient retrieval of items.
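The principle of KLV encoding can be sketched in a few lines: each item is a fixed-size key, followed by the length of the value, followed by the value itself. The sketch below assumes 16-byte keys and BER-style lengths (one byte for lengths below 128, otherwise a prefix byte giving the number of length bytes). The key used is a made-up placeholder, not a registered label, and the decoder does not guard against truncated input.

```python
def encode_ber_length(n: int) -> bytes:
    """Short form for n < 128, long form (0x80 | byte count, then bytes) otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_klv(key: bytes, value: bytes) -> bytes:
    """Concatenate key, encoded length and value."""
    if len(key) != 16:
        raise ValueError("this sketch assumes 16-byte keys")
    return key + encode_ber_length(len(value)) + value


def decode_klv(buffer: bytes):
    """Yield (key, value) pairs from a concatenation of KLV items."""
    pos = 0
    while pos < len(buffer):
        key = buffer[pos:pos + 16]
        pos += 16
        first = buffer[pos]
        pos += 1
        if first < 0x80:
            length = first
        else:
            count = first & 0x7F
            length = int.from_bytes(buffer[pos:pos + count], "big")
            pos += count
        yield key, buffer[pos:pos + length]
        pos += length


PLACEHOLDER_KEY = bytes(range(16))  # hypothetical key, for illustration only
stream = encode_klv(PLACEHOLDER_KEY, b"x" * 300) + encode_klv(PLACEHOLDER_KEY, b"title")
items = list(decode_klv(stream))
print(len(items), items[1][1])
```

A reader that does not recognise a key can still skip the item thanks to the length field, which is what makes such wrappers self-describing at the structural level.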
The other option is to store metadata in databases, linked to objects: relational databases, or XML databases. DAM (digital asset management) systems usually integrate a data management layer, which corresponds to an OAIS entity. Storing metadata in a database has the advantage of fast access; moreover, near-online (tape libraries) or offline (on the shelves) archives cannot be retrieved easily without databases.
Although self-description of a digital file format is one of the main requirements for archives (see chapter on formats), it is not necessary that all metadata be stored within the file, only those required to interpret and preserve the file.  mentions that the OAIS Archived Information Package is conceptual, and that it is not required that all of the metadata be stored in a single package (p. 5). In conclusion, let’s say that “storing metadata elements in a database system has the advantages of fast access, easy update, and ease of use for query and reporting. Storing metadata records as digital objects in repository storage along with the digital objects the metadata describes also has advantages: it is harder to separate the metadata from the content, and the same preservation strategies that are applied to the content can be applied to the metadata. Recommended practice is to store critical metadata in both ways”  (p. 17). Generally, administrative and structural metadata are stored with the data, and descriptive metadata apart, in another file.
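The practice of storing critical metadata in both ways can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the data management layer, a dictionary stands in for repository storage, and the table, field and file names are hypothetical.

```python
import json
import sqlite3


def store_both_ways(conn: sqlite3.Connection, storage: dict, object_id: str, record: dict) -> None:
    """Write the same serialised record to the database and next to the object."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    conn.execute(
        "INSERT OR REPLACE INTO metadata (object_id, record) VALUES (?, ?)",
        (object_id, payload),
    )
    # Sidecar copy kept in repository storage alongside the object itself.
    storage[f"{object_id}.metadata.json"] = payload


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (object_id TEXT PRIMARY KEY, record TEXT NOT NULL)")
repository_storage: dict[str, str] = {}

store_both_ways(conn, repository_storage, "example-object-1",
                {"title": "Example title", "format": "MXF"})

row = conn.execute(
    "SELECT record FROM metadata WHERE object_id = ?", ("example-object-1",)
).fetchone()
print(row[0] == repository_storage["example-object-1.metadata.json"])
```

The database copy serves querying and reporting, while the sidecar copy travels with the object and can be preserved with the same strategy as the content.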
Many standards have been developed by several communities, with different goals: each institution, depending on the type of material it owns, the use it makes of it, its users, its policies, etc., has specific needs; there is no single metadata standard that is adequate for describing all types of collections and materials . Therefore, one could wonder why one should adopt standards rather than develop an in-house model. But there are several advantages to using standards:
- Developing one’s own model is time-consuming and costly. A standard is cheaper to adopt and maintain.
- Benefitting from others’ experience helps avoid mistakes; standards can thus act as best practices and checklists. Comparison is possible.
- One cannot predict the future uses of the archive in dissemination; using standards is therefore the best way to ensure that those future uses won’t be affected by a possible lack of interoperability.
- Interoperability is a good way to ensure that metadata can be migrated to new systems. If it is accepted that metadata is essential to the understanding of digital information over time, then interoperability of metadata is “essential if we are to preserve our digital information”  (p. 139).
- The interoperability of data themselves (their self-understanding, in any environment, with any hardware or software) can be ensured by metadata (cf. criteria of self-documentation in the chapter on formats).
However, adopting a standard doesn’t mean adopting all elements of this standard; it can be customized, adapted, and augmented by other standards, in order to address the specific needs of the project. Moreover, the elements are not necessarily atomic: they can be deconstructed into more precise components  (p. 3). Indeed, the main requirement is interoperability, that is, understandability.
Several factors can be outlined in order to select an appropriate and adequate metadata model.
First of all, the “most important”  principle is interoperability.  cites two ways of guaranteeing interoperability: technical (or syntactic) and semantic. So one should choose an encoding standard such as XML, and adopt standardised metadata formats.
In order to be able to integrate several standards together (a descriptive and a preservation one, for example), and to be able to exchange data with other institutions using other standards, it is necessary to adopt another kind of interoperability: the structural one . This is achieved through frameworks and wrapper technologies, like METS.
METS is an XML schema able to contain in one file (for exchange) administrative, descriptive and structural metadata. It may be referenced (linked to objects) or embedded (data and metadata in the same file)  (p. 16), in accordance with the OAIS principles. METS application profiles, such as the one developed by the National Library of Australia , are “methods that allow metadata creators to combine elements from multiple formats as needed”  (p. 27). XSLT-based mappings between metadata standards expressible in XML are an alternative solution to the problem of interoperability, and may be easier and cheaper to develop. However, the National Library of Australia developed a METS profile in order to integrate PREMIS metadata with other descriptive and format-specific metadata, which could act as a starting point for further reflection. About this profile, see , , .
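The idea of a mapping between metadata standards can be illustrated without XSLT by a simple crosswalk table. The sketch below maps hypothetical local element names to Dublin Core element names; the local names are invented, and the mapping itself is an assumption made for illustration, not a recommended crosswalk.

```python
# Hypothetical local element names mapped to Dublin Core element names.
CROSSWALK = {
    "song_title": "title",
    "performer": "creator",
    "concert_date": "date",
    "isrc": "identifier",
    "file_format": "format",
}


def to_dublin_core(local_record: dict[str, list[str]]):
    """Return (mapped record, list of local elements that have no target)."""
    mapped: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for name, values in local_record.items():
        target = CROSSWALK.get(name)
        if target is None:
            unmapped.append(name)
        else:
            mapped.setdefault(target, []).extend(values)
    return mapped, unmapped


local = {
    "song_title": ["Example title"],
    "performer": ["First Artist", "Second Artist"],
    "frame_rate": ["25"],
}
dc_record, lost = to_dublin_core(local)
print(dc_record)
print(lost)
```

Reporting the unmapped elements makes the information loss of a crosswalk visible: whatever the target standard cannot express has to be carried some other way, for example inside a wrapper.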
In addition to interoperability,  (p. 8) cites two other factors. A preservation metadata schema should be comprehensive: a standard should be as complete as possible, even if it exceeds current needs, since only a portion of the whole set need be used in order to remain simple. It should also be oriented towards implementation: the creation of metadata and its preservation are expensive and time-consuming, so the principles of simplicity ( says that the core element set should be as minimal as possible), modularity, reusability and extensibility should be taken into account, as well as automated creation where possible. Finally, it is also very important to consider the viability of metadata, which have to be preserved along with the data. This can be achieved through “high-quality, standards-based, system-independent metadata [which] can be used, reused, migrated, and disseminated in any number of ways, even in ways that we cannot anticipate at this moment” , such as XML, and also through Representation Information , which is recursive: metadata define how to interpret data, but they themselves also have to be defined in order to be readable.
The process of creating and recording metadata is time-consuming and expensive, but it is necessary in order to ensure access to digital data over the long term. The scope and use therefore have to be well defined through interviews and precise analysis; this will be done in the next phase of the project, and will take into account elements such as :
- The institution: resources available (time, skills), use of similar institutions as best practices, existing metadata and content standards, their format, the system environment. To find out what the current practices are, one should take a look at , and examine the user communities of standards like Dublin Core, PREMIS or METS.
- The standard: one should choose the most specific standard related to the domain of MJF, broadly used and well known, stable and frequently updated.
- The materials: the elements vary according to the genre and format of the materials to be preserved, and to the users. So one should define which elements are required to document the particularities of the material preserved, and which models could best integrate those elements.
- The project: an in-depth analysis should help define elements such as the granularity of the description.
[note] One should however note that semantic information applies more to scientific data than to videos, and especially to music videos like the MJF ones, which “speak” for themselves. They “have meaning in themselves, not as part of a compound digital object. They are, in other words, discrete digital objects”  (p. 5). But structure information has to be documented. In the PREMIS model  (p. 13), this is done through the relationship entities. Structural metadata have to be defined for video; the PREMIS model may not be sufficient (cf. , p. 17).
All references accessed on December 16, 2008.
 BRADLEY, Kevin. LEI, Junran. BLACKALL, Chris. Towards an Open Source Repository and Preservation System: Recommendations on the implementation of an Open Source Digital Archival and Preservation System and on Related Software Development [online]. Paris: Unesco, 2007. 34 p. http://portal.unesco.org/ci/en/ev.php-URL_ID=24700&URL_DO=DO_TOPIC&URL_SECTION=201.html
 CAPLAN, Priscilla. Preservation Metadata. In: DCC Digital Curation Manual [online]. July 2006. 26 p. http://www.dcc.ac.uk/resource/curation-manual/chapters/preservation-metadata
 CHAN, Lois Mai. ZENG, Marcia Lei. Metadata Interoperability and Standardization: A Study of Methodology Part I. In: D-Lib Magazine [online]. Vol. 12 Nr. 6, June 2006. http://www.dlib.org/dlib/june06/chan/06chan.html
 CONSULTATIVE COMMITTEE FOR SPACE DATA SYSTEMS (CCSDS). Reference Model for an Open Archival Information System (OAIS) [online]. Washington, DC: CCSDS, 2002. 148 p. http://public.ccsds.org/publications/archive/650x0b1.pdf
 DAY, Michael. Metadata. In: DCC Digital Curation Manual [online]. November 2005. 41 p. http://www.dcc.ac.uk/resource/curation-manual/chapters/metadata
 DAY, Michael. Preservation Metadata Initiatives: Practicality, Sustainability and Interoperability [online]. Bath: UKOLN, University of Bath, 2004. 16 p. http://www.ukoln.ac.uk/preservation/publications/erpanet-marburg/day-paper.pdf
 EPFL. Montreux Jazz Festival Digital Archive Project: A unique and first of a kind high resolution digital archive of the Montreux Jazz Festival. October 2008.
 FOULONNEAU, Muriel. RILEY, Jenn. Metadata for Digital Resources: Implementation, Systems Design and Interoperability. Oxford: Chandos Publishing, 2008. 203 p.
 GILL, Tony. GILLILAND, Anne J. WHALEN, Maureen. WOODLEY, Mary S. Introduction to Metadata [online]. Los Angeles: Getty, 2008. http://www.getty.edu/research/conducting_research/standards/intrometadata/index.html
 GOUYET, Jean-Noel. GERVAIS, Jean-François. Gestion des médias numériques: Digital Media Asset Management. Paris: Dunod, 2006. 328 p.
 GUENTHER, Rebecca S. Battle of Buzzwords: Flexibility vs. Interoperability when Implementing PREMIS in METS. In: D-Lib Magazine [online]. July 2008. http://www.dlib.org/dlib/july08/guenther/07guenther.html
 HARTMAN, Cathy Nelson. GELAW ALEMNEH, Daniel. HASTINGS, Samantha Kelly. Metadata Approach to Preservation of Digital Resources: The University of North Texas Libraries’ Experience. In: First Monday [online]. 2002. http://www.firstmonday.org/issues/issue7_8/alemneh/index.html
 HAYNES, David. Metadata for information management and retrieval. London: Facet Publishing, 2004. 186 p.
 LAVOIE, Brian. GARTNER, Richard. Preservation Metadata [online]. York: Digital Preservation Coalition, 2005 (Technology Watch Report 05-01). 21 p. http://www.dpconline.org/docs/reports/dpctw05-01.pdf
 LAZINGER, Susan S. Digital Preservation and Metadata: History, Theory, Practice. Englewood, Colorado: Libraries Unlimited, 2001. 359 p.
 OCLC/RLG WORKING GROUP ON PRESERVATION METADATA. Preservation Metadata for Digital Objects: A Review on the State of the Art. [online]. Dublin, Ohio: OCLC, 2001. 50 p. http://www.oclc.org/research/projects/pmwg/presmeta_wp.pdf
 OCLC/RLG WORKING GROUP ON PRESERVATION METADATA. A Metadata Framework to Support the Preservation of Digital Objects [online]. Dublin, Ohio: OCLC, 2002. 51 p. http://www.oclc.org/research/projects/pmwg/pm_framework.pdf
 PEARCE, Judith. PEARSON, David. WILLIAMS, Megan. YEADON, Scott. Report of the METS Profile Development Project [online]. APSR, 2007. http://www.apsr.edu.au/nla-mets/mets_profile_report.pdf
 PREMIS EDITORIAL COMMITTEE. PREMIS Data Dictionary for Preservation Metadata version 2.0 [online]. Washington, DC: Library of Congress, 2008. 224 p. http://www.loc.gov/standards/premis/v2/premis-2-0.pdf
 PREMIS Working Group. Implementing Preservation Repositories For Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community: A Report by the Premis Working Group [online]. Dublin, Ohio: OCLC, 2004. 66 p. http://www.oclc.org/research/projects/pmwg/surveyreport.pdf
 RAIELI, Roberto. INNOCENTI, Perla. The achievable innovation by the way of MultiMedia Information. In: Bollettino AIB, vol. 45, no. 1, Mar 2005, pp. 17-47.