IM2.IP1: Integrated Multimodal Processing


Full title

IM2.IP1: Integrated Multimodal Processing

Project website

Duration of project

From January 2010 to December 2013

Funding source

Swiss National Science Foundation


This project will focus on the IM2 core multimodal technologies (speech processing, visual processing, integration of modalities, coordination among modalities, and further development and evaluation of meeting browsers), geared towards integrating them into end-to-end applications and consolidating all IM2 activities developed in Phase I and Phase II. The research focus in IP1 will also be driven by the findings and possible requirements arising from IP2.

IP1 has a research component (pursuing the most promising and/or fundamental research directions initiated in IM2 Phase II), as well as a strong integration and evaluation component. Hence, besides further pursuing some of the most promising research directions in multimodal processing within the strict context of the IM2 vision, one of the objectives of IP1 will be to extend the application of multimodal technologies, within the human meeting and conference framework, towards more integrated systems that work in real time, with human intervention only when required.

Research activities

The MMSPL team is involved in IM2.IP1, working on:


  • Multimodal quality metrics for multimedia content abstraction

Multimedia services rely on two main actors: the human subject, who is the end user of the service, and the multimedia content, which is the object of the multimedia communication. In this scenario, one of the most relevant features, implicitly taken into account by the end user, is the quality of the multimedia data involved in the application of interest. The user interacts with the multimedia data and readily judges the quality of its content and, more broadly, the quality of the multimedia experience in which he or she is participating. “Quality” is a particular feature of multimedia content: it depends on the characteristics of the content itself, but it is also closely tied to the subjectivity of the human beings who interact with it. This is why the user-media interaction can be described as a “multimedia experience”.

Our research in this scenario focuses on modeling subjective quality assessment and mapping it into objective algorithms, i.e., metrics. The goal of our study is to design objective metrics that automatically evaluate the quality of multimedia content and correlate highly with real human perception. In particular, an important part of our research concentrates on understanding and modeling the multimodal perception of quality, i.e., visual, audio, and audio-visual quality, in order to design a metric for assessing the more general and complex concept of Quality of Experience in a multimedia service.
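As a toy illustration of this idea (not the project's actual metrics), the sketch below computes a simple objective score, PSNR, in pure Python, and then measures how well a set of objective scores agrees with hypothetical mean opinion scores (MOS) using Pearson correlation; all numbers are invented for illustration:

```python
import math

def psnr(ref, dist, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)

def pearson(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: objective scores (e.g. PSNR in dB) for four distorted
# versions of a clip, and the MOS collected from viewers (1 = bad .. 5 = excellent).
objective_scores = [42.1, 38.5, 33.2, 28.7]
mos = [4.6, 4.1, 3.0, 1.9]

print(round(pearson(objective_scores, mos), 3))
```

A high correlation between a metric's output and the MOS is precisely what "highly correlated with real human perception" means in practice; full-reference metrics such as PSNR are only a baseline for the perceptual models studied here.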


  • Tagged media-aware multimodal content annotation

Approaches to multimedia content access based solely on content analysis have not delivered widely accepted solutions. User activities in social networks, such as tagging, annotating, and rating multimedia content, provide an entirely new view on how to solve the multimedia content access problem.

The goal of this research is to find new models of interaction between automatic multimedia content analysis and social tagging. This project draws inspiration from successful services and products such as Flickr, Facebook, YouTube, MySpace, and many others. Our research will address the challenge of efficiently managing and organizing image collections by enriching images with a semantic context.
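As a purely hypothetical sketch of such an interaction (not the project's actual model), the snippet below combines a visual classifier's confidence for a concept with evidence from user-supplied social tags via a simple weighted late fusion; the concept, tags, scores, and weight are all invented for illustration:

```python
def tag_score(image_tags, related_tags):
    """Fraction of a concept's related tags that appear on the image."""
    if not related_tags:
        return 0.0
    hits = sum(1 for t in related_tags if t in image_tags)
    return hits / len(related_tags)

def fuse(content_score, social_score, alpha=0.6):
    """Weighted late fusion; alpha balances content analysis vs. tag evidence."""
    return alpha * content_score + (1 - alpha) * social_score

# One image: a visual classifier's confidence for the concept "beach",
# plus the tags a user attached on a photo-sharing service.
classifier_confidence = 0.55
image_tags = {"sand", "sea", "holiday"}
beach_related = {"sand", "sea", "sun", "waves"}

score = fuse(classifier_confidence, tag_score(image_tags, beach_related))
print(round(score, 2))
```

Even this crude fusion shows the intent: tags contributed by users can reinforce or correct a weak automatic classification, which is the kind of interaction between content analysis and social tagging this research investigates with more principled models.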


Results and resources

Multimodal quality metrics for multimedia content abstraction

  • Subjective evaluation of next-generation video compression algorithms: a case study (presented at SPIE’10) [paper]
  • Subjective evaluation of scalable video coding for content distribution (presented at ACM MM’10) [paper]
  • Gesture and touch controlled video player interface for mobile devices (presented at ACM MM’10) [paper]
  • Audio-visual asynchrony detection in multimedia content (presented at NEM’10) [paper] [presentation]

Tagged media-aware multimodal content annotation