By Alan Hanjalic (auth.), Klaus Schoeffmann, Bernard Merialdo, Alexander G. Hauptmann, Chong-Wah Ngo, Yiannis Andreopoulos, Christian Breiteneder (eds.)

This book constitutes the refereed proceedings of the 18th International Multimedia Modeling Conference, MMM 2012, held in Klagenfurt, Austria, in January 2012. The 38 revised regular papers, 12 special session papers, 15 poster session papers, and 6 demo session papers were carefully reviewed and selected from 142 submissions. The papers are organized in the following topical sections: annotation, annotation and interactive multimedia applications, event and activity, mining and mobile multimedia applications, search, summarization and visualization, visualization and advanced multimedia systems, and the special sessions: interactive and immersive entertainment and communication, multimedia preservation: how to ensure multimedia access over time, multi-modal and cross-modal search, and video surveillance.

Most importantly, their training scenario is radically different from ours: in theirs, ground-truth segment labels are available at training time. Therefore, they address a different task, known as semantic segmentation in the literature [14,20], which can be seen as the fully supervised version of segment-level annotation. Note that several earlier methods proposed for image label prediction actually perform segment-level annotation. Early methods based on probabilistic models [2,5,19] describe the image as an orderless bag of segments.

Again, TagProp will be used for $p(L_l(Y) \mid Y)$, while $p(L_l(Y) \mid \{y_r\})$ can be obtained by Maximum Prediction (D) from any segment-level method, or by using Global SegProp (sec. 2). As in the previous section, we refer to these as “TagProp×Token” and “TagProp×SegProp”.

3 TagProp + Global SegProp (F)

We propose a novel and more elaborate technique to predict image labels by combining image-level and segment-level information. We include both segment neighbors (as in Global SegProp) and image neighbors (as in TagProp):

$$p(L_l(Y) \mid \{y_r\}, I) = \sum_{i}^{N_s} \pi^{S}_{y i}\, p(L_l(s_i)) + \sum_{i}^{N_I} \pi^{I}_{y i}\, p(L_l(I_i)) \tag{17}$$

Note that there are two sets of weights, $\pi^S$ for segment neighbors, and $\pi^I$ for image neighbors.
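The weighted combination of segment neighbors and image neighbors described above can be illustrated with a minimal sketch. It only evaluates the two weighted sums for one label; it does not show how the weights are learned. The function name, the argument names, and the convention that all weights are non-negative and jointly sum to 1 are assumptions made for this illustration, not details taken from the text.

```python
from typing import Sequence


def combined_label_probability(
    segment_weights: Sequence[float],
    segment_label_probs: Sequence[float],
    image_weights: Sequence[float],
    image_label_probs: Sequence[float],
) -> float:
    """Combine segment-neighbor and image-neighbor evidence for one label.

    Hypothetical helper: each neighbor contributes its label probability
    scaled by its weight, and the two weighted sums are added.
    Assumption (not from the text): weights are non-negative and the two
    weight sets jointly sum to 1, so the result stays in [0, 1].
    """
    if len(segment_weights) != len(segment_label_probs):
        raise ValueError("one weight is required per segment neighbor")
    if len(image_weights) != len(image_label_probs):
        raise ValueError("one weight is required per image neighbor")

    # Weighted vote of the segment neighbors (the pi^S weights).
    segment_term = sum(w * p for w, p in zip(segment_weights, segment_label_probs))
    # Weighted vote of the image neighbors (the pi^I weights).
    image_term = sum(w * p for w, p in zip(image_weights, image_label_probs))
    return segment_term + image_term


# Toy example: two segment neighbors and two image neighbors.
score = combined_label_probability(
    segment_weights=[0.3, 0.2],
    segment_label_probs=[1.0, 0.0],  # only the first segment neighbor carries the label
    image_weights=[0.4, 0.1],
    image_label_probs=[1.0, 1.0],    # both image neighbors carry the label
)
```

In this toy setting, three of the four neighbors carry the label, and the segment neighbor that lacks it withholds its weight of 0.2 from the score.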

Also, they cannot take advantage of training images that are only locally similar to a test image. We propose several ways to combine recent image-level and segment-level techniques to predict both image and segment labels jointly. We cast our experimental study in a unified framework for both image-level and segment-level annotation tasks. On three challenging datasets, our joint prediction of image and segment labels outperforms either prediction alone on both tasks. This confirms that the two levels offer complementary information.

