The Object-Oriented Content Transmission Metamodel

The basic DSPOOM metamodel has proven useful for modeling DSP processes and applications. The present chapter is concerned with the applicability of the metamodel to more abstract processes, related to the DSP domain but also to higher-level semantic domains. It turns out that this new metamodel, an instance of the basic DSPOOM, fits well with the idea of Content Transmission. We introduced this idea in [Amatriain and Herrera, 2001a] and later developed it in [Amatriain and Herrera, 2001b], where the Object-Oriented Content Transmission Metamodel (OOCTM) was first presented as such. In the present chapter we will highlight its main features.

As audio and music processing applications increase their level of abstraction and approach the end-user level, it seems clear that one of the goals is to step up from the signal processing realm and directly address the content level of an audio source. The term content processing is therefore becoming commonly accepted [Camurri, 1999,Chiariglione, 2000,Karjalainen, 1999]. Content processing is a general term that includes applications such as content analysis, content-based transformations, or content-based retrieval.

While manual annotation has been used for many years in different applications, the focus is now on finding automatic content extraction and content processing tools. An increasing number of projects focus on the extraction of meaningful features from an audio signal. Meanwhile, standards like MPEG-7 [Martínez, 2002,Manjunath et al., 2002] are trying to find a convenient way of describing audiovisual content. Nevertheless, content description is usually thought of as an additional information stream attached to the actual content, and the only envisioned scenario is that of a search and retrieval framework.

The basic idea when implementing a content processing scheme is to have a previous analysis step in which the content of the signal is identified and described. This description can then be classified, transmitted, or transformed. Moreover, if we are able to find a thorough and reliable description, we can think of forgetting about the signal and concentrating on processing only its description. And, as will be discussed later, the goal of finding an appropriate content description is very much related to the task of identifying and describing the so-called Sound Objects.
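The chain sketched above (analyze, transmit only the description, then process or resynthesize from it) can be illustrated with a minimal sketch. All names here (`analyze`, `transmit`, `synthesize`, `Description`) and the toy descriptors are invented for the example; they are not part of any actual OOCTM implementation.

```python
# Hypothetical sketch of the content-transmission chain: an analysis step
# extracts a description from the signal, only the description travels,
# and a synthesis step produces new content from that description alone.
from dataclasses import dataclass, field

@dataclass
class Description:
    """A toy content description: named descriptors plus child objects."""
    label: str
    descriptors: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def analyze(signal):
    """Stand-in analysis: derive a (trivial) description from a signal."""
    return Description(
        label="sound",
        descriptors={"num_samples": len(signal),
                     "peak": max(abs(s) for s in signal)},
    )

def transmit(description):
    """In the metamodel only the description is transmitted, not the samples."""
    return description  # the channel is modeled as identity here

def synthesize(description):
    """Stand-in synthesis: produce new content from the description only."""
    n = description.descriptors["num_samples"]
    peak = description.descriptors["peak"]
    return [peak] * n  # any signal consistent with the description

received = transmit(analyze([0.1, -0.5, 0.3]))
resynthesized = synthesize(received)
```

Note that the synthesized signal need not match the original sample by sample; it only has to be consistent with the transmitted description.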

Bearing these previous ideas in mind, a metamodel of content transmission (see Figure 5.1) is proposed as a general framework for content-based applications. As we will later see, any content-based application can be modeled as a subset of this metamodel.

Figure 5.1: The Object-Oriented Content Transmission Metamodel
\includegraphics[%
width=0.90\textwidth,
keepaspectratio]{images/ch5-OOCTM/ps/OOCTMBasicBlockDiagram.eps}

The metamodel is based on an analysis-synthesis process. Therefore, the only data involved in the transmission step is the content description, taking the form of metadata. A multilevel content description tree is used as an efficient representation of the identified Sound Object hierarchy. Several technologies are available for representing content descriptions but, taking into account our experience in MPEG-7's standardization process [Petters et al., 1999], in a general situation we would recommend an XML-based metadata language such as MPEG-7's DDL.
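A multilevel description tree of this kind can be serialized as XML metadata along the following lines. The element and attribute names (`SoundObject`, `Descriptor`) and the two-level phrase/note hierarchy are invented for the example; a real deployment would follow a schema such as MPEG-7's DDL rather than this ad-hoc layout.

```python
# Illustrative sketch: build a two-level Sound Object hierarchy
# (a phrase containing two notes) and serialize it as XML metadata.
import xml.etree.ElementTree as ET

def sound_object(label, **descriptors):
    """Create a SoundObject element carrying named descriptors."""
    elem = ET.Element("SoundObject", label=label)
    for name, value in descriptors.items():
        d = ET.SubElement(elem, "Descriptor", name=name)
        d.text = str(value)
    return elem

phrase = sound_object("phrase", tempo=120)
phrase.append(sound_object("note", pitch="A4", duration=0.5))
phrase.append(sound_object("note", pitch="C5", duration=0.5))

xml_metadata = ET.tostring(phrase, encoding="unicode")
print(xml_metadata)
```

Nesting child `SoundObject` elements inside their parent is what makes the description multilevel: each level of the Sound Object hierarchy maps to one level of the XML tree.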

A property derived from our OOCTM is that if there is a suitable content description, the actual content itself may no longer be needed and we can concentrate on transmitting only its description. Thus, the receiver should be able to interpret the information that, in the form of metadata, is available at its inputs, and synthesize new content relying only on this description. It is possibly in the music field where this last step has been developed furthest, and that fact allows us to think of such a transmission scheme being available in the near future.

The OOCTM is not concerned with encoding fidelity or signal distortion in the classical sense. A signal has been correctly encoded and transmitted if its meaning has not changed substantially. And what ``substantially'' means depends on the particular application: in some applications we may need to keep signal-level fidelity, while in others we may only be interested in the approximate content in abstract terms.

Audio representations can be classified according to the following properties [Vercoe et al., 1998]: encodability, synthesizability, generality, meaningfulness, accuracy, efficiency, and compactness. A representation is said to be encodable if it can be directly derived from the waveform. On the other hand, it is said to be synthesizable if an ``appropriate'' sound can be obtained from the representation. A representation that is both encodable and synthesizable is said to be invertible.

The more general a representation is, the more kinds of sounds it will be applicable to. Sound representations that are highly semantic (or meaningful) use parameters with a clear high-level meaning and are easier to manipulate. Accuracy is a measure of how perceptually similar the synthesized sound is to the original one, while efficiency is a measure of how far redundancy can be exploited in a given representation. Finally, compactness is just the ratio between accuracy and redundancy: a measure of how much redundancy can be exploited while maintaining a certain level of accuracy.
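The invertibility property above can be made concrete with a toy round trip: a representation that keeps only length and peak amplitude is encodable (directly derived from the waveform) and synthesizable (a sound can be produced from it), hence invertible, yet its accuracy is low because the round trip is lossy. The `encode`/`decode` functions and the similarity measure are invented for this illustration, not taken from [Vercoe et al., 1998].

```python
# Toy illustration of invertibility = encodability + synthesizability.
def encode(signal):
    """Encodable: the representation is derived directly from the waveform."""
    return {"length": len(signal), "peak": max(abs(s) for s in signal)}

def decode(rep):
    """Synthesizable: an 'appropriate' sound is produced from the representation."""
    return [rep["peak"]] * rep["length"]

def accuracy(original, synthesized):
    """Crude stand-in for perceptual similarity: 1 / (1 + mean abs error)."""
    err = sum(abs(a - b) for a, b in zip(original, synthesized)) / len(original)
    return 1.0 / (1.0 + err)

signal = [0.1, -0.5, 0.3]
roundtrip = decode(encode(signal))  # invertible, but lossy
print(accuracy(signal, roundtrip))
```

A more accurate representation would need more parameters, which is exactly the accuracy/compactness trade-off discussed above.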

Throughout this chapter we will use the word ``descriptor'' very often. Although this concept has already been brought up in different parts of the Thesis, it is worth reproducing its definition here. We will use the one given in the MPEG-7 standard, where a descriptor is defined as ``a representation of a feature that defines its syntax and semantics'', and a feature is defined as ``a distinctive characteristic of the data which signifies something to somebody''.

The OOCTM is related to some existing models and metamodels. It is interesting to note that such a transmission model can be seen as a step beyond Shannon and Weaver's traditional communication model [Shannon and Weaver, 1949]. We will further discuss this issue in section 5.3.1. The metamodel is also closely related to Structured Audio, a metamodel for audio and music transmission included in the MPEG-4 standard [Scheirer, 1999c]. We will comment on and exploit this relation in section 5.3.2.

In the next sections, we will particularize this metamodel to the case of audio and music content transmission and give some details on each component's functionality.




2004-10-18