Deep Multimodal Fusion for Persuasiveness Prediction

Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of the audience. As social multimedia becomes an important channel for propagating ideas and opinions, the ability to analyze persuasiveness automatically becomes increasingly valuable. In this work, we use the publicly available Persuasive Opinion Multimedia (POM) dataset to study persuasion. One of the challenges associated with this problem is the limited amount of annotated data. To tackle this challenge, we present a deep multimodal fusion architecture that leverages complementary information from the individual modalities to predict persuasiveness. Our methods show significant improvement in performance over previous approaches. Figure 1 shows an overview of our proposed architecture.

Figure 1: Architecture of our deep multimodal fusion model

Since automatic recognition of persuasiveness is not a trivial task, it is important to identify and use the most informative features for predicting it. We use high-level features so that we can identify and interpret the factors that most strongly differentiate persuasive videos from non-persuasive ones. The feature set for each modality is described below.

Visual Descriptors: Presence and intensity of seven primary emotions and valence, activation of twenty elementary action units, head position and orientation

Acoustic Descriptors: Voice quality, prosody, MFCC

Text Descriptors: Sparse vector of tf-idf
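To make the text descriptor concrete, here is a minimal sketch of how a sparse tf-idf vector can be built. It assumes whitespace tokenization and the plain tf × log(N/df) weighting; the tokenizer and weighting variant used in the paper are not stated here, and the function name `tfidf_vectors` and the example reviews are purely illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict: term -> weight) per document.

    Assumption: whitespace tokenization and tf * log(N / df) weighting.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            weight = tf * idf
            if weight > 0.0:  # keep the vector sparse: drop zero weights
                vec[term] = weight
        vectors.append(vec)
    return vectors


# Hypothetical transcripts, for illustration only.
reviews = [
    "this movie is great and the acting is great",
    "this movie is boring",
    "the plot is weak and the acting is weak",
]
vectors = tfidf_vectors(reviews)
```

Terms that occur in every document (such as "is" above) receive zero weight and are dropped, so each vector stores only the terms that help distinguish one transcript from the others.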

To select the features with the most predictive power and to remove redundant ones, we perform a t-test on each visual, acoustic, and verbal feature, comparing its values in persuasive and non-persuasive instances. We select the features with p-values below 0.05. Figure 2 compares the modalities in predicting persuasiveness using all features and using only the selected features.
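The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the released code: it uses SciPy's `ttest_ind` with its default equal-variance setting (the exact t-test variant is not specified here), and the function name `select_features`, the 0/1 label encoding, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy import stats


def select_features(X, y, alpha=0.05):
    """Return indices of feature columns with p < alpha, plus all p-values.

    X: (n_samples, n_features) feature matrix.
    y: binary labels, 1 = persuasive, 0 = non-persuasive (assumed encoding).
    """
    persuasive = X[y == 1]
    non_persuasive = X[y == 0]
    # Independent two-sample t-test, run separately on every feature column.
    _, p_values = stats.ttest_ind(persuasive, non_persuasive, axis=0)
    return np.flatnonzero(p_values < alpha), p_values


# Synthetic data for illustration: only feature 0 depends on the label.
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 5))
X[:, 0] += 2.0 * y

selected, p_values = select_features(X, y)
X_selected = X[:, selected]  # reduced feature matrix passed to the classifier
```

Because each feature is tested independently, a 0.05 threshold will occasionally admit an uninformative feature by chance; the sketch applies no multiple-comparison correction.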

Figure 2: Comparison between different modalities in predicting persuasiveness

Being persuasive depends on the way someone conveys a message, which can be through visual, acoustic, and verbal signals. Given the multimodal nature of persuasiveness, multimodal approaches are expected to perform better than unimodal ones. We explored several fusion methods, including early fusion and late fusion, using averaging and deep neural networks. Figure 3 summarizes the comparison between the multimodal fusion approaches.
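The difference between the two fusion families can be illustrated with a short sketch. It assumes early fusion means concatenating the per-modality feature vectors before a single classifier, and late fusion by averaging means taking the mean of per-modality classifier probabilities. The function names, array shapes, and probability values are hypothetical, and the deep-neural-network fusion variant is not shown.

```python
import numpy as np


def early_fusion(visual, acoustic, text):
    """Concatenate per-modality feature matrices into one joint matrix."""
    return np.concatenate([visual, acoustic, text], axis=1)


def late_fusion_average(*modality_probs):
    """Average the persuasiveness probabilities predicted per modality."""
    return np.mean(np.stack(modality_probs, axis=0), axis=0)


# Early fusion: 4 videos with 3 visual, 2 acoustic, and 5 text features.
visual = np.ones((4, 3))
acoustic = np.zeros((4, 2))
text = np.full((4, 5), 0.5)
joint = early_fusion(visual, acoustic, text)  # one classifier sees all 10 features

# Late fusion: each modality has its own classifier; combine their outputs.
p_visual = np.array([0.9, 0.2, 0.6, 0.4])
p_acoustic = np.array([0.7, 0.4, 0.5, 0.5])
p_text = np.array([0.8, 0.3, 0.6, 0.3])
fused = late_fusion_average(p_visual, p_acoustic, p_text)
predictions = (fused >= 0.5).astype(int)  # 1 = persuasive
```

Early fusion lets a single model learn interactions across modalities but increases input dimensionality, which matters when annotated data is limited; late fusion keeps each model small and combines only their decisions.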

Figure 3: Comparison between different multimodal fusion approaches

For more information, please see the original paper. The code for this paper is publicly available.

Related Publications

- Deep Multimodal Fusion for Persuasiveness Prediction
Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrusaitis, and Louis-Philippe Morency, International Conference on Multimodal Interaction (ICMI), 2016.
[PDF]