Deep Multimodal Fusion for Persuasiveness Prediction
Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of an audience. As social multimedia becomes an important channel for propagating ideas and opinions, the ability to analyze persuasiveness automatically grows in importance. In this work, we use the publicly available Persuasive Opinion Multimedia (POM) dataset to study persuasion. One challenge of this problem is the limited amount of annotated data. To tackle it, we present a deep multimodal fusion architecture that leverages complementary information from the individual modalities to predict persuasiveness. Our methods show a significant improvement in performance over previous approaches. Figure 1 shows an overview of our proposed architecture.

Figure 1: Architecture of our deep multimodal fusion model
Because automatic recognition of persuasiveness is not a trivial task, it is important to identify and use the features that are most informative for predicting it. We use high-level features so that we can identify and interpret the factors that contribute most to differentiating persuasive videos from non-persuasive ones. The feature set for each modality is described below.
Visual Descriptors: Presence and intensity of seven primary emotions and valence, activation of twenty elementary action units, head position and orientation
Acoustic Descriptors: Voice quality, prosody, MFCC
Text Descriptors: Sparse vector of tf-idf
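As an illustration of the text descriptor, the following is a minimal sketch of how a sparse tf-idf vector could be computed for a set of transcripts. The whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the function name `tfidf_vectors` are assumptions made for this example; the exact tf-idf variant used in the project is not specified here.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict: term -> weight) per document.

    Assumptions for this sketch: lowercase whitespace tokenization,
    tf = term count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
            # Terms present in every document get idf = 0; drop them
            # so the representation stays sparse.
            if doc_freq[term] < n_docs
        })
    return vectors


# Hypothetical toy transcripts, not taken from the POM dataset.
vectors = tfidf_vectors(["great movie great acting", "boring movie"])
```

Storing each vector as a dictionary keeps only the non-zero entries, which is what makes the representation sparse when the vocabulary is large.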
To select the features with the most predictive power and to remove redundant ones, we perform a t-test on each visual, acoustic, and verbal feature, comparing its values in persuasive instances against its values in non-persuasive instances. We keep the features with p-values below 0.05. Figure 2 compares the modalities in predicting persuasiveness using all features and using only the selected features.

Figure 2: Comparison between different modalities in predicting persuasiveness
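The t-test-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's t-test), the function name `select_features` is hypothetical, and the data is synthetic. The text only states that a t-test with a 0.05 threshold is used, so the choice of variant is an assumption.

```python
import numpy as np
from scipy.stats import ttest_ind


def select_features(X, y, alpha=0.05):
    """Return indices of features whose p-value is below alpha, plus all p-values.

    X: (n_samples, n_features) feature matrix.
    y: binary labels, 1 = persuasive, 0 = non-persuasive.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    persuasive = X[y == 1]
    non_persuasive = X[y == 0]
    # One two-sample t-test per feature column.
    # equal_var=False (Welch's test) is an assumption of this sketch.
    _, p_values = ttest_ind(persuasive, non_persuasive, axis=0, equal_var=False)
    return np.flatnonzero(p_values < alpha), p_values


# Synthetic example: 100 instances, 5 features, only feature 0 is informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[y == 1, 0] += 2.0

selected, p_values = select_features(X, y)
X_selected = X[:, selected]  # reduced feature matrix passed to the classifier
```

Because each feature is tested independently, some uninformative features can still fall below the threshold by chance when many features are tested.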
Being persuasive depends on how someone conveys a message, which can be through visual, acoustic, and verbal signals. Given this multimodal nature of persuasiveness, multimodal approaches are expected to perform better than unimodal ones. We explored several fusion methodologies: early fusion, and late fusion using either averaging or deep neural networks. Figure 3 summarizes the comparison between the multimodal fusion approaches.

Figure 3: Comparison between different multimodal fusion approaches
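To make the difference between the fusion strategies concrete, here is a small sketch of early fusion and of late fusion by averaging. It assumes that early fusion means concatenating the per-modality feature vectors before classification and that late fusion by averaging means taking the mean of per-modality predicted probabilities. The function names, the 0.5 decision threshold, and the toy numbers are assumptions for illustration only, and the deep-neural-network fusion is not covered here.

```python
import numpy as np


def early_fusion(visual, acoustic, text):
    """Concatenate per-modality feature matrices into one (n_samples, d_total) matrix."""
    return np.concatenate([visual, acoustic, text], axis=1)


def late_fusion_average(prob_visual, prob_acoustic, prob_text, threshold=0.5):
    """Average the per-modality probabilities of the 'persuasive' class.

    Returns the fused probabilities and the thresholded binary labels.
    """
    fused = np.mean([prob_visual, prob_acoustic, prob_text], axis=0)
    return fused, (fused >= threshold).astype(int)


# Early fusion: a single classifier would be trained on the joint feature matrix.
visual = np.zeros((4, 3))
acoustic = np.ones((4, 2))
text = np.full((4, 5), 2.0)
fused_features = early_fusion(visual, acoustic, text)  # shape (4, 10)

# Late fusion: each modality has its own classifier; only their outputs are combined.
# These probabilities are made-up values standing in for classifier outputs.
p_visual = np.array([0.9, 0.2, 0.6, 0.4])
p_acoustic = np.array([0.8, 0.1, 0.3, 0.5])
p_text = np.array([0.7, 0.3, 0.3, 0.9])
fused_prob, labels = late_fusion_average(p_visual, p_acoustic, p_text)
```

In the deep-neural-network variant of late fusion, the fixed average would be replaced by a learned model that takes the unimodal outputs as input.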
Related Publications
- Deep Multimodal Fusion for Persuasiveness Prediction
Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrusaitis, and Louis-Philippe Morency, International Conference on Multimodal Interaction (ICMI), 2016. [PDF]