Stance prediction is the task of classifying the stance that a speaker's argument expresses towards a certain target (motion). In this project we address the setting where only two stances are possible (pro or against) and the motion is not given explicitly, but has to be inferred from the speech itself. The work is based on the IBMDebater "Debate Speech Analysis" dataset, which provides both speeches and their transcriptions, labeled with the motion and the stance.
The main focus of this project is to investigate whether combining spoken features with text improves stance detection performance.
We address the Stance Prediction task by developing three kinds of models:
- Text Model: predicts the stance from the speech transcription. Its core is DistilBERT.
- Audio Model: predicts the stance from the speech audio. Its core is wav2vec 2.0.
- Multimodal Model: predicts the stance from both the speech and its transcription, combining the two models above.
Each model consists of a pre-trained architecture with a classification head attached on top. The model is then fine-tuned by freezing the weights of the pre-trained backbone up to a certain layer and training the remaining layers together with the classification head, as sketched below.
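As a rough illustration of this setup, the sketch below builds a DistilBERT-based stance classifier with Hugging Face Transformers and PyTorch, freezing the lower encoder layers and training only the upper layers and the classification head. The number of frozen layers, head sizes, and dropout value are illustrative assumptions, not the exact configuration used in this project.

```python
# Illustrative sketch (not the project's exact code): DistilBERT with its
# lower transformer layers frozen and a binary classification head on top.
import torch.nn as nn
from transformers import DistilBertModel

class TextStanceClassifier(nn.Module):
    def __init__(self, num_frozen_layers=4, num_labels=2):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

        # Freeze the embeddings and the first `num_frozen_layers` transformer
        # blocks; only the remaining blocks and the head are fine-tuned.
        for param in self.encoder.embeddings.parameters():
            param.requires_grad = False
        for layer in self.encoder.transformer.layer[:num_frozen_layers]:
            for param in layer.parameters():
                param.requires_grad = False

        hidden = self.encoder.config.dim  # 768 for distilbert-base
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]  # [CLS]-position representation
        return self.classifier(cls_repr)        # logits for pro / against
```

The Audio Model follows the same recipe, with wav2vec 2.0 in place of DistilBERT as the frozen-up-to-a-layer backbone.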
The evaluation metric is accuracy. On the test set, the best Text Model achieved 93.82% accuracy, the best Audio Model 92.04%, and the best Multimodal Model 94.65%, showing a modest improvement from combining audio and text signals.
We wanted to investigate whether audio features can help predict the stance of political speeches without knowing the motion in advance. To explore this task we developed three different architectures:
- MulT-based model: combines DistilBERT and wav2vec 2.0 outputs through a series of MulT crossmodal transformer blocks (a sketch of such a block follows this list).
- BART for motion generation and stance classification: predicts the motion together with the stance by encoding the text with a BART encoder and using two separate BART decoders to extract features for the generative and the sequence-classification task, respectively. Through a series of crossmodal attentions, the extracted audio features are then combined with both the generative and the classification decoders.
- BART for stance classification: uses BART for the textual signal and combines the outputs of its decoder with those of wav2vec 2.0 through a series of crossmodal attentions.
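All three variants rely on crossmodal attention in the spirit of MulT, where features from one modality act as queries over the sequence of the other modality. The minimal sketch below shows one such block; the dimensions, head count, and residual/feed-forward layout are assumptions for illustration rather than the project's exact architecture.

```python
# Minimal sketch of a MulT-style crossmodal attention block: text features
# query the audio sequence (or vice versa). Dimensions are illustrative.
import torch
import torch.nn as nn

class CrossmodalAttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, queries, context):
        # queries: (batch, len_q, dim)  e.g. DistilBERT/BART token features
        # context: (batch, len_kv, dim) e.g. wav2vec 2.0 frame features
        q = self.norm_q(queries)
        kv = self.norm_kv(context)
        attended, _ = self.attn(q, kv, kv)
        x = queries + attended          # residual connection
        return x + self.ffn(x)          # position-wise feed-forward + residual

# Example: text tokens attending to audio frames
text_feats = torch.randn(2, 128, 768)   # (batch, tokens, dim)
audio_feats = torch.randn(2, 400, 768)  # (batch, frames, dim)
fused = CrossmodalAttentionBlock()(text_feats, audio_feats)
```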
The code requires Python >= 3.7 as well as the libraries listed in requirements.txt.
Install the required modules:

    pip install -U pip
    pip install -r requirements.txt
Name | Email | Username
---|---|---
Lorenzo Pratesi | [email protected] | Prahtz |
Martina Rossini | [email protected] | mwritescode |
Riccardo Foschi | [email protected] | snifus |
Vairo Di Pasquale | [email protected] | vairodp |