Project author: youngbin-ro

Project description:
Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features

Primary language: Python
Repository: git://github.com/youngbin-ro/audiotext-transformer.git
Created: 2020-10-17T05:50:22Z
Project page: https://github.com/youngbin-ro/audiotext-transformer



korean-audiotext-transformer

Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features


Overview


  • STEP1: Convert the input audio to text using the Google ASR API
  • STEP2: Extract MFCC features from the input audio
  • STEP3: Run masked language modeling (MLM) on KoBERT using colloquial review texts crawled from various services
  • STEP4: Extract a sentence embedding by feeding the text from STEP1 into the BERT model trained in STEP3
  • STEP5: Obtain a fused representation by feeding the MFCC features (from STEP2) and the sentence embedding (from STEP4) into the Crossmodal Transformer (Tsai et al., 2019); a minimal sketch of this fusion follows the list
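
STEP5 is the core of the model: text and audio are fused with cross-attention rather than simple concatenation. Below is a minimal PyTorch sketch of a crossmodal block in the spirit of Tsai et al. (2019), not the exact implementation in this repository; the shapes and hyper-parameters (d_model=64, n_heads=8) only mirror the training flags shown further down.

```python
import torch
import torch.nn as nn

class CrossmodalBlock(nn.Module):
    """Illustrative text -> audio cross-attention layer (a sketch, not this repo's code)."""

    def __init__(self, d_model=64, n_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text, audio):
        # text:  (batch, text_len, d_model)  -- provides the queries
        # audio: (batch, audio_len, d_model) -- provides keys and values
        fused, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + fused)       # residual over the text stream
        return self.norm2(x + self.ff(x))  # position-wise feed-forward with residual

# toy usage: 20 text tokens attend over 400 MFCC frames, both projected to d_model=64
block = CrossmodalBlock()
out = block(torch.randn(1, 20, 64), torch.randn(1, 400, 64))
print(out.shape)  # torch.Size([1, 20, 64])
```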


Datasets

Korean Multimodal Sentiment Analysis Dataset

  • 자율지능 디지털 동반자 감정 분류용 데이터셋 (emotion classification dataset from the Autonomous Intelligent Digital Companion project; registration and authorization are required for download)
  • Classes: 분노(anger), 공포(fear), 중립(neutrality), 슬픔(sadness), 행복(happiness), 놀람(surprise), 혐오(disgust)
  • Provided Modality: Video, Audio, Text
    • Only the audio and text data are used
    • At test time, text is obtained from the audio via ASR rather than from the provided transcripts
    • The vision modality is not considered in this project
  • Train / Dev / Test Split

    • Based on audio: 8278 / 1014 / 1003
    • Based on text: 280 / 35 / 35
  • Preprocess

    • Place the downloaded dataset as follows:

```
korean-audiotext-transformer/
└── data/
    └── 4.1 감정분류용 데이터셋/
        ├── 000/
        ├── 001/
        ├── 002/
        ├── ...
        ├── 099/
        ├── participant_info.xlsx
        ├── rename_file.sh
        ├── Script.hwp
        └── test.py
```

    • Convert Script.hwp to script.txt (the directory name contains spaces, so quote the path):

```
cd "data/4.1 감정분류용 데이터셋"
hwp5txt Script.hwp --output script.txt
```

    • Generate {train, dev, test}.pkl:

```
python preprocess.py \
    raw_path='./data/4.1 감정분류용 데이터셋' \
    script_path='./data/4.1 감정분류용 데이터셋/script.txt' \
    save_path='./data' \
    train_size=.8
```

  • Preprocessed Output (train.pkl)

| person_idx | audio | sentence | emotion |
| --- | --- | --- | --- |
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 2 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 7 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 12 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 17 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
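
The generated .pkl files can be inspected directly with pandas to verify the preprocessing. This is a minimal sketch, assuming each file stores a DataFrame with the four columns shown above; it is not part of the repository:

```python
import pandas as pd

# load the preprocessed training split and check its structure
train = pd.read_pickle('./data/train.pkl')
print(train.shape)
print(train.columns.tolist())            # expected: person_idx, audio, sentence, emotion
print(train['emotion'].value_counts())   # class distribution over the 7 emotions
print(train.iloc[0]['sentence'])         # transcript of the first sample
```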


Usage

Prerequisites

  • We recommend using a conda environment for setup:

```
conda create -n <your_env_name> python=3.6
conda activate <your_env_name>
conda install pip
pip install -r requirements.txt
```

Before training,

Download the fine-tuned BERT model and place it as follows:

```
korean-audiotext-transformer/
└── KoBERT/
    ├── args.bin
    ├── config.json
    ├── model.bin
    ├── tokenization.py
    └── vocab.list
```
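
A minimal loading sketch, assuming model.bin is a PyTorch state dict compatible with Hugging Face's BertModel and config.json is a standard BERT config; the repository's own loading code may differ:

```python
import torch
from transformers import BertConfig, BertModel

# build the encoder skeleton from the shipped config, then load the fine-tuned weights
config = BertConfig.from_json_file('./KoBERT/config.json')
model = BertModel(config)
state_dict = torch.load('./KoBERT/model.bin', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates naming/head mismatches
model.eval()
```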

Train AudioText Transformer

  • The hyper-parameters specified below are the best-performing ones on the development set

```
python train.py \
    --data_path='./data' \
    --bert_path='./KoBERT' \
    --save_path='./result' \
    --attn_dropout=.2 \
    --relu_dropout=.1 \
    --emb_dropout=.2 \
    --res_dropout=.1 \
    --out_dropout=.1 \
    --n_layers=2 \
    --d_model=64 \
    --n_heads=8 \
    --lr=1e-5 \
    --epochs=10 \
    --batch_size=64 \
    --clip=1.0 \
    --warmup_percent=.1 \
    --max_len_audio=400 \
    --sample_rate=48000 \
    --resample_rate=16000 \
    --n_fft_size=400 \
    --n_mfcc=40
```
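
The audio flags (--sample_rate, --resample_rate, --n_fft_size, --n_mfcc, --max_len_audio) describe a standard MFCC front end. Below is a minimal librosa sketch of such a pipeline, under the assumption that the repository's preprocessing behaves similarly; the function name is illustrative only:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sample_rate=48000, resample_rate=16000,
                 n_fft_size=400, n_mfcc=40, max_len_audio=400):
    # load at the recording rate, then downsample to 16 kHz
    y, _ = librosa.load(wav_path, sr=sample_rate)
    y = librosa.resample(y, orig_sr=sample_rate, target_sr=resample_rate)

    # 40-dimensional MFCCs with a 400-sample FFT window -> shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=resample_rate, n_mfcc=n_mfcc, n_fft=n_fft_size)

    # truncate or zero-pad the time axis to max_len_audio frames
    mfcc = mfcc[:, :max_len_audio]
    if mfcc.shape[1] < max_len_audio:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_len_audio - mfcc.shape[1])))
    return mfcc.T  # (max_len_audio, n_mfcc)
```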

Train Audio-Only Baseline

```
python train.py --only_audio \
    --n_layers=4 \
    --n_heads=8 \
    --lr=1e-3 \
    --epochs=10 \
    --batch_size=64
```

Train Text-Only Baseline

```
python train.py --only_text \
    --n_layers=4 \
    --n_heads=8 \
    --lr=1e-3 \
    --epochs=10 \
    --batch_size=64
```

Evaluate Models

```
python eval.py [--FLAGS]
```
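
The tables in the Results section report per-emotion F1 and a total. Below is a minimal scikit-learn sketch of how such scores can be computed from predictions; the exact aggregation used by eval.py (e.g. weighted vs. macro averaging) is an assumption here:

```python
from sklearn.metrics import f1_score

LABELS = ['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오']

def report_f1(y_true, y_pred):
    # per-class F1 in the label order above, plus a weighted overall score
    per_class = f1_score(y_true, y_pred, labels=LABELS, average=None)
    total = f1_score(y_true, y_pred, labels=LABELS, average='weighted')
    for name, score in zip(LABELS, per_class):
        print(f'{name}: {100 * score:.2f}')
    print(f'Total: {100 * total:.2f}')
```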


Results

Text-Only Baseline


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 33.95 | 75.00 | 33.33 | 44.44 | 22.22 | 18.18 | 44.44 | 0.00 |

(Confusion matrix for the text-only baseline)


Audio-Only Baseline


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 35.28 | 31.84 | 42.68 | 24.71 | 47.32 | 35.80 | 44.52 | 20.12 |

(Confusion matrix for the audio-only baseline)


Multimodal (Crossmodal) Transformer


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 52.54 | 44.18 | 34.44 | 50.95 | 81.81 | 34.28 | 65.93 | 56.19 |

(Confusion matrix for the crossmodal transformer)


References