Project author: youngbin-ro

Project description:
Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features

Primary language: Python
Repository: git://github.com/youngbin-ro/audiotext-transformer.git
Created: 2020-10-17T05:50:22Z
Project page: https://github.com/youngbin-ro/audiotext-transformer



korean-audiotext-transformer

Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features


Overview


  • STEP1: Convert the input audio to text using the Google ASR API
  • STEP2: Extract MFCC features from the input audio
  • STEP3: Run masked language modeling (MLM) on KoBERT using colloquial review texts crawled from various services
  • STEP4: Extract a sentence embedding by feeding the text from STEP1 into the BERT model trained in STEP3
  • STEP5: Obtain a fused representation by feeding the MFCC features (from STEP2) and the sentence embedding (from STEP4) into the Crossmodal Transformer (Tsai et al., 2019); a minimal sketch of this fusion follows the list
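
STEP5 is the core of the model: text and audio are fused with cross-attention rather than simple concatenation. Below is a minimal PyTorch sketch of a crossmodal block in the spirit of Tsai et al. (2019), not the exact implementation in this repository; the shapes and hyper-parameters (d_model=64, n_heads=8) only mirror the training flags shown further down.

```python
import torch
import torch.nn as nn

class CrossmodalBlock(nn.Module):
    """Illustrative text -> audio cross-attention layer (a sketch, not this repo's code)."""

    def __init__(self, d_model=64, n_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text, audio):
        # text:  (batch, text_len, d_model)  -- provides the queries
        # audio: (batch, audio_len, d_model) -- provides keys and values
        fused, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + fused)       # residual over the text stream
        return self.norm2(x + self.ff(x))  # position-wise feed-forward with residual

# toy usage: 20 text tokens attend over 400 MFCC frames, both projected to d_model=64
block = CrossmodalBlock()
out = block(torch.randn(1, 20, 64), torch.randn(1, 400, 64))
print(out.shape)  # torch.Size([1, 20, 64])
```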


Datasets

Korean Multimodal Sentiment Analysis Dataset

  • 자율지능 디지털 동반자 감정 분류용 데이터셋 (emotion classification dataset from the Autonomous Intelligent Digital Companion project; registration and authorization are required for download)
  • Classes: 분노(anger), 공포(fear), 중립(neutrality), 슬픔(sadness), 행복(happiness), 놀람(surprise), 혐오(disgust)
  • Provided Modality: Video, Audio, Text
    • Only the audio and text data are used
    • At test time, text is obtained from the audio via ASR rather than from the provided transcripts
    • The vision modality is not considered in this project
  • Train / Dev / Test Split

    • Based on audio: 8278 / 1014 / 1003
    • Based on text: 280 / 35 / 35
  • Preprocess

    • Place the downloaded dataset as follows:

```
korean-audiotext-transformer/
└── data/
    └── 4.1 감정분류용 데이터셋/
        ├── 000/
        ├── 001/
        ├── 002/
        ├── ...
        ├── 099/
        ├── participant_info.xlsx
        ├── rename_file.sh
        ├── Script.hwp
        └── test.py
```

    • Convert Script.hwp to script.txt (the directory name contains spaces, so quote the path):

```
cd "data/4.1 감정분류용 데이터셋"
hwp5txt Script.hwp --output script.txt
```

    • Generate {train, dev, test}.pkl:

```
python preprocess.py \
    raw_path='./data/4.1 감정분류용 데이터셋' \
    script_path='./data/4.1 감정분류용 데이터셋/script.txt' \
    save_path='./data' \
    train_size=.8
```

  • Preprocessed Output (train.pkl)

| person_idx | audio | sentence | emotion |
| --- | --- | --- | --- |
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 2 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 7 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 12 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
| 17 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 오늘 입고 나가야지. | 행복 |
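
The generated .pkl files can be inspected directly with pandas to verify the preprocessing. This is a minimal sketch, assuming each file stores a DataFrame with the four columns shown above; it is not part of the repository:

```python
import pandas as pd

# load the preprocessed training split and check its structure
train = pd.read_pickle('./data/train.pkl')
print(train.shape)
print(train.columns.tolist())            # expected: person_idx, audio, sentence, emotion
print(train['emotion'].value_counts())   # class distribution over the 7 emotions
print(train.iloc[0]['sentence'])         # transcript of the first sample
```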


Usage

Prerequisites

  • We recommend using a conda environment for setup:

```
conda create -n <your_env_name> python=3.6
conda activate <your_env_name>
conda install pip
pip install -r requirements.txt
```

Before training,

Download the fine-tuned BERT model and place it as follows:

```
korean-audiotext-transformer/
└── KoBERT/
    ├── args.bin
    ├── config.json
    ├── model.bin
    ├── tokenization.py
    └── vocab.list
```
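
A minimal loading sketch, assuming model.bin is a PyTorch state dict compatible with Hugging Face's BertModel and config.json is a standard BERT config; the repository's own loading code may differ:

```python
import torch
from transformers import BertConfig, BertModel

# build the encoder skeleton from the shipped config, then load the fine-tuned weights
config = BertConfig.from_json_file('./KoBERT/config.json')
model = BertModel(config)
state_dict = torch.load('./KoBERT/model.bin', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates naming/head mismatches
model.eval()
```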

Train AudioText Transformer

  • The hyper-parameters specified below are the best-performing ones on the development set

```
python train.py \
    --data_path='./data' \
    --bert_path='./KoBERT' \
    --save_path='./result' \
    --attn_dropout=.2 \
    --relu_dropout=.1 \
    --emb_dropout=.2 \
    --res_dropout=.1 \
    --out_dropout=.1 \
    --n_layers=2 \
    --d_model=64 \
    --n_heads=8 \
    --lr=1e-5 \
    --epochs=10 \
    --batch_size=64 \
    --clip=1.0 \
    --warmup_percent=.1 \
    --max_len_audio=400 \
    --sample_rate=48000 \
    --resample_rate=16000 \
    --n_fft_size=400 \
    --n_mfcc=40
```
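
The audio flags (--sample_rate, --resample_rate, --n_fft_size, --n_mfcc, --max_len_audio) describe a standard MFCC front end. Below is a minimal librosa sketch of such a pipeline, under the assumption that the repository's preprocessing behaves similarly; the function name is illustrative only:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sample_rate=48000, resample_rate=16000,
                 n_fft_size=400, n_mfcc=40, max_len_audio=400):
    # load at the recording rate, then downsample to 16 kHz
    y, _ = librosa.load(wav_path, sr=sample_rate)
    y = librosa.resample(y, orig_sr=sample_rate, target_sr=resample_rate)

    # 40-dimensional MFCCs with a 400-sample FFT window -> shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=resample_rate, n_mfcc=n_mfcc, n_fft=n_fft_size)

    # truncate or zero-pad the time axis to max_len_audio frames
    mfcc = mfcc[:, :max_len_audio]
    if mfcc.shape[1] < max_len_audio:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_len_audio - mfcc.shape[1])))
    return mfcc.T  # (max_len_audio, n_mfcc)
```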

Train Audio-Only Baseline

```
python train.py --only_audio \
    --n_layers=4 \
    --n_heads=8 \
    --lr=1e-3 \
    --epochs=10 \
    --batch_size=64
```

Train Text-Only Baseline

```
python train.py --only_text \
    --n_layers=4 \
    --n_heads=8 \
    --lr=1e-3 \
    --epochs=10 \
    --batch_size=64
```

Evaluate Models

```
python eval.py [--FLAGS]
```
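
The tables in the Results section report per-emotion F1 and a total. Below is a minimal scikit-learn sketch of how such scores can be computed from predictions; the exact aggregation used by eval.py (e.g. weighted vs. macro averaging) is an assumption here:

```python
from sklearn.metrics import f1_score

LABELS = ['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오']

def report_f1(y_true, y_pred):
    # per-class F1 in the label order above, plus a weighted overall score
    per_class = f1_score(y_true, y_pred, labels=LABELS, average=None)
    total = f1_score(y_true, y_pred, labels=LABELS, average='weighted')
    for name, score in zip(LABELS, per_class):
        print(f'{name}: {100 * score:.2f}')
    print(f'Total: {100 * total:.2f}')
```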


Results

Text-Only Baseline


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 33.95 | 75.00 | 33.33 | 44.44 | 22.22 | 18.18 | 44.44 | 0.00 |

(Confusion matrix for the text-only baseline)


Audio-Only Baseline


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 35.28 | 31.84 | 42.68 | 24.71 | 47.32 | 35.80 | 44.52 | 20.12 |

(Confusion matrix for the audio-only baseline)


Multimodal (Crossmodal) Transformer


| Emotion | Total | 공포 | 놀람 | 분노 | 슬픔 | 중립 | 행복 | 혐오 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1-score | 52.54 | 44.18 | 34.44 | 50.95 | 81.81 | 34.28 | 65.93 | 56.19 |

(Confusion matrix for the crossmodal transformer)


References