Multi-head attention-based long short-term memory model for speech emotion recognition

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
38
Issue:
2
Page:
103-109
Research Field:
Computer Science and Engineering
Publishing date:
2022-06-20

Info

Title:
Multi-head attention-based long short-term memory model for speech emotion recognition
Author(s):
Zhao Yan 1, Zhao Li 1, Lu Cheng 1, Li Sunan 1, Tang Chuangao 2, Lian Hailun 1
1 School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
Keywords:
speech emotion recognition; long short-term memory (LSTM); multi-head attention mechanism; frame-level features; self-attention
CLC number:
TP37
DOI:
10.3969/j.issn.1003-7985.2022.02.001
Abstract:
To make full use of the information from different representation subspaces, a multi-head attention-based long short-term memory (LSTM) model is proposed in this study for speech emotion recognition (SER). The proposed model uses frame-level features and takes the temporal information of emotional speech as the input of the LSTM layer. A multi-head time-dimension attention (MHTA) layer is employed to linearly project the output of the LSTM layer into different subspaces and produce reduced-dimension context vectors. To provide complementary information from other dimensions, the output of MHTA, the output of feature-dimension attention, and the last time-step output of the LSTM are combined to form multiple context vectors that serve as the input of the fully connected layer. To improve the performance of these multiple vectors, feature-dimension attention is applied to the all-time output of the first LSTM layer. The proposed model is evaluated on the eNTERFACE and GEMEP corpora. The results indicate that it outperforms the LSTM baseline by 14.6% and 10.5% on eNTERFACE and GEMEP, respectively, demonstrating its effectiveness in SER tasks.
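
Illustrative sketch:
The sketch below shows, in PyTorch (the paper does not name a framework), one way the components described in the abstract could be wired together: stacked LSTM layers over frame-level features, multi-head attention along the time dimension (MHTA), feature-dimension attention on the first LSTM layer's all-time output, the last time-step output, and a fully connected layer over the concatenated context vectors. All layer sizes, the number of heads, the number of emotion classes, the frame-level feature dimension, and the exact form of the feature-dimension attention are illustrative assumptions, not the authors' implementation; MHTA is approximated here by standard multi-head self-attention over time followed by temporal pooling.

import torch
import torch.nn as nn

class MHTALSTM(nn.Module):
    """Sketch of an LSTM + multi-head time-dimension attention (MHTA) classifier.

    All hyperparameters and the exact attention formulations are assumptions
    made for illustration; they are not taken from the paper.
    """

    def __init__(self, n_features=76, hidden=128, n_heads=4, n_classes=6):
        super().__init__()
        # two stacked LSTM layers over frame-level acoustic features
        self.lstm1 = nn.LSTM(n_features, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        # multi-head attention applied along the time dimension (MHTA)
        self.mhta = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        # feature-dimension attention on the first LSTM layer's all-time output
        self.feat_attn = nn.Linear(hidden, hidden)
        # fully connected layer over the concatenated context vectors
        self.fc = nn.Linear(3 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, time, n_features)
        h1, _ = self.lstm1(x)                   # (B, T, H)
        h2, _ = self.lstm2(h1)                  # (B, T, H)

        # MHTA: project the LSTM output into subspaces, attend over time,
        # then pool to a reduced-dimension context vector
        attn_out, _ = self.mhta(h2, h2, h2)     # (B, T, H)
        mhta_ctx = attn_out.mean(dim=1)         # (B, H)

        # feature-dimension attention: weight the hidden dimensions of h1
        feat_w = torch.softmax(self.feat_attn(h1), dim=-1)
        feat_ctx = (feat_w * h1).mean(dim=1)    # (B, H)

        # last time-step output of the LSTM
        last_ctx = h2[:, -1, :]                 # (B, H)

        # multiple context vectors -> fully connected classifier
        ctx = torch.cat([mhta_ctx, feat_ctx, last_ctx], dim=-1)
        return self.fc(ctx)                     # (B, n_classes) emotion logits

# Example: a batch of 8 utterances, 300 frames, 76 assumed frame-level features
logits = MHTALSTM()(torch.randn(8, 300, 76))

Concatenating the three context vectors mirrors the abstract's use of the MHTA output, the feature-dimension attention output, and the last time-step LSTM output as joint input to the fully connected layer.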

References:

[1] Cowie R, Douglas-Cowie E, Tsapatsoulis N, et al. Emotion recognition in human-computer interaction[J]. IEEE Signal Processing Magazine, 2001, 18(1): 32-80. DOI: 10.1109/79.911197.
[2] Anagnostopoulos C N, Iliou T, Giannoukos I. Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011[J]. Artificial Intelligence Review, 2015, 43(2): 155-177. DOI: 10.1007/s10462-012-9368-5.
[3] El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases[J]. Pattern Recognition, 2011, 44(3): 572-587. DOI: 10.1016/j.patcog.2010.09.020.
[4] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China, 2016: 5200-5204. DOI: 10.1109/ICASSP.2016.7472669.
[5] Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, LA, USA, 2017: 2227-2231. DOI: 10.1109/ICASSP.2017.7952552.
[6] Tarantino L, Garner P N, Lazaridis A. Self-attention for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019: 2578-2582. DOI: 10.21437/Interspeech.2019-2822.
[7] Li Y, Zhao T, Kawahara T. Improved end-to-end speech emotion recognition using self-attention mechanism and multitask learning[C]//Interspeech. Graz, Austria, 2019: 2803-2807. DOI: 10.21437/Interspeech.2019-2594.
[8] Xie Y, Liang R, Liang Z, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1675-1685. DOI: 10.1109/TASLP.2019.2925934.
[9] Li R, Wu Z, Jia J, et al. Dilated residual network with multi-head self-attention for speech emotion recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK, 2019: 6675-6679. DOI: 10.1109/ICASSP.2019.8682154.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, CA, USA, 2017: 5998-6008. DOI: 10.7551/mitpress/8385.003.0003.
[11] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//NAACL-HLT. Minneapolis, MN, USA, 2019: 4171-4186. DOI: 10.18653/v1/n19-1423.
[12] Lian Z, Tao J, Liu B, et al. Unsupervised representation learning with future observation prediction for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019: 3840-3844. DOI: 10.21437/Interspeech.2019-1582.
[13] Tian Z, Yi J, Tao J, et al. Self-attention transducers for end-to-end speech recognition[C]//Interspeech. Graz, Austria, 2019: 4395-4399. DOI: 10.21437/Interspeech.2019-2203.
[14] Wang F, Jiang M, Qian C, et al. Residual attention network for image classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 3156-3164. DOI: 10.1109/CVPR.2017.683.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 770-778. DOI: 10.1109/CVPR.2016.90.
[16] Eyben F, Wöllmer M, Schuller B. openSMILE: The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia. New York, USA, 2010: 1459-1462. DOI: 10.1145/1873951.1874246.
[17] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge[C]//Interspeech. Makuhari, Japan, 2010: 2794-2797. DOI: 10.21437/Interspeech.2010-739.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735.
[19] Martin O, Kotsia I, Macq B, et al. The eNTERFACE'05 audio-visual emotion database[C]//International Conference on Data Engineering Workshops. Atlanta, GA, USA, 2006: 8. DOI: 10.1109/ICDEW.2006.145.
[20] Bänziger T, Mortillaro M, Scherer K R. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception[J]. Emotion, 2012, 12(5): 1161-1179. DOI: 10.1037/a0025827.
[21] Xie Y, Liang R, Liang Z, et al. Attention-based dense LSTM for speech emotion recognition[J]. IEICE Transactions on Information and Systems, 2019, 102(7): 1426-1429. DOI: 10.1587/transinf.2019EDL8019.
[22] Zhao J, Mao X, Chen L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks[J]. Biomedical Signal Processing and Control, 2019, 47: 312-323. DOI: 10.1016/j.bspc.2018.08.035.

Memo

Memo:
Biographies: Zhao Yan (1993—), male, Ph.D. candidate; Zhao Li (corresponding author), male, Ph.D., professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 61571106, 61633013, 61673108, 81871444).
Citation: Zhao Yan, Zhao Li, Lu Cheng, et al. Multi-head attention-based long short-term memory model for speech emotion recognition[J]. Journal of Southeast University (English Edition), 2022, 38(2): 103-109. DOI: 10.3969/j.issn.1003-7985.2022.02.001.
Last Update: 2022-06-20