
Zhao Yan, Zhao Li, Lu Cheng, Li Sunan, et al. Multi-head attention-based long short-term memory model for speech emotion recognition [J]. Journal of Southeast University (English Edition), 2022, 38(2): 103-109. [doi:10.3969/j.issn.1003-7985.2022.02.001]

Multi-head attention-based long short-term memory model for speech emotion recognition

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
38
Issue:
2
Page:
103-109
Research Field:
Computer Science and Engineering
Publishing date:
2022-06-20

Info

Title:
Multi-head attention-based long short-term memory model for speech emotion recognition
Author(s):
Zhao Yan1 Zhao Li1 Lu Cheng1 Li Sunan1 Tang Chuangao2 Lian Hailun1
1School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
Keywords:
speech emotion recognition; long short-term memory (LSTM); multi-head attention mechanism; frame-level features; self-attention
CLC number:
TP37
DOI:
10.3969/j.issn.1003-7985.2022.02.001
Abstract:
To make full use of information from different representation subspaces, a multi-head attention-based long short-term memory (LSTM) model is proposed in this study for speech emotion recognition (SER). The proposed model takes frame-level features, which carry the temporal information of emotional speech, as the input of the LSTM layer. A multi-head time-dimension attention (MHTA) layer then linearly projects the output of the LSTM layer into different subspaces to obtain reduced-dimension context vectors. To capture vital information from other dimensions, feature-dimension attention is applied to the all-time-step output of the first LSTM layer, and the output of MHTA, the output of feature-dimension attention, and the last time-step output of the LSTM are combined into multiple context vectors that serve as the input of the fully connected layer. The proposed model was evaluated on the eNTERFACE and GEMEP corpora. The results indicate that the proposed model outperforms LSTM by 14.6% and 10.5% on eNTERFACE and GEMEP, respectively, proving its effectiveness in SER tasks.
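The MHTA step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed sizes (50 frames, 128-dimensional LSTM outputs, 4 heads), not the authors' implementation; in a real model the per-head projection matrices and query vectors would be learned parameters rather than random draws:

```python
# Sketch of multi-head time-dimension attention (MHTA): project LSTM
# outputs into several subspaces, attend over the time axis in each
# subspace, and concatenate the per-head context vectors.
# All sizes and the per-head query vector u are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhta(h, n_heads=4):
    """h: LSTM outputs of shape (T, d). Returns a context vector of shape (n_heads * (d // n_heads),)."""
    T, d = h.shape
    d_k = d // n_heads
    contexts = []
    for _ in range(n_heads):
        W = rng.standard_normal((d, d_k)) / np.sqrt(d)   # per-head projection (learned in a real model)
        u = rng.standard_normal(d_k)                     # per-head query vector (learned in a real model)
        z = h @ W                                        # (T, d_k): subspace representation
        scores = softmax(z @ u / np.sqrt(d_k))           # (T,): attention weights over time steps
        contexts.append(scores @ z)                      # (d_k,): weighted sum over time
    return np.concatenate(contexts)

h = rng.standard_normal((50, 128))   # e.g. 50 frames of 128-dim LSTM output
ctx = mhta(h)
print(ctx.shape)                     # (128,)
```

In the full model, this concatenated context vector would be stacked with the feature-dimension attention output and the last time-step LSTM output before the fully connected layer.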

References:

[1] Cowie R, Douglas-Cowie E, Tsapatsoulis N, et al. Emotion recognition in human-computer interaction[J]. IEEE Signal Processing Magazine, 2001, 18(1):32-80. DOI:10.1109/79.911197.
[2] Anagnostopoulos C N, Iliou T, Giannoukos I. Features and classifiers for emotion recognition from speech:A survey from 2000 to 2011[J]. Artificial Intelligence Review, 2015, 43(2):155-177. DOI:10.1007/s10462-012-9368-5.
[3] El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition:Features, classification schemes, and databases[J]. Pattern Recognition, 2011, 44(3):572-587. DOI:10.1016/j.patcog.2010.09.020.
[4] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Shanghai, China, 2016:5200-5204. DOI:10.1109/ICASSP.2016.7472669.
[5] Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). New Orleans, LA, USA, 2017:2227-2231. DOI:10.1109/ICASSP.2017.7952552.
[6] Tarantino L, Garner P N, Lazaridis A. Self-attention for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:2578-2582. DOI:10.21437/Interspeech.2019-2822.
[7] Li Y, Zhao T, Kawahara T. Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning[C]//Interspeech. Graz, Austria, 2019:2803-2807. DOI:10.21437/Interspeech.2019-2594.
[8] Xie Y, Liang R, Liang Z, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11):1675-1685. DOI:10.1109/TASLP.2019.2925934.
[9] Li R, Wu Z, Jia J, et al. Dilated residual network with multi-head self-attention for speech emotion recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019:6675-6679. DOI:10.1109/ICASSP.2019.8682154.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, CA, USA, 2017:5998-6008. DOI:10.7551/mitpress/8385.003.0003.
[11] Devlin J, Chang M W, Lee K, et al. BERT:Pre-training of deep bidirectional transformers for language understanding[C]//NAACL-HLT. Minneapolis, MN, USA, 2019:4171-4186. DOI:10.18653/v1/n19-1423.
[12] Lian Z, Tao J, Liu B, et al. Unsupervised representation learning with future observation prediction for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:3840-3844. DOI:10.21437/Interspeech.2019-1582.
[13] Tian Z, Yi J, Tao J, et al. Self-attention transducers for end-to-end speech recognition[C]//Interspeech. Graz, Austria, 2019:4395-4399. DOI:10.21437/Interspeech.2019-2203.
[14] Wang F, Jiang M, Qian C, et al. Residual attention network for image classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA, 2017:3156-3164. DOI:10.1109/CVPR.2017.683.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, Nevada, USA, 2016:770-778. DOI:10.1109/CVPR.2016.90.
[16] Eyben F, Wöllmer M, Schuller B. openSMILE: The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia. New York, USA, 2010:1459-1462. DOI:10.1145/1873951.1874246.
[17] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge[C]//Interspeech. Makuhari, Japan, 2010:2794-2797. DOI:10.21437/Interspeech.2010-739.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780. DOI:10.1162/neco.1997.9.8.1735.
[19] Martin O, Kotsia I, Macq B, et al. The eNTERFACE’05 audio-visual emotion database[C]//International Conference on Data Engineering Workshops. Atlanta, GA, USA, 2006:8. DOI:10.1109/ICDEW.2006.145.
[20] Bänziger T, Mortillaro M, Scherer K R. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception[J]. Emotion, 2012, 12(5):1161-1179. DOI:10.1037/a0025827.
[21] Xie Y, Liang R, Liang Z, et al. Attention-based dense LSTM for speech emotion recognition[J]. IEICE Transactions on Information and Systems, 2019, 102(7):1426-1429. DOI:10.1587/transinf.2019EDL8019.
[22] Zhao J, Mao X, Chen L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks[J]. Biomedical Signal Processing and Control, 2019, 47:312-323.DOI:10.1016/j.bspc.2018.08.035.

Memo

Biographies: Zhao Yan (1993—), male, Ph.D. candidate; Zhao Li (corresponding author), male, Ph.D., professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China(No.61571106, 61633013, 61673108, 81871444).
Citation: Zhao Yan, Zhao Li, Lu Cheng, et al. Multi-head attention-based long short-term memory model for speech emotion recognition[J].Journal of Southeast University(English Edition), 2022, 38(2):103-109.DOI:10.3969/j.issn.1003-7985.2022.02.001.
Last Update: 2022-06-20