
Zhao Yan, Zhao Li, Lu Cheng, Li Sunan, et al. Multi-head attention-based long short-term memory model for speech emotion recognition [J]. Journal of Southeast University (English Edition), 2022, 38(2): 103-109. [doi:10.3969/j.issn.1003-7985.2022.02.001]

Multi-head attention-based long short-term memory model for speech emotion recognition

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
38
Issue:
2
Page:
103-109
Research Field:
Computer Science and Engineering
Publishing date:
2022-06-20

Info

Title:
Multi-head attention-based long short-term memory model for speech emotion recognition
Author(s):
Zhao Yan1 Zhao Li1 Lu Cheng1 Li Sunan1 Tang Chuangao2 Lian Hailun1
1School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
Keywords:
speech emotion recognition; long short-term memory (LSTM); multi-head attention mechanism; frame-level features; self-attention
CLC number:
TP37
DOI:
10.3969/j.issn.1003-7985.2022.02.001
Abstract:
To make full use of information from different representation subspaces, a multi-head attention-based long short-term memory (LSTM) model is proposed in this study for speech emotion recognition (SER). The proposed model takes frame-level features, which carry the temporal information of emotional speech, as the input of the LSTM layer. A multi-head time-dimension attention (MHTA) layer then linearly projects the output of the LSTM layer into different subspaces to obtain reduced-dimension context vectors. To capture vital information from other dimensions, feature-dimension attention is applied to the all-time-step output of the first LSTM layer, and the output of MHTA, the output of feature-dimension attention, and the last time-step output of the LSTM are combined into multiple context vectors that serve as the input of the fully connected layer. The proposed model was evaluated on the eNTERFACE and GEMEP corpora. The results indicate that the proposed model outperforms LSTM by 14.6% and 10.5% on eNTERFACE and GEMEP, respectively, proving its effectiveness in SER tasks.
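The MHTA step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed sizes (50 frames, 128-dimensional LSTM outputs, 4 heads), not the authors' implementation; in a real model the per-head projection matrices and query vectors would be learned parameters rather than random draws:

```python
# Sketch of multi-head time-dimension attention (MHTA): project LSTM
# outputs into several subspaces, attend over the time axis in each
# subspace, and concatenate the per-head context vectors.
# All sizes and the per-head query vector u are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhta(h, n_heads=4):
    """h: LSTM outputs of shape (T, d). Returns a context vector of shape (n_heads * (d // n_heads),)."""
    T, d = h.shape
    d_k = d // n_heads
    contexts = []
    for _ in range(n_heads):
        W = rng.standard_normal((d, d_k)) / np.sqrt(d)   # per-head projection (learned in a real model)
        u = rng.standard_normal(d_k)                     # per-head query vector (learned in a real model)
        z = h @ W                                        # (T, d_k): subspace representation
        scores = softmax(z @ u / np.sqrt(d_k))           # (T,): attention weights over time steps
        contexts.append(scores @ z)                      # (d_k,): weighted sum over time
    return np.concatenate(contexts)

h = rng.standard_normal((50, 128))   # e.g. 50 frames of 128-dim LSTM output
ctx = mhta(h)
print(ctx.shape)                     # (128,)
```

In the full model, this concatenated context vector would be stacked with the feature-dimension attention output and the last time-step LSTM output before the fully connected layer.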

References:

[1] Cowie R, Douglas-Cowie E, Tsapatsoulis N, et al. Emotion recognition in human-computer interaction[J]. IEEE Signal Processing Magazine, 2001, 18(1):32-80. DOI:10.1109/79.911197.
[2] Anagnostopoulos C N, Iliou T, Giannoukos I. Features and classifiers for emotion recognition from speech:A survey from 2000 to 2011[J]. Artificial Intelligence Review, 2015, 43(2):155-177. DOI:10.1007/s10462-012-9368-5.
[3] El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition:Features, classification schemes, and databases[J]. Pattern Recognition, 2011, 44(3):572-587. DOI:10.1016/j.patcog.2010.09.020.
[4] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Shanghai, China, 2016:5200-5204. DOI:10.1109/ICASSP.2016.7472669.
[5] Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). New Orleans, LA, USA, 2017:2227-2231. DOI:10.1109/ICASSP.2017.7952552.
[6] Tarantino L, Garner P N, Lazaridis A. Self-attention for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:2578-2582. DOI:10.21437/Interspeech.2019-2822.
[7] Li Y, Zhao T, Kawahara T. Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning[C]//Interspeech. Graz, Austria, 2019:2803-2807. DOI:10.21437/Interspeech.2019-2594.
[8] Xie Y, Liang R, Liang Z, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11):1675-1685. DOI:10.1109/TASLP.2019.2925934.
[9] Li R, Wu Z, Jia J, et al. Dilated residual network with multi-head self-attention for speech emotion recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019:6675-6679. DOI:10.1109/ICASSP.2019.8682154.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, CA, USA, 2017:5998-6008. DOI:10.7551/mitpress/8385.003.0003.
[11] Devlin J, Chang M W, Lee K, et al. BERT:Pre-training of deep bidirectional transformers for language understanding[C]//NAACL-HLT. Minneapolis, MN, USA, 2019:4171-4186. DOI:10.18653/v1/n19-1423.
[12] Lian Z, Tao J, Liu B, et al. Unsupervised representation learning with future observation prediction for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:3840-3844. DOI:10.21437/Interspeech.2019-1582.
[13] Tian Z, Yi J, Tao J, et al. Self-attention transducers for end-to-end speech recognition[C]//Interspeech. Graz, Austria, 2019:4395-4399. DOI:10.21437/Interspeech.2019-2203.
[14] Wang F, Jiang M, Qian C, et al. Residual attention network for image classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA, 2017:3156-3164. DOI:10.1109/CVPR.2017.683.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, Nevada, USA, 2016:770-778. DOI:10.1109/CVPR.2016.90.
[16] Eyben F, Wöllmer M, Schuller B. openSMILE: The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia. New York, USA, 2010:1459-1462. DOI:10.1145/1873951.1874246.
[17] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge[C]//Interspeech. Makuhari, Japan, 2010:2794-2797. DOI:10.21437/Interspeech.2010-739.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780. DOI:10.1162/neco.1997.9.8.1735.
[19] Martin O, Kotsia I, Macq B, et al. The eNTERFACE’05 audio-visual emotion database[C]//International Conference on Data Engineering Workshops. Atlanta, GA, USA, 2006:8. DOI:10.1109/ICDEW.2006.145.
[20] Bänziger T, Mortillaro M, Scherer K R. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception[J]. Emotion, 2012, 12(5):1161-1179. DOI:10.1037/a0025827.
[21] Xie Y, Liang R, Liang Z, et al. Attention-based dense LSTM for speech emotion recognition[J]. IEICE Transactions on Information and Systems, 2019, 102(7):1426-1429. DOI:10.1587/transinf.2019EDL8019.
[22] Zhao J, Mao X, Chen L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks[J]. Biomedical Signal Processing and Control, 2019, 47:312-323.DOI:10.1016/j.bspc.2018.08.035.

Memo

Biographies: Zhao Yan (1993—), male, Ph.D. candidate; Zhao Li (corresponding author), male, Ph.D., professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China(No.61571106, 61633013, 61673108, 81871444).
Citation: Zhao Yan, Zhao Li, Lu Cheng, et al. Multi-head attention-based long short-term memory model for speech emotion recognition[J].Journal of Southeast University(English Edition), 2022, 38(2):103-109.DOI:10.3969/j.issn.1003-7985.2022.02.001.
Last Update: 2022-06-20