
Du Jing, Tang Manting, Zhao Li. Transformer-like model with linear attention for speech emotion recognition[J]. Journal of Southeast University (English Edition), 2021, 37(2): 164-170. DOI:10.3969/j.issn.1003-7985.2021.02.005.

Transformer-like model with linear attention for speech emotion recognition

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
37
Issue:
2
Page:
164-170
Research Field:
Information and Communication Engineering
Publishing date:
2021-06-20

Info

Title:
Transformer-like model with linear attention for speech emotion recognition
Author(s):
Du Jing1 Tang Manting2 Zhao Li1
1School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2School of Computational Engineering, Jinling Institute of Technology, Nanjing 211169, China
Keywords:
Transformer; attention mechanism; speech emotion recognition; fast softmax
PACS:
TN912.3;TP18
DOI:
10.3969/j.issn.1003-7985.2021.02.005
Abstract:
Because of the excellent performance of Transformer in sequence learning tasks such as natural language processing, an improved Transformer-like model suitable for speech emotion recognition is proposed. To alleviate the prohibitive time consumption and memory footprint caused by the softmax inside the multi-head attention unit of Transformer, a new linear self-attention algorithm is proposed: the original exponential function is replaced by its Taylor series expansion, and, on the basis of the associative property of matrix products, the time and space complexity of the softmax operation with respect to the input length is reduced from O(N²) to O(N), where N is the sequence length. Experimental results on emotional corpora in two languages show that the proposed linear attention algorithm achieves performance similar to that of the original scaled dot-product attention, while the training time and memory cost are reduced by half. Furthermore, the improved model obtains more robust performance on speech emotion recognition than the original Transformer.
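The abstract does not give the expansion order or the normalization details, so the following is a minimal NumPy sketch of the idea only, assuming a first-order Taylor approximation exp(x) ≈ 1 + x and L2-normalized queries and keys (which keeps the approximate attention weights non-negative); the function names are illustrative and not taken from the paper.

import numpy as np

def taylor_attention_quadratic(Q, K, V):
    """Reference version: explicit (N, N) weight matrix, O(N^2) time and memory."""
    Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # First-order Taylor approximation of exp(q.k): 1 + q.k, which is
    # non-negative because q.k lies in [-1, 1] for unit-norm vectors.
    W = 1.0 + Q @ K.T                      # (N, N)
    W = W / W.sum(axis=-1, keepdims=True)  # row-normalize, like softmax
    return W @ V

def taylor_attention_linear(Q, K, V):
    """Same result via the associative property of matrix products, O(N)."""
    Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = K / np.linalg.norm(K, axis=-1, keepdims=True)
    n, _ = K.shape
    kv = K.T @ V             # (d, d_v): sum_j k_j v_j^T, computed once
    k_sum = K.sum(axis=0)    # (d,)
    v_sum = V.sum(axis=0)    # (d_v,)
    # sum_j (1 + q_i.k_j) v_j  =  v_sum + q_i^T (K^T V)
    numer = v_sum + Q @ kv   # (N, d_v)
    # sum_j (1 + q_i.k_j)      =  N + q_i . k_sum
    denom = n + Q @ k_sum    # (N,)
    return numer / denom[:, None]

# The two versions agree up to floating-point error.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 128, 64))
assert np.allclose(taylor_attention_quadratic(Q, K, V),
                   taylor_attention_linear(Q, K, V))

Because the (N, N) weight matrix is never materialized, both time and memory grow linearly with the sequence length, consistent with the reduced training time and memory cost reported in the abstract.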

Memo

Memo:
Biographies: Du Jing (1997—), female, graduate; Zhao Li (corresponding author), male, doctor, professor, zhaoli@seu.edu.cn.
Foundation items: The National Key Research and Development Program of China (No. 2020YFC2004002, 2020YFC2004003), the National Natural Science Foundation of China (No. 61871213, 61673108, 61571106).
Citation: Du Jing, Tang Manting, Zhao Li. Transformer-like model with linear attention for speech emotion recognition[J]. Journal of Southeast University (English Edition), 2021, 37(2): 164-170. DOI:10.3969/j.issn.1003-7985.2021.02.005.
Last Update: 2021-06-20