
Du Jing, Tang Manting, Zhao Li. Transformer-like model with linear attention for speech emotion recognition [J]. Journal of Southeast University (English Edition), 2021, 37(2): 164-170. [doi:10.3969/j.issn.1003-7985.2021.02.005]

Transformer-like model with linear attention for speech emotion recognition

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
37
Issue:
2
Page:
164-170
Research Field:
Information and Communication Engineering
Publishing date:
2021-06-20

Info

Title:
Transformer-like model with linear attention for speech emotion recognition
Author(s):
Du Jing 1, Tang Manting 2, Zhao Li 1
1 School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Computational Engineering, Jinling Institute of Technology, Nanjing 211169, China
Keywords:
transformer; attention mechanism; speech emotion recognition; fast softmax
PACS:
TN912.3;TP18
DOI:
10.3969/j.issn.1003-7985.2021.02.005
Abstract:
Motivated by the excellent performance of the Transformer in sequence learning tasks such as natural language processing, an improved Transformer-like model suitable for speech emotion recognition is proposed. To alleviate the prohibitive time consumption and memory footprint caused by the softmax inside the multi-head attention unit of the Transformer, a new linear self-attention algorithm is proposed, in which the original exponential function is replaced by a Taylor series expansion. By exploiting the associative property of matrix products, the time and space complexity of the softmax operation with respect to the input length is reduced from O(N²) to O(N), where N is the sequence length. Experimental results on emotional corpora in two languages show that the proposed linear attention algorithm achieves performance similar to that of the original scaled dot-product attention while cutting the training time and memory cost roughly in half. Furthermore, the improved model is more robust in speech emotion recognition than the original Transformer.
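
The idea described in the abstract can be illustrated with a short sketch. The NumPy example below (not the authors' code) contrasts standard softmax attention with a Taylor-series linear attention: exp(q·k) is approximated by its first-order expansion 1 + q·k, and the associativity of matrix products is used to precompute the key-value summary KᵀV once, so the cost grows linearly with the sequence length N. The first-order truncation and the L2 normalization of queries and keys are illustrative assumptions; the paper's exact expansion order and scaling may differ.

import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: builds the full N x N score
    # matrix, hence O(N^2) time and memory in the sequence length N.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_taylor_attention(Q, K, V):
    # Linear-complexity approximation: exp(q.k) ~ 1 + q.k (first-order
    # Taylor term). The sums over keys are precomputed once via the
    # associativity of matrix products, so the N x N matrix is never formed.
    # Queries and keys are L2-normalized so that 1 + q.k >= 0 and the
    # weights stay a valid nonnegative, normalized distribution; this
    # normalization is an illustrative choice, not necessarily the paper's
    # exact formulation.
    N = Q.shape[0]
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-6)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-6)
    kv_sum = Kn.T @ V              # (d, d_v): sum_j k_j v_j^T
    k_sum = Kn.sum(axis=0)         # (d,):     sum_j k_j
    v_sum = V.sum(axis=0)          # (d_v,):   sum_j v_j
    numer = v_sum + Qn @ kv_sum    # (N, d_v)
    denom = N + Qn @ k_sum         # (N,)
    return numer / denom[:, None]

# Toy usage: both variants map (N, d) queries/keys and (N, d_v) values to an
# (N, d_v) output; only the linear variant scales linearly with N.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_taylor_attention(Q, K, V).shape)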

References:

[1] Akçay M B, Oguz K.Speech emotion recognition:Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers[J].Speech Communication, 2020, 116:56-76.DOI:10.1016/j.specom.2019.12.001.
[2] Hochreiter S, Schmidhuber J.Long short-term memory[J].Neural Computation, 1997, 9(8):1735-1780.DOI:10.1162/neco.1997.9.8.1735.
[3] Chung J, Gulcehre C, Cho K, et al.Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL].(2014)[2020-08-01].https://arxiv.org/abs/1412.3555.
[4] Mirsamadi S, Barsoum E, Zhang C.Automatic speech emotion recognition using recurrent neural networks with local attention[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing.New Orleans, LA, USA, 2017:2227-2231.DOI:10.1109/ICASSP.2017.7952552.
[5] Greff K, Srivastava R K, Koutník J, et al.LSTM:A search space odyssey[J].IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(10):2222-2232.DOI:10.1109/TNNLS.2016.2582924.
[6] Thakker U, Dasika G, Beu J, et al.Measuring scheduling efficiency of RNNs for NLP applications[EB/OL].(2019)[2020-08-01].https://arxiv.org/abs/1904.03302.
[7] Vaswani A, Shazeer N, Parmar N, et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.Long Beach, CA, USA, 2017:5998-6008.
[8] India M, Safari P, Hernando J.Self multi-head attention for speaker recognition[C]//Interspeech 2019.Graz, Austria, 2019:4305-4309.DOI:10.21437/interspeech.2019-2616.
[9] Busso C, Bulut M, Lee C C, et al.IEMOCAP:interactive emotional dyadic motion capture database[J].Language Resources and Evaluation, 2008, 42(4):335-359.DOI:10.1007/s10579-008-9076-6.
[10] Lian Z, Tao J H, Liu B, et al.Conversational emotion analysis via attention mechanisms[C]//Interspeech 2019.Graz, Austria, 2019:1936-1940.DOI:10.21437/interspeech.2019-1577.
[11] Li R N, Wu Z Y, Jia J, et al.Dilated residual network with multi-head self-attention for speech emotion recognition[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing.Brighton, UK, 2019:6675-6679.DOI:10.1109/ICASSP.2019.8682154.
[12] Devlin J, Chang M W, Lee K, et al.BERT:Pre-training of deep bidirectional transformers for language understanding[EB/OL].(2019)[2020-08-01].https://arxiv.org/abs/1810.04805.
[13] Hendrycks D, Gimpel K.Gaussian error linear units(GELUs)[EB/OL].(2016)[2020-08-01].https://arxiv.org/abs/1606.08415.
[14] He K M, Zhang X Y, Ren S Q, et al.Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Las Vegas, NV, USA, 2016:770-778.DOI:10.1109/CVPR.2016.90.
[15] Burkhardt F, Paeschke A, Rolfes M, et al.A database of German emotional speech[C]//Interspeech 2005.Lisbon, Portugal, 2005:1517-1520.
[16] Latif S, Qayyum A, Usman M, et al.Cross lingual speech emotion recognition:Urdu vs.western languages[C]//2018 International Conference on Frontiers of Information Technology (FIT).Islamabad, Pakistan, 2018:88-93.DOI:10.1109/FIT.2018.00023.
[17] Nediyanchath A, Paramasivam P, Yenigalla P.Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition[C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP).Barcelona, Spain, 2020:7179-7183.DOI:10.1109/ICASSP40776.2020.9054073.
[18] Chavan V M, Gohokar V V.Speech emotion recognition by using SVM-classifier[J].International Journal of Engineering & Advanced Technology, 2012(5):11-15.
[19] Xi Y X, Li P C, Song Y, et al.Speaker to emotion:Domain adaptation for speech emotion recognition with residual adapters[C]//2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.Lanzhou, China, 2019:513-518.DOI:10.1109/APSIPAASC47483.2019.9023339.

Memo

Memo:
Biographies: Du Jing (1997—), female, graduate; Zhao Li (corresponding author), male, doctor, professor, zhaoli@seu.edu.cn.
Foundation items: The National Key Research and Development Program of China (No. 2020YFC2004002, 2020YFC2004003), the National Natural Science Foundation of China (No. 61871213, 61673108, 61571106).
Citation: Du Jing, Tang Manting, Zhao Li. Transformer-like model with linear attention for speech emotion recognition[J].Journal of Southeast University(English Edition), 2021, 37(2):164-170.DOI:10.3969/j.issn.1003-7985.2021.02.005.
Last Update: 2021-06-20