[1] Cowie R, Douglas-Cowie E, Tsapatsoulis N, et al. Emotion recognition in human-computer interaction[J]. IEEE Signal Processing Magazine, 2001, 18(1):32-80. DOI:10.1109/79.911197.
[2] Anagnostopoulos C N, Iliou T, Giannoukos I. Features and classifiers for emotion recognition from speech:A survey from 2000 to 2011[J]. Artificial Intelligence Review, 2015, 43(2):155-177. DOI:10.1007/s10462-012-9368-5.
[3] El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition:Features, classification schemes, and databases[J]. Pattern Recognition, 2011, 44(3):572-587. DOI:10.1016/j.patcog.2010.09.020.
[4] Trigeorgis G, Ringeval F, Brueckner R, et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Shanghai, China, 2016:5200-5204. DOI:10.1109/ICASSP.2016.7472669.
[5] Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). New Orleans, LA, USA, 2017:2227-2231. DOI:10.1109/ICASSP.2017.7952552.
[6] Tarantino L, Garner P N, Lazaridis A. Self-attention for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:2578-2582. DOI:10.21437/Interspeech.2019-2822.
[7] Li Y, Zhao T, Kawahara T. Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning[C]//Interspeech. Graz, Austria, 2019:2803-2807. DOI:10.21437/Interspeech.2019-2594.
[8] Xie Y, Liang R, Liang Z, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11):1675-1685. DOI:10.1109/TASLP.2019.2925934.
[9] Li R, Wu Z, Jia J, et al. Dilated residual network with multi-head self-attention for speech emotion recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019:6675-6679. DOI:10.1109/ICASSP.2019.8682154.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, CA, USA, 2017:5998-6008. DOI:10.7551/mitpress/8385.003.0003.
[11] Devlin J, Chang M W, Lee K, et al. BERT:Pre-training of deep bidirectional transformers for language understanding[C]//NAACL-HLT. Minneapolis, MN, USA, 2019:4171-4186. DOI:10.18653/v1/n19-1423.
[12] Lian Z, Tao J, Liu B, et al. Unsupervised representation learning with future observation prediction for speech emotion recognition[C]//Interspeech. Graz, Austria, 2019:3840-3844. DOI:10.21437/Interspeech.2019-1582.
[13] Tian Z, Yi J, Tao J, et al. Self-attention transducers for end-to-end speech recognition[C]//Interspeech. Graz, Austria, 2019:4395-4399. DOI:10.21437/Interspeech.2019-2203.
[14] Wang F, Jiang M, Qian C, et al. Residual attention network for image classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017:3156-3164. DOI:10.1109/CVPR.2017.683.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016:770-778. DOI:10.1109/CVPR.2016.90.
[16] Eyben F, Wöllmer M, Schuller B. openSMILE:The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy, 2010:1459-1462. DOI:10.1145/1873951.1874246.
[17] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge[C]//Interspeech. Makuhari, Japan, 2010:2794-2797. DOI:10.21437/Interspeech.2010-739.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780. DOI:10.1162/neco.1997.9.8.1735.
[19] Martin O, Kotsia I, Macq B, et al. The eNTERFACE’05 audio-visual emotion database[C]//International Conference on Data Engineering Workshops. Atlanta, GA, USA, 2006:8. DOI:10.1109/ICDEW.2006.145.
[20] Bänziger T, Mortillaro M, Scherer K R. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception[J]. Emotion, 2012, 12(5):1161-1179. DOI:10.1037/a0025827.
[21] Xie Y, Liang R, Liang Z, et al. Attention-based dense LSTM for speech emotion recognition[J]. IEICE Transactions on Information and Systems, 2019, E102-D(7):1426-1429. DOI:10.1587/transinf.2019EDL8019.
[22] Zhao J, Mao X, Chen L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks[J]. Biomedical Signal Processing and Control, 2019, 47:312-323. DOI:10.1016/j.bspc.2018.08.035.