
[1] He Zhengran, Shen Qifan, Wu Jiaxin, Xu Mengyao, et al. Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition [J]. Journal of Southeast University (English Edition), 2023, 39 (1): 68-73. [doi:10.3969/j.issn.1003-7985.2023.01.008]

Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
39
Issue:
2023(1)
Page:
68-73
Research Field:
Computer Science and Engineering
Publishing date:
2023-03-20

Info

Title:
Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition
Author(s):
He Zhengran1 Shen Qifan1 Wu Jiaxin2 Xu Mengyao3 Zhao Li1
1School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2School of Electronic Science and Engineering, Southeast University, Nanjing 210096, China
3School of Computer Science and Software Engineering, University of Stirling, Stirling FK9 4LA, UK
Keywords:
speech emotion recognition; transformer; multihead attention mechanism; fusion feature
PACS:
TP391.42
DOI:
10.3969/j.issn.1003-7985.2023.01.008
Abstract:
To improve the accuracy of speech emotion recognition (SER), the possibility of applying a transformer-based model to SER is explored. The log Mel-scale spectrogram and its first-order differential feature are fused as the input, from which hierarchical speech representations are extracted with the transformer encoder. The effects of varying the number of attention heads and the number of transformer-encoder layers on recognition accuracy are discussed. The results show that the accuracy of the proposed model is 13.98%, 8.14%, 24.34%, 8.16%, and 20.9% higher than that of the transformer with Mel-frequency cepstral coefficients as the input feature on the ABC, CASIA, DES, EMODB, and IEMOCAP databases, respectively. Compared with recurrent neural networks, convolutional neural networks, transformer-based models, and other models, the proposed model performs better.
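The pipeline described in the abstract can be outlined in code. The following is a minimal sketch, not the authors' implementation, assuming librosa for the log Mel-scale spectrogram and its first-order differential and a PyTorch transformer encoder whose number of attention heads and encoder layers can be varied as in the paper's experiments; the file name, feature dimensions, model width, and mean-pooled classifier head are illustrative assumptions.

import librosa
import numpy as np
import torch
import torch.nn as nn

def fused_feature(wav_path, sr=16000, n_mels=64):
    # Log Mel-scale spectrogram fused with its first-order differential (delta).
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)        # (n_mels, time)
    delta = librosa.feature.delta(log_mel, order=1)       # first-order differential
    fused = np.concatenate([log_mel, delta], axis=0)      # (2 * n_mels, time)
    return torch.from_numpy(fused.T).float()              # (time, 2 * n_mels)

class TransformerSER(nn.Module):
    # Transformer encoder over the fused feature sequence; num_heads and
    # num_layers are the two hyperparameters varied in the experiments.
    def __init__(self, input_dim=128, d_model=256, num_heads=4,
                 num_layers=4, num_classes=5):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, time, input_dim) -> emotion logits (batch, num_classes)
        h = self.encoder(self.proj(x))                    # hierarchical representations
        return self.classifier(h.mean(dim=1))             # mean-pool over time

# Example usage (hypothetical audio file), sweeping heads/layers:
# feats = fused_feature("sample.wav").unsqueeze(0)        # (1, time, 128)
# model = TransformerSER(num_heads=8, num_layers=6)
# logits = model(feats)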

References:

[1] Li D, Liu J, Yang Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J].Expert Systems with Applications, 2021, 173(3):114683. DOI: 10.1016/j.eswa.2021.114683.
[2] Issa D, Demirci M F, Yazici A. Speech emotion recognition with deep convolutional neural networks[J].Biomedical Signal Processing and Control, 2020, 59:101894. DOI: 10.1016/j.bspc.2020.101894.
[3] Chen M, He X, Jing Y, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J].IEEE Signal Processing Letters, 2018, 25(10):1440-1444. DOI: 10.1109/LSP.2018.2860246.
[4] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[C]//2015 International Conference on Learning Representations. San Diego, CA, USA, 2015:1-15. DOI: 10.48550/arXiv.1409.0473.
[5] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//31st Conference on Neural Information Processing Systems. Long Beach, CA, USA, 2017:1-15. DOI: 10.48550/arXiv.1706.03762.
[6] Cheng J, Dong L, Lapata M. Long short-term memory-networks for machine reading[C]//2016 Conference on Empirical Methods in Natural Language Processing. Austin, TX, USA, 2016:16. DOI: 10.18653/v1/D16-1053.
[7] Wang K, An N, Li B N, et al. Speech emotion recognition using Fourier parameters[J].IEEE Transactions on Affective Computing, 2015, 6(1):69-75. DOI: 10.1109/TAFFC.2015.2392101.
[8] Engberg I S, Hansen A V. Documentation of the Danish emotional speech database (DES)[R]. Aalborg, Denmark: Center for Person Kommunikation, 1996.
[9] Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech[C]//9th European Conference on Speech Communication and Technology. Lisbon, Portugal, 2005:15-30.
[10] Busso C, Bulut M, Lee C C, et al. IEMOCAP: Interactive emotional dyadic motion capture database[J].Language Resources and Evaluation, 2008, 42(4):335-359.

Memo

Memo:
Biographies: He Zhengran (1993—), male, Ph.D. candidate; Zhao Li (corresponding author), male, Ph.D., professor, zhaoli@seu.edu.cn.
Foundation item: The Key Research and Development Program of Jiangsu Province (No. BE2022059-3).
Citation: He Zhengran, Shen Qifan, Wu Jiaxin, et al. Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition[J]. Journal of Southeast University (English Edition), 2023, 39(1):68-73. DOI:10.3969/j.issn.1003-7985.2023.01.008.
Last Update: 2023-03-20