
[1] He Zhengran, Shen Qifan, Wu Jiaxin, Xu Mengyao, et al. Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition [J]. Journal of Southeast University (English Edition), 2023, 39 (1): 68-73. [doi:10.3969/j.issn.1003-7985.2023.01.008]

Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
39
Issue:
1
Page:
68-73
Research Field:
Computer Science and Engineering
Publishing date:
2023-03-20

Info

Title:
Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition
Author(s):
He Zhengran1 Shen Qifan1 Wu Jiaxin2 Xu Mengyao3 Zhao Li1
1School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2School of Electronic Science and Engineering, Southeast University, Nanjing 210096, China
3School of Computer Science and Software Engineering, University of Stirling, Stirling FK9 4LA, UK
Keywords:
speech emotion recognition; transformer; multihead attention mechanism; fusion feature
PACS:
TP391.42
DOI:
10.3969/j.issn.1003-7985.2023.01.008
Abstract:
To improve the accuracy of speech emotion recognition (SER), the possibility of applying a transformer-based model to SER is explored. The log Mel-scale spectrogram and its first-order differential feature are fused as the input, from which hierarchical speech representations are extracted using a transformer encoder. The effects of varying the number of attention heads and the number of transformer-encoder layers on recognition accuracy are discussed. The results show that the accuracy of the proposed model increases by 13.98%, 8.14%, 24.34%, 8.16%, and 20.9% over that of a transformer with the Mel-frequency cepstral coefficient as the input feature on the ABC, CASIA, DES, EMODB, and IEMOCAP databases, respectively. Compared with recurrent neural networks, convolutional neural networks, transformer-based models, and other models, the proposed model performs better.
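The abstract's pipeline (a log-Mel spectrogram fused with its first-order differential, then fed to a transformer encoder with multihead self-attention) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the log-Mel frames are random placeholders (a real front end would compute them from audio, e.g. with librosa), the projection weights are random stand-ins for learned parameters, and all function names are ours.

```python
import numpy as np

def add_delta(features, width=1):
    """Fuse first-order differential (delta) features with the static ones.

    features: (time, n_mels) log-Mel spectrogram.
    Returns (time, 2 * n_mels): static features concatenated with deltas.
    """
    # Central first-order difference with edge padding, a common delta approximation.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    delta = (padded[2 * width:] - padded[:-2 * width]) / (2.0 * width)
    return np.concatenate([features, delta], axis=-1)

def multihead_self_attention(x, num_heads, rng):
    """Minimal multihead self-attention over x of shape (time, d_model)."""
    t, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices in this sketch.
    wq, wk, wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    # Project, then split the feature axis into heads: (heads, time, d_head).
    q = (x @ wq).reshape(t, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(t, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(t, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, softmax over the key (last) axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (heads, t, t)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    # Merge heads back into a single (time, d_model) representation.
    return (weights @ v).transpose(1, 0, 2).reshape(t, d_model)

rng = np.random.default_rng(0)
logmel = rng.standard_normal((100, 40))    # 100 frames, 40 Mel bands (placeholder)
fused = add_delta(logmel)                  # (100, 80): log-Mel fused with its delta
encoded = multihead_self_attention(fused, num_heads=8, rng=rng)
print(fused.shape, encoded.shape)          # → (100, 80) (100, 80)
```

In the paper's setting, several such encoder layers would be stacked, and the number of heads and layers are the hyperparameters whose effect on accuracy the abstract reports.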

References:

[1] Li D, Liu J, Yang Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J].Expert Systems with Applications, 2021, 173(3):114683. DOI: 10.1016/j.eswa.2021.114683.
[2] Issa D, Demirci M F, Yazici A. Speech emotion recognition with deep convolutional neural networks[J].Biomedical Signal Processing and Control, 2020, 59:101894. DOI: 10.1016/j.bspc.2020.101894.
[3] Chen M, He X, Jing Y, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J].IEEE Signal Processing Letters, 2018, 25(10):1440-1444. DOI: 10.1109/LSP.2018.2860246.
[4] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[C]//2015 International Conference on Learning Representations. San Diego, CA, USA, 2015:1-15. DOI: 10.48550/arXiv.1409.0473.
[5] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//31st Conference on Neural Information Processing Systems. Long Beach, CA, USA, 2017:1-15. DOI: 10.48550/arXiv.1706.03762.
[6] Cheng J, Dong L, Lapata M. Long short-term memory-networks for machine reading[C]//2016 Conference on Empirical Methods in Natural Language Processing. Austin, TX, USA, 2016:16. DOI: 10.18653/v1/D16-1053.
[7] Wang K, An N, Li B N, et al. Speech emotion recognition using Fourier parameters[J].IEEE Transactions on Affective Computing, 2017, 6(1):69-75. DOI: 10.1109/TAFFC.2015.2392101.
[8] Engberg I S, Hansen A V. Documentation of the Danish emotional speech database (DES)[R]. Aalborg, Denmark: Center for Person Kommunikation, 1996.
[9] Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech[C]//9th European Conference on Speech Communication and Technology. Lisbon, Portugal, 2005:15-30.
[10] Busso C, Bulut M, Lee C C, et al. IEMOCAP: Interactive emotional dyadic motion capture database[J].Language Resources and Evaluation, 2008, 42(4):335-359.

Memo

Memo:
Biographies: He Zhengran (1993—), male, Ph.D. candidate; Zhao Li (corresponding author), male, Ph.D., professor, zhaoli@seu.edu.cn.
Foundation item: The Key Research and Development Program of Jiangsu Province(No. BE2022059-3).
Citation: He Zhengran, Shen Qifan, Wu Jiaxin, et al. Transformer encoder-based multilevel representations with fusion feature input for speech emotion recognition[J]. Journal of Southeast University (English Edition), 2023, 39(1):68-73. DOI: 10.3969/j.issn.1003-7985.2023.01.008.
Last Update: 2023-03-20