Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume: 33
Issue: 4 (2017)
Page: 444-447
Research Field: Image Processing
Publishing date: 2017-12-30

Info

Title:
Multimodal emotion recognition based on deep neural network
Author(s):
Ye Jiayin, Zheng Wenming, Li Yang, Cai Youyi, Cui Zhen
School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210029, China
Keywords:
emotion recognition; convolutional neural network (CNN); recurrent neural network (RNN)
CLC number:
TP751
DOI:
10.3969/j.issn.1003-7985.2017.04.009
Abstract:
To increase the accuracy of emotion recognition from voice and video, a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is used to encode and integrate the two information sources. For the audio signals, several frequency bands as well as energy functions are extracted as low-level features with a dedicated audio feature-extraction tool, and these features are encoded by a one-dimensional (1D) CNN to obtain high-level features. Finally, the high-level features are fed into an RNN to capture dynamic tone changes along the temporal dimension. In parallel, a two-dimensional (2D) CNN and a similar RNN are used to capture dynamic facial-appearance changes over temporal sequences of video frames. The method was evaluated on the Chinese Natural Audio-Visual Emotion Database of the Chinese Conference on Pattern Recognition (CCPR) 2016. Experimental results demonstrate that the average classification precision of the proposed method is 41.15%, an increase of 16.62% over the baseline algorithm provided by CCPR 2016, which shows that the proposed method identifies emotional information more accurately.
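The abstract describes a two-branch deep network: a 1D CNN followed by an RNN for the audio stream, a 2D CNN followed by an RNN for the video stream, and a fusion of the two branch outputs for emotion classification. The Python/PyTorch sketch below is a minimal rendering of that idea, not the authors' implementation: the layer sizes, the GRU cells, the late-fusion linear classifier, the eight-class output, and the input shapes are all assumptions made for illustration.

import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """1D CNN over low-level audio features, followed by an RNN that
    models tone changes over time (sizes assumed for illustration)."""
    def __init__(self, n_features=40, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)

    def forward(self, x):          # x: (batch, n_features, time)
        h = self.conv(x)           # (batch, 64, time/2)
        h = h.transpose(1, 2)      # (batch, time/2, 64)
        _, last = self.rnn(h)      # last hidden state: (1, batch, hidden)
        return last.squeeze(0)     # (batch, hidden)

class VideoBranch(nn.Module):
    """2D CNN applied to each frame, followed by an RNN over the frame
    sequence to capture facial-appearance dynamics."""
    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # one 64-d vector per frame
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)

    def forward(self, x):          # x: (batch, time, 3, H, W)
        b, t = x.shape[:2]
        h = self.conv(x.flatten(0, 1)).flatten(1)   # (batch*time, 64)
        h = h.view(b, t, -1)                        # (batch, time, 64)
        _, last = self.rnn(h)
        return last.squeeze(0)

class MultimodalEmotionNet(nn.Module):
    """Late fusion of the two branches into an emotion classifier
    (the fusion scheme and class count are assumptions)."""
    def __init__(self, n_classes=8, hidden=128):
        super().__init__()
        self.audio = AudioBranch(hidden=hidden)
        self.video = VideoBranch(hidden=hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, video):
        fused = torch.cat([self.audio(audio), self.video(video)], dim=1)
        return self.classifier(fused)

# Smoke test with random tensors shaped like one short clip.
model = MultimodalEmotionNet()
logits = model(torch.randn(2, 40, 200), torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 8])

A real pipeline would feed the audio branch with frame-level low-level descriptors (the frequency-band and energy features mentioned in the abstract) and the video branch with sequences of cropped face frames; the random tensors above merely verify that the shapes are consistent.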

References:

[1] Zeng Z, Pantic M, Huang T S. Emotion recognition based on multimodal information [M]//Affective Information Processing. Springer, 2009: 241-265. DOI:10.1007/978-1-84800-306-4_14.
[2] Zeng Z, Pantic M, Roisman G I, et al. A survey of affect recognition methods: Audio, visual, and spontaneous expressions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(1): 39-58. DOI:10.1109/TPAMI.2008.52.
[3] El Ayadi M, Kamel M S, Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases [J]. Pattern Recognition, 2011, 44(3): 572-587. DOI:10.1016/j.patcog.2010.09.020.
[4] Chen L, Mao X, Xue Y, et al. Speech emotion recognition: Features and classification models [J]. Digital Signal Processing, 2012, 22(6): 1154-1160. DOI:10.1016/j.dsp.2012.05.007.
[5] Yan J, Wang X, Gu W, et al. Speech emotion recognition based on sparse representation [J]. Archives of Acoustics, 2013, 38(4): 465-470. DOI:10.2478/aoa-2013-0055.
[6] Zhao G, Pietikainen M. Dynamic texture recognition using local binary patterns with an application to facial expressions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915-928. DOI:10.1109/TPAMI.2007.1110.
[7] Petridis S, Gunes H, Kaltwang S, et al. Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities [C]//Proceedings of the 2009 International Conference on Multimodal Interfaces. Cambridge, MA, USA, 2009: 23-30. DOI:10.1145/1647314.1647321.
[8] Wang Y, Guan L, Venetsanopoulos A N. Audiovisual emotion recognition via cross-modal association in kernel space [C]//2011 IEEE International Conference on Multimedia and Expo (ICME). Barcelona, Spain, 2011: 6011949-1-6011949-6. DOI:10.1109/icme.2011.6011949.
[9] Metallinou A, Lee S, Narayanan S. Audio-visual emotion recognition using Gaussian mixture models for face and voice [C]//Tenth IEEE International Symposium on Multimedia. Berkeley, CA, USA, 2008: 250-257. DOI:10.1109/ism.2008.40.
[10] Li Y, Tao J, Schuller B, et al. MEC 2016: The multimodal emotion recognition challenge of CCPR 2016 [M]//Pattern Recognition. Springer, 2016: 667-678.
[11] Eyben F, Wöllmer M, Schuller B. OpenSMILE: The Munich versatile and fast open-source audio feature extractor [C]//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy, 2010: 1459-1462. DOI:10.1145/1873951.1874246.
[12] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge [C]//11th Annual Conference of the International Speech Communication Association. Makuhari, Chiba, Japan, 2010: 2795-2798.
[13] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. DOI:10.1109/TPAMI.2016.2577031.
[14] He G, Chen J, Liu X, et al. The SYSU system for CCPR 2016 multimodal emotion recognition challenge [M]//Pattern Recognition. Springer, 2016: 707-720.
[15] Sun B, Xu Q, He J, et al. Audio-video based multimodal emotion recognition using SVMs and deep learning [M]//Communications in Computer and Information Science, 2016: 621-631. DOI:10.1007/978-981-10-3005-5_51.

Memo

Memo:
Biographies: Ye Jiayin (1993—), female, graduate student; Zheng Wenming (corresponding author), male, Ph.D., professor, wenming_zheng@seu.edu.cn.
Citation: Ye Jiayin, Zheng Wenming, Li Yang, et al. Multimodal emotion recognition based on deep neural network [J]. Journal of Southeast University (English Edition), 2017, 33(4): 444-447. DOI:10.3969/j.issn.1003-7985.2017.04.009.
Last Update: 2017-12-20