
[1] Ye Jiayin, Zheng Wenming, Li Yang, Cai Youyi, et al. Multimodal emotion recognition based on deep neural network [J]. Journal of Southeast University (English Edition), 2017, 33 (4): 444-447. [doi:10.3969/j.issn.1003-7985.2017.04.009]

Multimodal emotion recognition based on deep neural network

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

2017, Issue 4
Research Field: Image Processing


Multimodal emotion recognition based on deep neural network
Ye Jiayin, Zheng Wenming, Li Yang, Cai Youyi, Cui Zhen
School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210029, China
Keywords: emotion recognition; convolutional neural network (CNN); recurrent neural network (RNN)
To increase the accuracy of emotion recognition from voice and video, a convolutional neural network (CNN) and a recurrent neural network (RNN) are combined to encode and integrate the two information sources. For the audio signals, several frequency bands and some energy functions are extracted as low-level features using a sophisticated audio technique, and these are then encoded with a one-dimensional (1D) convolutional neural network to abstract high-level features. The high-level features are fed into a recurrent neural network to capture dynamic tone changes along the temporal dimension. For the video signals, a two-dimensional (2D) convolutional neural network and a similar RNN are used to capture dynamic changes in facial appearance across temporal sequences. The method was evaluated on the Chinese Natural Audio-Visual Emotion Database of the Chinese Conference on Pattern Recognition (CCPR) 2016. Experimental results show that the average classification precision of the proposed method is 41.15%, which is 16.62% higher than that of the baseline algorithm provided by CCPR 2016. These results indicate that the proposed method identifies emotional information more accurately.
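The two-branch pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the weights are random stand-ins for trained parameters, the input sizes are toy values, and the number of classes (`N_CLASSES`), the plain Elman-style recurrence, the global average pooling after the 2D convolution, and the averaging fusion rule in `fuse` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random

N_CLASSES = 8  # assumed number of emotion classes; not stated in the abstract


def rand_matrix(rng, rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv1d_relu(frames, kernels):
    """Valid 1D convolution over time, then ReLU (audio branch).
    frames: T x D low-level features; kernels: K kernels, each W x D.
    Returns a (T - W + 1) x K sequence of high-level features."""
    width, dim = len(kernels[0]), len(frames[0])
    out = []
    for t in range(len(frames) - width + 1):
        out.append([max(0.0, sum(k[i][d] * frames[t + i][d]
                                 for i in range(width) for d in range(dim)))
                    for k in kernels])
    return out


def conv2d_relu_pool(image, kernels):
    """Valid 2D convolution, ReLU, then global average pooling (video branch).
    image: H x W face crop; kernels: K kernels, each kh x kw. Returns K values."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    rows, cols = len(image) - kh + 1, len(image[0]) - kw + 1
    pooled = []
    for k in kernels:
        total = 0.0
        for r in range(rows):
            for c in range(cols):
                total += max(0.0, sum(k[i][j] * image[r + i][c + j]
                                      for i in range(kh) for j in range(kw)))
        pooled.append(total / (rows * cols))
    return pooled


def rnn_last_state(seq, w_in, w_rec):
    """Elman recurrence h_t = tanh(W_in x_t + W_rec h_{t-1}); returns the final h."""
    hidden = [0.0] * len(w_rec)
    for x in seq:
        hidden = [math.tanh(sum(w_in[j][i] * x[i] for i in range(len(x)))
                            + sum(w_rec[j][k] * hidden[k] for k in range(len(hidden))))
                  for j in range(len(w_rec))]
    return hidden


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sequence(rng, seq, n_hidden=4):
    """RNN over a feature sequence, then a linear layer and softmax."""
    w_in = rand_matrix(rng, n_hidden, len(seq[0]))
    w_rec = rand_matrix(rng, n_hidden, n_hidden)
    w_out = rand_matrix(rng, N_CLASSES, n_hidden)
    hidden = rnn_last_state(seq, w_in, w_rec)
    return softmax([sum(w * h for w, h in zip(row, hidden)) for row in w_out])


def fuse(p_audio, p_video):
    """Late fusion by averaging class probabilities (an assumed fusion rule)."""
    return [(a + v) / 2.0 for a, v in zip(p_audio, p_video)]


rng = random.Random(0)
# Toy inputs: 10 audio frames of 6 low-level features, 5 face crops of 8 x 8 pixels.
audio = rand_matrix(rng, 10, 6)
video = [rand_matrix(rng, 8, 8) for _ in range(5)]

kernels_1d = [rand_matrix(rng, 3, 6) for _ in range(4)]
kernels_2d = [rand_matrix(rng, 3, 3) for _ in range(4)]  # shared across frames

audio_feats = conv1d_relu(audio, kernels_1d)
video_feats = [conv2d_relu_pool(img, kernels_2d) for img in video]

p_audio = classify_sequence(rng, audio_feats)
p_video = classify_sequence(rng, video_feats)
p_fused = fuse(p_audio, p_video)
predicted = max(range(N_CLASSES), key=lambda k: p_fused[k])
print("fused class probabilities:", [round(p, 3) for p in p_fused])
print("predicted class index:", predicted)
```

In a real system each branch would be trained on labeled data, and the recurrent unit and fusion strategy would be chosen empirically. The sketch only shows how the per-frame convolutional features flow into a temporal model before the two modalities are combined.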




Biographies: Ye Jiayin (1993-), female, graduate student; Zheng Wenming (corresponding author), male, Ph.D., professor, wenming_zheng@seu.edu.cn.
Citation: Ye Jiayin, Zheng Wenming, Li Yang, et al. Multimodal emotion recognition based on deep neural network [J]. Journal of Southeast University (English Edition), 2017, 33(4): 444-447. DOI: 10.3969/j.issn.1003-7985.2017.04.009.
Last Update: 2017-12-20