

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
32
Issue:
2016 4
Page:
402-407
Research Field:
Information and Communication Engineering
Publishing date:
2016-12-20

Info

Title:
Auditory attention model based on Chirplet for cross-corpus speech emotion recognition
Author(s):
Zhang Xinran1 Song Peng2 Zha Cheng1 Tao Huawei1 Zhao Li1
1Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
2School of Computer and Control Engineering, Yantai University, Yantai 264005, China
Keywords:
speech emotion recognition; selective attention mechanism; spectrogram feature; cross-corpus
PACS:
TN912.34
DOI:
10.3969/j.issn.1003-7985.2016.04.002
Abstract:
To address the problem of feature mismatch across experimental databases, a key issue in cross-corpus speech emotion recognition, an auditory attention model based on the Chirplet is proposed for feature extraction. First, the auditory attention model is employed to detect variational emotion features and extract the spectral features. Then, a selective attention mechanism model is proposed to extract salient gist features, which are closely related to the expected performance in cross-corpus testing. Furthermore, Chirplet time-frequency atoms are introduced into the model. By forming a complete atom dictionary, the Chirplet improves spectral feature extraction and increases the amount of information carried by the features. Since samples drawn from multiple databases contain multiple signal components, the Chirplet accordingly expands the scale of the feature vector in the time-frequency domain. Experimental results show that, compared with the traditional feature model, the proposed feature extraction approach combined with a prototypical classifier yields a significant improvement in cross-corpus speech emotion recognition. In addition, the proposed method is more robust when the training and testing sets come from different sources.
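For readers who want to prototype the two ingredients the abstract names, the sketch below is a minimal, hypothetical illustration rather than the authors' implementation: it builds Gaussian Chirplet time-frequency atoms, projects one speech frame onto a small Chirplet dictionary, and pools the resulting time-frequency map into a fixed-length "gist" vector in the spirit of auditory-attention feature extraction. All function names, the frequency grid, the chirp rates, and the pooling grid are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def chirplet_atom(length, fs, t0, f0, chirp_rate, sigma):
    """Complex Gaussian Chirplet atom: a Gaussian-windowed sinusoid whose
    instantaneous frequency sweeps linearly from f0 at rate chirp_rate (Hz/s)."""
    t = np.arange(length) / fs
    envelope = np.exp(-0.5 * ((t - t0) / sigma) ** 2)
    phase = 2 * np.pi * (f0 * (t - t0) + 0.5 * chirp_rate * (t - t0) ** 2)
    atom = envelope * np.exp(1j * phase)
    return atom / (np.linalg.norm(atom) + 1e-12)      # unit energy

def chirplet_map(frame, fs, freqs, chirp_rates, sigma):
    """Project one frame onto a small dictionary of Chirplet atoms and keep the
    magnitude of each projection; rows index frequency, columns index chirp rate."""
    n = len(frame)
    t0 = 0.5 * n / fs                                  # centre the atoms on the frame
    feat = np.empty((len(freqs), len(chirp_rates)))
    for i, f0 in enumerate(freqs):
        for j, c in enumerate(chirp_rates):
            atom = chirplet_atom(n, fs, t0, f0, c, sigma)
            feat[i, j] = np.abs(np.vdot(atom, frame))  # vdot conjugates the atom
    return feat

def gist_features(tf_map, grid=(4, 5)):
    """Grid-average a time-frequency (or saliency) map into a fixed-length
    'gist' vector, as done in auditory-attention style feature extraction."""
    rows, cols = tf_map.shape
    r_step, c_step = rows // grid[0], cols // grid[1]
    gist = [tf_map[r:r + r_step, c:c + c_step].mean()
            for r in range(0, r_step * grid[0], r_step)
            for c in range(0, c_step * grid[1], c_step)]
    return np.asarray(gist)

if __name__ == "__main__":
    fs = 16000
    frame = np.random.randn(400)                       # stand-in for one 25 ms speech frame
    freqs = np.linspace(100, 4000, 32)                 # hypothetical analysis frequencies (Hz)
    chirp_rates = np.linspace(-2e4, 2e4, 5)            # hypothetical chirp rates (Hz/s)
    tf = chirplet_map(frame, fs, freqs, chirp_rates, sigma=0.005)
    print(gist_features(tf).shape)                     # fixed-length feature vector per frame
```

In a cross-corpus setup such as the one described here, gist vectors of this kind would be extracted per utterance on both corpora, with the classifier trained on one corpus and evaluated on the other.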

References:

[1] Song P, Jin Y, Zha C, et al. Speech emotion recognition method based on hidden factor analysis[J]. Electronics Letters, 2014, 51(1): 112-114. DOI:10.1049/el.2014.3339.
[2] Schuller B, Zhang Z, Weninger F, et al. Synthesized speech for model training in cross-corpus recognition of human emotion[J]. International Journal of Speech Technology, 2012, 15(3): 313-323. DOI:10.1007/s10772-012-9158-0.
[3] Deng J, Zhang Z, Marchi E, et al. Sparse autoencoder-based feature transfer learning for speech emotion recognition[C]//IEEE 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. Geneva, Switzerland, 2013: 511-516. DOI:10.1109/acii.2013.90.
[4] Jin Y, Song P, Zheng W, et al. Speaker-independent speech emotion recognition based on two-layer multiple kernel learning[J]. IEICE Transactions on Information and Systems, 2013, 96(10): 2286-2289. DOI:10.1587/transinf.e96.d.2286.
[5] Kalinli O, Narayanan S. Prominence detection using auditory attention cues and task-dependent high level information[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(5): 1009-1024. DOI:10.1109/tasl.2009.2014795.
[6] Kalinli O. Syllable segmentation of continuous speech using auditory attention cues[C]//International Speech Communication Association. Florence, Italy, 2011: 425-428.
[7] Wong W K, Zhao H T. Supervised optimal locality preserving projection[J]. Pattern Recognition, 2012, 45(1): 186-197. DOI:10.1016/j.patcog.2011.05.014.
[8] Yin Q, Qian S, Feng A. A fast refinement for adaptive Gaussian chirplet decomposition[J]. IEEE Transactions on Signal Processing, 2002, 50(6): 1298-1306. DOI:10.1109/tsp.2002.1003055.
[9] Bayram I. An analytic wavelet transform with a flexible time-frequency covering[J]. IEEE Transactions on Signal Processing, 2013, 61(5): 1131-1142. DOI:10.1109/tsp.2012.2232655.
[10] Noriega G. A neural model to study sensory abnormalities and multisensory effects in autism[J]. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2015, 23(2): 199-209. DOI:10.1109/TNSRE.2014.2363775.
[11] Khoubrouy S A, Panahi I M S, Hansen J H L. Howling detection in hearing aids based on generalized Teager-Kaiser operator[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 154-161. DOI:10.1109/taslp.2014.2377575.
[12] Ali S A, Khan A, Bashir N. Analyzing the impact of prosodic feature (pitch) on learning classifiers for speech emotion corpus[J]. International Journal of Information Technology and Computer Science, 2015, 7(2): 54-59. DOI:10.5815/ijitcs.2015.02.07.
[13] Ajmera P K, Jadhav D V, Holambe R S. Text-independent speaker identification using radon and discrete cosine transforms based features from speech spectrogram[J]. Pattern Recognition, 2011, 44(10): 2749-2759. DOI:10.1016/j.patcog.2011.04.009.
[14] Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech[C]//International Speech Communication Association. Lisbon, Portugal, 2005: 1517-1520.
[15] Martin O, Kotsia I, Macq B, et al. The eNTERFACE’05 audio-visual emotion database[C]//IEEE 22nd International Conference on Data Engineering Workshops. San Francisco, CA, USA, 2006: 8-10. DOI:10.1109/icdew.2006.145.
[16] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2010 paralinguistic challenge[C]//International Speech Communication Association. Chiba, Japan, 2010: 2794-2797.
[17] Eyben F, Wöllmer M, Schuller B. openSMILE: The Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the International Conference on Multimedia. Firenze, Italy, 2010: 1459-1462.
[18] Moustakidis S, Mallinis G, Koutsias N, et al. SVM-based fuzzy decision trees for classification of high spatial resolution remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2012, 50(1): 149-169. DOI:10.1109/TGRS.2011.2159726.
[19] Kim E H, Hyun K H, Kim S H, et al. Improved emotion recognition with a novel speaker-independent feature[J]. IEEE/ASME Transactions on Mechatronics, 2009, 14(3): 317-325.

Memo

Memo:
Biographies: Zhang Xinran (1980—), male, graduate; Zhao Li (corresponding author), male, doctor, professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 61273266, 61231002, 61301219, 61375028), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20110092130004), the Natural Science Foundation of Shandong Province (No. ZR2014FQ016).
Citation: Zhang Xinran, Song Peng, Zha Cheng, et al. Auditory attention model based on Chirplet for cross-corpus speech emotion recognition[J]. Journal of Southeast University (English Edition), 2016, 32(4): 402-407. DOI:10.3969/j.issn.1003-7985.2016.04.002.
Last Update: 2016-12-20