
[1] Song Peng, Jin Yun, Bao Yongqiang, et al. Efficient fundamental frequency transformation for voice conversion [J]. Journal of Southeast University (English Edition), 2012, 28(2): 140-144. [doi:10.3969/j.issn.1003-7985.2012.02.002]

Efficient fundamental frequency transformation for voice conversion

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
28
Issue:
2012 2
Page:
140-144
Research Field:
Information and Communication Engineering
Publishing date:
2012-06-30

Info

Title:
Efficient fundamental frequency transformation for voice conversion
Author(s):
Song Peng1, Jin Yun1,2, Bao Yongqiang3, Zhao Li1, Zou Cairong1
1 Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
2 School of Physics and Electronic Engineering, Xuzhou Normal University, Xuzhou 221116, China
3 School of Communication Engineering, Nanjing Institute of Technology, Nanjing 211167, China
Keywords:
fundamental frequency prediction; support vector regression; mean-variance linear conversion; adaptive median filter
PACS:
TN912.3
DOI:
10.3969/j.issn.1003-7985.2012.02.002
Abstract:
To improve the performance of voice conversion, fundamental frequency (F0) transformation methods are investigated and an efficient F0 transformation algorithm is proposed. First, unlike traditional linear transformation methods, the relationship between F0 and the spectral parameters is explored: in each component of the Gaussian mixture model (GMM), F0 is predicted from the converted spectral parameters using support vector regression (SVR). Then, to reduce the over-smoothing caused by the statistical averaging of the GMM, a mixed transformation method combining SVR with the traditional mean-variance linear (MVL) conversion is presented. In addition, the adaptive median filter, widely used in image processing, is adopted to solve the discontinuity problem caused by frame-wise transformation. Objective and subjective experiments are carried out to evaluate the proposed method. The results demonstrate that the proposed method outperforms the traditional F0 transformation methods in terms of both the similarity and the quality of the converted speech.
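Two of the building blocks named in the abstract, MVL conversion and adaptive median filtering, can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes the MVL mapping is applied in the log-F0 domain, that unvoiced frames are marked with F0 = 0, and that the adaptive median filter is a simplified 1-D adaptation of the image-processing scheme, applied to a voiced segment. The function names and the `max_half` parameter are hypothetical. The SVR prediction and the mixing rule are not sketched, because the abstract does not specify them.

```python
import math
from statistics import mean, median, pstdev


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0 if f > 0]
    return mean(logs), pstdev(logs)


def mvl_convert(f0_src, src_stats, tgt_stats):
    """Mean-variance linear conversion, assumed here to act on log-F0.

    Each voiced frame is shifted and scaled so that the source log-F0
    statistics match the target statistics; unvoiced frames stay at 0.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for f in f0_src:
        if f <= 0:
            out.append(0.0)  # unvoiced frame: pass through
        else:
            out.append(math.exp(mu_t + (sd_t / sd_s) * (math.log(f) - mu_s)))
    return out


def adaptive_median_1d(x, max_half=3):
    """Simplified 1-D adaptive median filter (illustrative assumption).

    The window around each sample grows until its median is not an
    extreme value. The sample is then kept if it is not an extreme
    itself, otherwise it is replaced by the window median. Edges are
    handled by repeating the first and last samples.
    """
    n = len(x)
    out = []
    for i in range(n):
        for half in range(1, max_half + 1):
            win = [x[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
            lo, hi, med = min(win), max(win), median(win)
            if lo < med < hi:
                out.append(x[i] if lo < x[i] < hi else med)
                break
        else:
            out.append(med)  # largest window reached: fall back to its median
    return out


if __name__ == "__main__":
    source = [100.0, 0.0, 200.0]
    target = [200.0, 400.0]
    print(mvl_convert(source, log_f0_stats(source), log_f0_stats(target)))
    print(adaptive_median_1d([100, 101, 102, 200, 103, 104, 105]))
```

In this sketch the filter leaves smooth stretches of the contour untouched and replaces only samples that are extremes within their local window, which is the property that makes it attractive for removing frame-wise jumps.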


Memo

Memo:
Biographies: Song Peng (1983-), male, graduate; Zhao Li (corresponding author), male, doctor, professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 60975017), the Natural Science Foundation of Guangdong Province (No. 10252800001000001), the Natural Science Foundation of Higher Education Institutions of Jiangsu Province (No. 10KJB510005).
Citation: Song Peng, Jin Yun, Bao Yongqiang, et al. Efficient fundamental frequency transformation for voice conversion [J]. Journal of Southeast University (English Edition), 2012, 28(2): 140-144. [doi:10.3969/j.issn.1003-7985.2012.02.002]
Last Update: 2012-06-20