A two-stage frequency-domain blind source separation methodfor underdetermined instantaneous mixtures

A two-stage frequency-domain blind source separation methodfor underdetermined instantaneous mixtures

Peng Tianliang  Chen Yang

(School of Information Science and Engineering, Southeast University, Nanjing 210096, China)

Abstract:In order to decrease the probability of missing some data points or noises being added in the inverse truncated mixing matrix (ITMM) algorithm, a two-stage frequency-domain method is proposed for blind source separation of underdetermined instantaneous mixtures. The separation process is decomposed into two steps of ITMM and matrix completion in the view that there are many soft-sparse (not very sparse) sources. First, the mixing matrix is estimated and the sources are recovered by the traditional ITMM algorithm in the frequency domain. Then, in order to retrieve the missing data and remove noises, the matrix completion technique is applied to each preliminary estimated source by the traditional ITMM algorithm in the frequency domain. Simulations show that, compared with the traditional ITMM algorithms, the proposed two-stage algorithm has better separation performances. In addition, the time consumption problem is considered. The proposed algorithm outperforms the traditional ITMM algorithm at a cost of no more than one-fourth extra time consumption.

Key words:inverse truncated mixing matrix; under-determined blind source separation (UBSS); frequency domain; matrix completion

The underdetermined blind source separation (UBSS) technique is the case that the number of sensors is fewer than that of the sources. The sparseness-based approach has been widely employed to solve the UBSS problem. Sparseness-based methods can be mainly classified into two classes. The first class is a two-stage approach: mixing matrix estimation and source recovery. Many researchers have proposed different source recovery approaches, such as the maximum a posteriori (MAP) assumption[1-2], the MMSE method[3], and the inverse truncated mixing matrix (ITMM) algorithm[4]. The second class extracts each source with time-frequency binary masks[5-6]. The ITMM is a powerful method for underdetermined blind source separation[4]. The separation performance is usually affected by the assumptions and parameter sizes of the ITMM algorithm. In order to improve the robustness and quality of the ITMM algorithm, we propose a two-stage method with joint matrix completion and ITMM algorithm in this paper. We apply the ITMM algorithm to estimate the time-frequency (TF) domain source in the first stage. Then, the matrix completion algorithm is implemented to every estimated source’s TF matrix. In our simulation, the mixtures of audio and speech signals are considered and the results show that this approach yields a good performance.

1 ITMM Method for UBSS

The block diagram of the proposed underdetermined UBSS algorithm is shown in Fig.1.

Fig.1 Block diagram of the proposed underdetermined UBSS algorithm

For the instantaneous UBSS case, let xt={xt(1), xt(2), …, xt(M)}T be an M-dimensional column vector corresponding to the output of M sensors at a given discrete time instant t, and let X be an M×T matrix corresponding to the sensor data at all times t=1, 2,…,T. Let S={s1, s2, …, sT} be the N×T matrix of underlying source signals, where st={st(1), st(2), …, st(N)}T and let A={a1, a2, …, aN} be the mixing matrix with size M×N, where ai={ai1, ai2, …, aiM}T and MThen the instantaneous UBSS case can be written as

X=AS

(1)

or equivalent in vector version as

xt=Ast

(2)

Assumption 1 The column vectors of A are assumed to be pairwise linearly independent, that is, for any k,l∈Λ, where Λ={1, 2, …, N} and k≠l, we can derive the relationship that ak and al are linearly independent.

Assumption 1 is usually made in the context of blind source separation (BSS)[7]. It is also known that BSS goes only possibly up to an unknown scaling and an unknown permutation of the original sources[8].

In order to transform signals into the frequency domain, we apply a short-time Fourier transform (STFT) to the sensor observations xt. Then, the instantaneous mixture at each frequency domain is given by

x(f,t)=As(f,t)

(3)

where x(f,t)={x1(f,t),x2(f,t),…,xM(f,t)}T,s(f,t)={s1(f,t),s2(f,t),…,sN(f,t)}T are the TF representations of the sensors x(t) and the sources s(t), respectively.

Assumption 2 The number of coexisting sources at a single TF point is smaller than that of sensors. In other words, if we assume that the number of sensors is M, then the largest number of sources that exist at any TF point is M-1.

Actually Assumption 2 is a relaxed condition to the W-disjoint orthogonality of the time-frequency masking technique[6].

The ITMM-UBSS method can be divided into two sections. The first section is the mixing matrix estimation, which is realized by the mask-TFD method[3] in this paper; the second section is the source recovery in the frequency domain by the traditional ITMM algorithm. As the ITMM algorithm shown in Ref.[4], we can obtain the general version of si(t,f ) through the union of TF points as follows:

{}

(4)

(5)

k2,…,akM-1, alj]with ljΓ\i and ljΛ(j=1, 2,…, N-M+1). must satisfy

(6)

where Im(x) represents the imaginary part of x; x* is the conjugate of x; δik1k2kM-2 is a very small positive value. In Matlab simulations, we write Eq.(6) as

(7)

where angle(x) represents the angle of x.

2 Matrix Completion

Many signals become sparse after they are transformed into the frequency domain, such as speech signals. Suppose that we have a rank-r matrix A of size M×N, where r≤min(M, N). In many engineering problems, the entries of the matrix are often corrupted by errors or noise, and some of the entries can even be missed, or only a set of measurements of the matrix is accessible rather than its entries directly. In general, we model the observed matrix B to be a set of linear measurements on matrix A, subject to noise and gross corruptions, i.e.,

B=L(A)+n

(8)

where L is a linear operator, and n represents the matrix of corruptions. We seek to recover the true matrix A from B. We now consider the case where L is the identity operator and n is a sparse matrix (i.e., most of its entries are zero) but whose non-zero entries can be practically unbounded. Since the rank r of A is unknown, the problem is to find the matrix of the lowest rank that can generate B when added to an unknown sparse matrix n. Mathematically, for an appropriate choice of parameter η>0, we have the following combinatorial optimization problem to solve:

‖n‖0  s.t. B=A+n

(9)

where η is a positive constant, and ‖·‖0 is the counting norm (i.e., the number of non-zero entries in the matrix). Since this problem cannot be efficiently solved, we consider its convex relaxation instead:

‖A‖*+η‖n‖1  s.t. B=A+n

(10)

where ‖σk(A); σk(A) denotes the k-th largest singular value of A; and ‖·‖1 represents the matrix 1-norm (i.e., the sum of absolute values of all entries of the matrix). We call the above convex program the robust principal component analysis (RPCA)[9-10].

3 Joint Matrix Completion with ITMM

Suppose that we have recovered sources in the frequency domain by the ITMM method. We assume that the estimated source matrix is i with the size of F×T, and F and T are the number of frequency bins and time frames, respectively. The elements of i are i(f, t) (1≤fF, 1≤tT). We assume that i and the true sources Si have the following relationship:

(11)

Then the problem that we want to obtain Si via Eq.(11) is equivalent to the matrix completion problem as in Eq.(8). So, we can use the matrix completion technique for every frequency domain matrix of estimated sources to retrieve the missing data and remove the noise data. We use the method in Ref.[10] to simulate the matrix completion in this paper.

In summary, the joint matrix completion with the ITMM algorithm is as follows:

1) The traditional ITMM-UBSS algorithm is used in the frequency domain. First, estimate the mixing matrix with the mask-STD method[3].Secondly, recover sources with the ITMM algorithm.

2) The matrix completion method is used to estimate each source based on the traditional ITMM-UBSS algorithm in the frequency domain. Then, the time domain sources can be obtained by the inverse STFT.

4 Simulation

To demonstrate the two-stage method, we divide the simulation into two sections: One is the SiSEC 2008 data[11] simulation, and the other is the simulation with random M×6 mixing matrix A for M=2, 3, …, 5.

4.1 SiSEC 2008 data

A linear instantaneous stereo mixture (with positive mixing coefficients) of two drum sources and one bass line, and four female speeches mixed with three sensors are used in this paper.

In this paper, the length of STFT is 1 024; the window overlap is 0.5; all the signals are sampled at 16 kHz; and the sample length is 160 000. Then, the size of i is 512×313. The algorithms are coded with Matlab and run on an Intel Core(TM) 2 Duo (2.20 GHz) processor.

Fig.2 shows the spectrum of two sensors. The spectrum of three sources, three separated sources with the traditional ITMM, and three separated sources with the proposed method are shown in Fig.3, Fig.4 and Fig.5, respectively. The δi of the traditional ITMM and the proposed method are all set to be δ123=0.3.

(a)

(b)
Fig.2 Spectrum of observations. (a) The first observation;(b) The second observation

(a)

(b)
Fig.3 Spectrum of three sources. (a) The baseline; (b) The first drum; (c) The second drum

(a)

(b)

(c)
Fig.4 Spectrum of three separated sources with traditional ITMM. (a) The baseline; (b) The first drum; (c) The second drum

As shown in Fig.4 and Fig.5, since the matrix completion technique in the proposed method can retrieve the missing data and reject noise, compared with the traditional ITMM algorithm, the spectra of the proposed method can be seen more clearly. The traditional ITMM algorithm must satisfy Assumption 1 and Assumption 2, which may be the main reason for corruption by errors or noise in the time-frequency domain of estimated sources.

(a)

(b)

(c)
Fig.5 Spectrum of three separated sources with the proposed method. (a) The baseline; (b) The first drum; (c) The second drum

To measure the performance, we decompose an estimated signal as =starget+einterf+eartif+enoise[12]. As the measurement of performance, we use the source-to-distortion ratio

(20)

the source-to-interference ratio

(21)

and the source-to-artifact ratio

(22)

The higher values of εSDR, εSIR and εSAR indicate better performance. We must note that the ITMM is a source recovery method, and the mixing matrix estimation problem in UBSS is not considered. We use the method proposed in Ref.[3] to estimate the mixing matrix in this paper.

Tab.1 is the performance evaluation of the ITMM and the proposed algorithms of mixtures with three audio sources, namely S1, S2 and S3, and two sensors under δ123=0.3. As shown in Tab.1, the proposed method can obtain a higher performance than the traditional ITMM method except for εSIR of the third source. Tab.2 is the performance evaluation of the ITMM and the proposed methods of mixtures with four speeches and three sensors under δ1234=0.3. As shown in Tab.2, the proposed method can obtain a higher performance than the traditional ITMM method except for εSIR of the fourth source. However, the average performance evaluations of the proposed method are higher than those of the traditional ITMM method as shown in Tab.1 and Tab.2. The reason is that the matrix completion of the proposed joint method can complete and extend the frequency bin of estimated sources.

Tab.1 Performance evaluation of ITMM and proposed methods of mixtures with three audio sources and two sensors dB

ItemsMethodsS1S2S3εSDRITMM7.370.7927.26Proposed8.010.9228.02εSIRITMM14.5813.8230.14Proposed14.8414.2330.00εSARITMM8.361.3330.41Proposed10.002.1132.65

Tab.2 Quality evaluation of ITMM and proposed methods of mixtures with four speeches and three sensors dB

ItemsMethodsS1S2S3S4εSDRITMM9.747.119.3211.32Proposed11.0110.1610.3512.59εSIRITMM16.2113.5216.0018.68Proposed17.0814.2316.6017.57εSARITMM10.958.4310.4912.26Proposed12.009.1111.8613.15

Fig.6 shows the average performances of εSDR, εSIR, εSAR for separating the three speech sources from two sensors under the condition of different values of δi. As shown in Fig.6, εSDR, εSIR, and εSAR can reach their maximum values during different intervals, and obtain high values when 0.2≤δi≤0.4, 0.01≤δi≤0.2, and 0.8≤δi≤1.0, respectively. Our simulation suggests that the speeches and audio waves maintain high quality in the auditory perception when 0.1≤δi≤0.4.

4.2 Simulations with random M×6 mixing matrix A for M=2, 3, 4, …, 5

The random simulations are completed in this section. We choose the random M×6 mixing matrix A for M=2, 3, 4, …, 5. Three speech sources and three music sources are chosen from SiSEC 2008 data[11]. We carry out the random simulations 50 times for each mixing model. Fig.7 shows the simulations of the six sources in the time domain.

Tab.3 is the average quality evaluation of the ITMM and the proposed methods in random cases. As shown in Tab.3, εSDR, εSIRand εSAR of the proposed method are all higher than those of the traditional ITMM algorithm except for εSIR of 2×6 and 4×6 when δ=0.05.

(a)

(b)

(c)
Fig.6 Average performances of the traditional ITMM and the proposed methods for separating three audio sources from two sensors under different values of δ(δ123=δ). (a) εSDR; (b) εSIR; (c) εSAR

Fig.7 The six testing sources in the time domain

Tab.3 Average performance evaluation of ITMM and proposed methods in random cases dB

Items2×6matrix3×6matrix4×6matrix5×6matrixITMMProposedITMMProposedITMMProposedITMMProposedδ=0.05εSDR-4.69-4.131.141.195.385.4412.9213.01εSIR9.429.3811.8812.0113.9813.8821.5321.66εSAR-3.18-2.942.302.336.556.7813.8814.42δ=0.1εSDR-3.09-2.771.922.214.54.7611.2511.34εSIR9.179.9610.4011.9511.3713.5318.5318.89εSAR-1.46-1.133.534.466.197.0112.6612.79δ=0.2εSDR-1.67-1.341.631.873.263.689.3210.93εSIR8.8010.918.749.939.1410.3115.2816.01εSAR0.191.094.035.725.566.5111.3612.09δ=0.3εSDR-0.89-0.611.211.341.581.788.129.03εSIR7.948.016.326.896.407.6213.6514.11εSAR1.481.544.454.774.966.0110.6611.24δ=0.4εSDR-0.60-0.561.011.061.351.447.877.94εSIR7.237.455.595.855.786.0012.4412.67εSAR1.913.123.674.194.416.1310.1212.91

Tab.4 shows the average computational time (in all the cases shown in Tab.3) in random simulations. As shown in Tab.4, the proposed method consumes about 4 s more than the traditional ITMM method in all the cases. The proposed method is a little time-consuming but its average performances are improved.

Tab.4 Computational time for the proposed and traditional ITMM methods s

CasesITMMProposed2×6matrix18.7322.233×6matrix20.2423.694×6matrix19.5123.075×6matrix18.1921.48

5 Conclusion

In this paper, we propose a two-stage frequency domain algorithm for underdetermined instantaneous blind source separation. In the first stage, we use the ITMM algorithm[4] to extract the sources in the frequency domain. In the second stage, we use the matrix completion technique to retrieve the missing data and remove the noise data of the estimated source matrix (corrupted by errors or noise) in the time-frequency domain. Simulation results show that the proposed algorithm has a higher performance compared with the conventional ITMM algorithm.

References:

[1]Cobos M, Lopez J J. Maximum a posteriori binary mask estimation for underdetermined source separation using smoothed posteriors [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(7): 2059-2064. DOI:10.1109/tasl.2012.2195654.

[2]Leglaive S, Badeau R, Richard G. Multichannel audio source separation with probabilistic reverberation modeling[C]//IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York, USA, 2015: 1-5.

[3]Peng Tianliang, Chen Yang, Liu Zengli. A time-frequency domain blind source separation method for underdetermined instantaneous mixtures [J]. Circuits Systems & Signal Processing, 2015, 34(12): 3883-3895.

[4]Peng Tianliang, Chen Yang. Inverse truncated mixing matrix (ITMM) algorithm application to underdetermined convolutive blind speech sources separation [C]//IEEE International Conference on Signal Processing, Communications and Computing. Ningbo, China, 2015: 771-776.

[5]Yilmaz O, Rickard S. Blind separation of speech mixtures via time-frequency masking [J]. IEEE Transactions on Signal Processing, 2004, 52(7):1830-1847.

[6]Sawada H, Araki S, Makino S. Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment [J].IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(3): 516-527. DOI:10.1109/tasl.2010.2051355.

[7]Linh-Trung N, Belouchrani A, Abed-Meraim K, et al. Separating more sources than sensors using time-frequency distributions [J]. EURASIP Journal on Applied Signal Processing, 2005, 2005: 2828-2847. DOI:10.1155/asp.2005.2828.

[8]Belouchrani A, Abed-Meraim K, Cardoso J F, et al. A blind source separation technique using second-order statistics [J]. IEEE Transactions on Signal Processing, 1997, 45(2): 434-444. DOI:10.1109/78.554307.

[9]Cai J F, Candès E J, Shen Z. A singular value thresholding algorithm for matrix completion [J]. SIAM Journal on Optimization, 2010, 20(4):1956-1982. DOI:10.1137/080738970.

[10]Candès E J, Recht B. Exact matrix completion via convex optimization [J]. Foundations of Computational Mathematics, 2009, 9(6): 717-772.

[11]Vincent E, Sawada H, Bofill P, et al. First stereo audio source separation evaluation campaign: Data, algorithms and results [M]//Independent Component Analysis and Signal Separation. Berlin: Springer, 2007: 552-559.

[12]Vincent E, Gribonval R, Févotte C. Performance measurement in blind audio source separation [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1462-1469. DOI:10.1109/tsa.2005.858005.

一种基于频域的欠定瞬时混合2步盲分离法

彭天亮  陈 阳

(东南大学信息科学与工程学院,南京210096)

摘要:为了减少传统的截断混合矩阵求逆(ITMM)算法在个别时频点会丢失数据或者产生噪声信号的概率,提出了一种基于频域的2步欠定瞬时盲分离算法.由于现实中存在大量软稀疏(稀疏度不是很大)混合信号,将分离过程分解为ITMM和矩阵补偿2个步骤.首先估计出混合矩阵和利用经典的ITMM算法对混合信号进行初步恢复,然后对初步估计的信号时频矩阵进行矩阵补偿处理,从而达到修补丢失数据和去除多余数据(去噪)的效果.实验仿真验证了所提出的2步分离法相对于传统的ITMM算法能够得到更好的分离效果.此外,对算法的时耗问题进行了研究,相对于传统的ITMM算法,所提算法的时耗增加不到四分之一,却能够得到更好的分离效果.

关键词:截断混合矩阵求逆;欠定盲源分离;频域;矩阵填充

中图分类号:TP391

Received:2015-11-23.

Biographies:Peng Tianliang(1984—),male,graduate; Chen Yang(corresponding author), male, doctor, professor, cheny@seu.edu.cn.

Foundation item:The National Natural Science Foundation of China (No.60872074).

Citation:Peng Tianliang, Chen Yang. A two-stage frequency-domain blind source separation method for underdetermined instantaneous mixtures[J].Journal of Southeast University (English Edition),2016,32(2):135-140.doi:10.3969/j.issn.1003-7985.2016.02.001.

doi:10.3969/j.issn.1003-7985.2016.02.001