A two-stage frequency-domain blind source separation methodfor underdetermined instantaneous mixtures

(School of Information Science and Engineering, Southeast University, Nanjing 210096, China)

Abstract：In order to decrease the probability of missing some data points or noises being added in the inverse truncated mixing matrix (ITMM) algorithm, a two-stage frequency-domain method is proposed for blind source separation of underdetermined instantaneous mixtures. The separation process is decomposed into two steps of ITMM and matrix completion in the view that there are many soft-sparse (not very sparse) sources. First, the mixing matrix is estimated and the sources are recovered by the traditional ITMM algorithm in the frequency domain. Then, in order to retrieve the missing data and remove noises, the matrix completion technique is applied to each preliminary estimated source by the traditional ITMM algorithm in the frequency domain. Simulations show that, compared with the traditional ITMM algorithms, the proposed two-stage algorithm has better separation performances. In addition, the time consumption problem is considered. The proposed algorithm outperforms the traditional ITMM algorithm at a cost of no more than one-fourth extra time consumption.

Key words：inverse truncated mixing matrix; under-determined blind source separation (UBSS); frequency domain; matrix completion

The underdetermined blind source separation (UBSS) technique is the case that the number of sensors is fewer than that of the sources. The sparseness-based approach has been widely employed to solve the UBSS problem. Sparseness-based methods can be mainly classified into two classes. The first class is a two-stage approach: mixing matrix estimation and source recovery. Many researchers have proposed different source recovery approaches, such as the maximum a posteriori (MAP) assumption[1-2], the MMSE method[3], and the inverse truncated mixing matrix (ITMM) algorithm[4]. The second class extracts each source with time-frequency binary masks[5-6]. The ITMM is a powerful method for underdetermined blind source separation[4]. The separation performance is usually affected by the assumptions and parameter sizes of the ITMM algorithm. In order to improve the robustness and quality of the ITMM algorithm, we propose a two-stage method with joint matrix completion and ITMM algorithm in this paper. We apply the ITMM algorithm to estimate the time-frequency (TF) domain source in the first stage. Then, the matrix completion algorithm is implemented to every estimated source’s TF matrix. In our simulation, the mixtures of audio and speech signals are considered and the results show that this approach yields a good performance.

1　ITMM Method for UBSS

The block diagram of the proposed underdetermined UBSS algorithm is shown in Fig.1.

For the instantaneous UBSS case, let xt={xt(1), xt(2), …, xt(M)}T be an M-dimensional column vector corresponding to the output of M sensors at a given discrete time instant t, and let X be an M×T matrix corresponding to the sensor data at all times t=1, 2,…,T. Let S={s1, s2, …, sT} be the N×T matrix of underlying source signals, where st={st(1), st(2), …, st(N)}T and let A={a1, a2, …, aN} be the mixing matrix with size M×N, where ai={ai1, ai2, …, aiM}T and MThen the instantaneous UBSS case can be written as

Assumption 1　The column vectors of A are assumed to be pairwise linearly independent, that is, for any k,l∈Λ, where Λ={1, 2, …, N} and k≠l, we can derive the relationship that ak and al are linearly independent.

Assumption 1 is usually made in the context of blind source separation (BSS)[7]. It is also known that BSS goes only possibly up to an unknown scaling and an unknown permutation of the original sources[8].

In order to transform signals into the frequency domain, we apply a short-time Fourier transform (STFT) to the sensor observations xt. Then, the instantaneous mixture at each frequency domain is given by

where x(f,t)={x1(f,t),x2(f,t),…,xM(f,t)}T,s(f,t)={s1(f,t),s2(f,t),…,sN(f,t)}T are the TF representations of the sensors x(t) and the sources s(t), respectively.

Assumption 2　The number of coexisting sources at a single TF point is smaller than that of sensors. In other words, if we assume that the number of sensors is M, then the largest number of sources that exist at any TF point is M-1.

Actually Assumption 2 is a relaxed condition to the W-disjoint orthogonality of the time-frequency masking technique[6].

The ITMM-UBSS method can be divided into two sections. The first section is the mixing matrix estimation, which is realized by the mask-TFD method[3] in this paper; the second section is the source recovery in the frequency domain by the traditional ITMM algorithm. As the ITMM algorithm shown in Ref.[4], we can obtain the general version of si(t,f ) through the union of TF points as follows:

where Im(x) represents the imaginary part of x; x* is the conjugate of x; δik1k2…kM-2 is a very small positive value. In Matlab simulations, we write Eq.(6) as

2　Matrix Completion

Many signals become sparse after they are transformed into the frequency domain, such as speech signals. Suppose that we have a rank-r matrix A of size M×N, where r≤min(M, N). In many engineering problems, the entries of the matrix are often corrupted by errors or noise, and some of the entries can even be missed, or only a set of measurements of the matrix is accessible rather than its entries directly. In general, we model the observed matrix B to be a set of linear measurements on matrix A, subject to noise and gross corruptions, i.e.,

where L is a linear operator, and n represents the matrix of corruptions. We seek to recover the true matrix A from B. We now consider the case where L is the identity operator and n is a sparse matrix (i.e., most of its entries are zero) but whose non-zero entries can be practically unbounded. Since the rank r of A is unknown, the problem is to find the matrix of the lowest rank that can generate B when added to an unknown sparse matrix n. Mathematically, for an appropriate choice of parameter η>0, we have the following combinatorial optimization problem to solve:

where η is a positive constant, and ‖·‖0 is the counting norm (i.e., the number of non-zero entries in the matrix). Since this problem cannot be efficiently solved, we consider its convex relaxation instead:

where ‖

σk(A); σk(A) denotes the k-th largest singular value of A; and ‖·‖1 represents the matrix 1-norm (i.e., the sum of absolute values of all entries of the matrix). We call the above convex program the robust principal component analysis (RPCA)[9-10].

3　Joint Matrix Completion with ITMM

Suppose that we have recovered sources in the frequency domain by the ITMM method. We assume that the estimated source matrix is

i with the size of F×T, and F and T are the number of frequency bins and time frames, respectively. The elements of

i are

i(f, t) (1≤f≤F, 1≤t≤T). We assume that

i and the true sources Si have the following relationship:

Then the problem that we want to obtain Si via Eq.(11) is equivalent to the matrix completion problem as in Eq.(8). So, we can use the matrix completion technique for every frequency domain matrix of estimated sources to retrieve the missing data and remove the noise data. We use the method in Ref.[10] to simulate the matrix completion in this paper.

In summary, the joint matrix completion with the ITMM algorithm is as follows:

1) The traditional ITMM-UBSS algorithm is used in the frequency domain. First, estimate the mixing matrix with the mask-STD method[3].Secondly, recover sources with the ITMM algorithm.

2) The matrix completion method is used to estimate each source based on the traditional ITMM-UBSS algorithm in the frequency domain. Then, the time domain sources can be obtained by the inverse STFT.

4　Simulation

To demonstrate the two-stage method, we divide the simulation into two sections: One is the SiSEC 2008 data[11] simulation, and the other is the simulation with random M×6 mixing matrix A for M=2, 3, …, 5.

4.1　SiSEC 2008 data

A linear instantaneous stereo mixture (with positive mixing coefficients) of two drum sources and one bass line, and four female speeches mixed with three sensors are used in this paper.

In this paper, the length of STFT is 1 024; the window overlap is 0.5; all the signals are sampled at 16 kHz; and the sample length is 160 000. Then, the size of

i is 512×313. The algorithms are coded with Matlab and run on an Intel Core(TM) 2 Duo (2.20 GHz) processor.

Fig.2 shows the spectrum of two sensors. The spectrum of three sources, three separated sources with the traditional ITMM, and three separated sources with the proposed method are shown in Fig.3, Fig.4 and Fig.5, respectively. The δi of the traditional ITMM and the proposed method are all set to be δ1=δ2=δ3=0.3.

As shown in Fig.4 and Fig.5, since the matrix completion technique in the proposed method can retrieve the missing data and reject noise, compared with the traditional ITMM algorithm, the spectra of the proposed method can be seen more clearly. The traditional ITMM algorithm must satisfy Assumption 1 and Assumption 2, which may be the main reason for corruption by errors or noise in the time-frequency domain of estimated sources.

To measure the performance, we decompose an estimated signal

=starget+einterf+eartif+enoise[12]. As the measurement of performance, we use the source-to-distortion ratio

The higher values of εSDR, εSIR and εSAR indicate better performance. We must note that the ITMM is a source recovery method, and the mixing matrix estimation problem in UBSS is not considered. We use the method proposed in Ref.[3] to estimate the mixing matrix in this paper.

Tab.1 is the performance evaluation of the ITMM and the proposed algorithms of mixtures with three audio sources, namely S1, S2 and S3, and two sensors under δ1=δ2=δ3=0.3. As shown in Tab.1, the proposed method can obtain a higher performance than the traditional ITMM method except for εSIR of the third source. Tab.2 is the performance evaluation of the ITMM and the proposed methods of mixtures with four speeches and three sensors under δ1=δ2=δ3=δ4=0.3. As shown in Tab.2, the proposed method can obtain a higher performance than the traditional ITMM method except for εSIR of the fourth source. However, the average performance evaluations of the proposed method are higher than those of the traditional ITMM method as shown in Tab.1 and Tab.2. The reason is that the matrix completion of the proposed joint method can complete and extend the frequency bin of estimated sources.

Fig.6 shows the average performances of εSDR, εSIR, εSAR for separating the three speech sources from two sensors under the condition of different values of δi. As shown in Fig.6, εSDR, εSIR, and εSAR can reach their maximum values during different intervals, and obtain high values when 0.2≤δi≤0.4, 0.01≤δi≤0.2, and 0.8≤δi≤1.0, respectively. Our simulation suggests that the speeches and audio waves maintain high quality in the auditory perception when 0.1≤δi≤0.4.

4.2　Simulations with random M×6 mixing matrix A for M=2, 3, 4, …, 5

The random simulations are completed in this section. We choose the random M×6 mixing matrix A for M=2, 3, 4, …, 5. Three speech sources and three music sources are chosen from SiSEC 2008 data[11]. We carry out the random simulations 50 times for each mixing model. Fig.7 shows the simulations of the six sources in the time domain.

Tab.3 is the average quality evaluation of the ITMM and the proposed methods in random cases. As shown in Tab.3, εSDR, εSIRand εSAR of the proposed method are all higher than those of the traditional ITMM algorithm except for εSIR of 2×6 and 4×6 when δ=0.05.

Tab.4 shows the average computational time (in all the cases shown in Tab.3) in random simulations. As shown in Tab.4, the proposed method consumes about 4 s more than the traditional ITMM method in all the cases. The proposed method is a little time-consuming but its average performances are improved.

5　Conclusion

In this paper, we propose a two-stage frequency domain algorithm for underdetermined instantaneous blind source separation. In the first stage, we use the ITMM algorithm[4] to extract the sources in the frequency domain. In the second stage, we use the matrix completion technique to retrieve the missing data and remove the noise data of the estimated source matrix (corrupted by errors or noise) in the time-frequency domain. Simulation results show that the proposed algorithm has a higher performance compared with the conventional ITMM algorithm.

[1]Cobos M, Lopez J J. Maximum a posteriori binary mask estimation for underdetermined source separation using smoothed posteriors [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(7): 2059-2064. DOI:10.1109/tasl.2012.2195654.

[2]Leglaive S, Badeau R, Richard G. Multichannel audio source separation with probabilistic reverberation modeling[C]//IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York, USA, 2015: 1-5.

[3]Peng Tianliang, Chen Yang, Liu Zengli. A time-frequency domain blind source separation method for underdetermined instantaneous mixtures [J]. Circuits Systems & Signal Processing, 2015, 34(12): 3883-3895.

[4]Peng Tianliang, Chen Yang. Inverse truncated mixing matrix (ITMM) algorithm application to underdetermined convolutive blind speech sources separation [C]//IEEE International Conference on Signal Processing, Communications and Computing. Ningbo, China, 2015: 771-776.

[5]Yilmaz O, Rickard S. Blind separation of speech mixtures via time-frequency masking [J]. IEEE Transactions on Signal Processing, 2004, 52(7):1830-1847.

[6]Sawada H, Araki S, Makino S. Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment [J].IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(3): 516-527. DOI:10.1109/tasl.2010.2051355.

[7]Linh-Trung N, Belouchrani A, Abed-Meraim K, et al. Separating more sources than sensors using time-frequency distributions [J]. EURASIP Journal on Applied Signal Processing, 2005, 2005: 2828-2847. DOI:10.1155/asp.2005.2828.

[8]Belouchrani A, Abed-Meraim K, Cardoso J F, et al. A blind source separation technique using second-order statistics [J]. IEEE Transactions on Signal Processing, 1997, 45(2): 434-444. DOI:10.1109/78.554307.

[9]Cai J F, Candès E J, Shen Z. A singular value thresholding algorithm for matrix completion [J]. SIAM Journal on Optimization, 2010, 20(4):1956-1982. DOI:10.1137/080738970.

[10]Candès E J, Recht B. Exact matrix completion via convex optimization [J]. Foundations of Computational Mathematics, 2009, 9(6): 717-772.

[11]Vincent E, Sawada H, Bofill P, et al. First stereo audio source separation evaluation campaign: Data, algorithms and results [M]//Independent Component Analysis and Signal Separation. Berlin: Springer, 2007: 552-559.

[12]Vincent E, Gribonval R, Févotte C. Performance measurement in blind audio source separation [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1462-1469. DOI:10.1109/tsa.2005.858005.

一种基于频域的欠定瞬时混合2步盲分离法

摘要：为了减少传统的截断混合矩阵求逆(ITMM)算法在个别时频点会丢失数据或者产生噪声信号的概率,提出了一种基于频域的2步欠定瞬时盲分离算法．由于现实中存在大量软稀疏(稀疏度不是很大)混合信号,将分离过程分解为ITMM和矩阵补偿2个步骤.首先估计出混合矩阵和利用经典的ITMM算法对混合信号进行初步恢复,然后对初步估计的信号时频矩阵进行矩阵补偿处理,从而达到修补丢失数据和去除多余数据(去噪)的效果．实验仿真验证了所提出的2步分离法相对于传统的ITMM算法能够得到更好的分离效果．此外,对算法的时耗问题进行了研究,相对于传统的ITMM算法,所提算法的时耗增加不到四分之一,却能够得到更好的分离效果．

Biographies：Peng Tianliang(1984—),male,graduate; Chen Yang(corresponding author), male, doctor, professor, cheny@seu.edu.cn.

Foundation item：The National Natural Science Foundation of China (No.60872074).

Citation：Peng Tianliang, Chen Yang. A two-stage frequency-domain blind source separation method for underdetermined instantaneous mixtures[J]．Journal of Southeast University (English Edition),2016,32(2):135-140．doi:10.3969/j.issn.1003-7985.2016.02.001.

A two-stage frequency-domain blind source separation methodfor underdetermined instantaneous mixtures

1 ITMM Method for UBSS

2 Matrix Completion

3 Joint Matrix Completion with ITMM