## I. Introduction

It is well known that stereophonic sound provides a more pleasant and natural experience than monaural (monophonic) sound because it carries spatial information that conveys the ambience and the relative positions of sound objects and events [1]. Thus, a monaural listening experience can be greatly improved if the corresponding spatial information is provided.

In general, the spatial cues that produce stereophonic effects comprise the inter-channel intensity difference (IID), inter-channel phase difference (IPD), and inter-channel coherence (ICC) [2]. IID and IPD relate to sound localization factors, such as the relative position, while ICC characterizes the wideness of the auditory image [3]. The aim of this study was to regenerate stereophonic effects for a given monaural sound, as shown in Fig. 1. Assuming that a sound source moves around a dotted circle, as indicated in Fig. 1, the sound localization parameters, such as the IID and IPD, are unobtainable with a single-channel microphone [4]. Therefore, this study focused on reproducing the wideness of the stereophonic effect.

Several works have reported on generating a stereophonic signal from a monophonic signal [5]–[7]. Among them, parametric stereo methods [2]–[5] and artificial stereo extension methods based on a Gaussian mixture model (GMM) or a hidden Markov model (HMM) [6], [7] have been successfully applied to this task. However, the parametric stereo method described in [2] relies on additional spatial information of the target stereo channel that is coded in extra bits. Thus, it is not a general solution for the stereo extension of arbitrary input signals. Instead, a parametric stereo method suited to more general use was realized by controlling the ICC parameter without additional bits [5]. Nevertheless, such ICC-based artificial stereo extension methods usually fail to reproduce the desired stereophonic image because a manually chosen ICC parameter can hardly track the ICC of real stereo signals, which varies over time [6]. As an alternative, a statistical approach for estimating the spectra of stereo signals was proposed using an HMM [7]. Previous works reported that the HMM-based stereo extension method outperformed the ICC-based and GMM-based methods because the HMM was more suitable for modeling the time-varying characteristics of the spatial information over frames [7].

Recently, deep neural network (DNN)-based speech and audio processing applications, including speech recognition and speech synthesis, have outperformed their HMM-based counterparts [8]. A DNN has also proved effective in artificial stereo extension, generating spatial information more accurately than the HMM-based method [9]. Nonetheless, this DNN-based method oversimplified the spatial information of the full-band audio by representing it with a low-dimensional feature set, namely 30th-order line spectral frequencies (LSFs). Thus, a more complex model that suitably represents audio signals is required for further improvement of the DNN-based stereo extension.

Multi-band or sub-band representation of audio has been utilized in a variety of applications, such as audio coding [10], [11], audio enhancement [12], audio upmixing [13], and automatic speech recognition (ASR) [14]. For example, several audio coding standards, including MPEG Surround [10] and MPEG-H [11], have incorporated a quadrature mirror filter (QMF) to obtain uniformly distributed and oversampled frequency representations of audio signals. In addition, multi-band representation has been applied to audio enhancement, where it improved the enhancement quality by applying a different degree of noise attenuation to each sub-band [12]. Moreover, a DNN was realized in a sub-band manner for ASR, which reduced the average word error rate compared with a full-band DNN [14].

This kind of sub-band approach was also applied to audio upmixing [13], whereby the rear and center channels in a 5.1-channel audio playback system were modeled by using a single DNN model. The input feature vector for the DNN was constructed by concatenating all the sub-band spectral features. In other words, the DNN that was actually trained was a full-band model fed with sub-band features.

In this paper, a multi-band DNN approach is proposed for extending a mono audio signal into a stereo one. In contrast to [13], in which a single DNN was shared across all the sub-bands, the proposed method models a separate DNN for each sub-band. To this end, the proposed method represents the stereo channel as a set of band-wise log spectral magnitudes and unwrapped phases of the mid/side signals, which comprise the dominant and residual portions of the stereo channel [2]. Specifically, a 32-channel QMF [15] is applied in both the DNN training and the stereo extension stages. Unlike the conventional DNN-based method, the proposed method trains one DNN per sub-band, each of which models the band-wise nonlinearity between the mid and side signals. Once the multi-band DNNs are prepared, the log spectral magnitude and unwrapped phase of each side signal band are estimated via feed-forward decoding at the stereo extension stage. The estimated sub-band signals are then combined into a full-band signal via QMF synthesis. Finally, artificially extended stereo signals are obtained by adding the estimated full-band side signal to, or subtracting it from, the input mono signal.

The performance of the proposed stereo extension method was compared with those of conventional full-band stereo extension methods, including ICC [5], HMM [7], and DNN with LSF features [9]. Moreover, the proposed method was then compared with a multi-band DNN-based audio upmixing method [13]. In addition, to compare the proposed method with a multi-band HMM method, the full-band HMM-based method in [7] was modified into a multi-band HMM-based method.

The remainder of this paper is organized as follows. Section II describes conventional stereo extension methods. In Section III, the multi-band DNN-based stereo extension method is proposed. In Section IV, the performance of the proposed stereo extension method is evaluated and compared with those of conventional full-band and sub-band extension methods. Section V concludes the paper.

## II. Conventional Stereo Extension Methods

In this section, the conventional methods that provide stereophonic effects, namely the ICC-based, HMM-based, and DNN-based stereo extension methods, are introduced.

### 1. ICC-Based Stereo Extension Method

Figure 2 shows a signal flow graph of the conventional stereo extension method based on ICC [5]. As shown in the figure, the extended stereo-channel signals, *x*_{L}(*n*) and *x*_{R}(*n*), can be obtained as [4], [5]

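$$x_{L}(n) = g_{m}\,x_{m}(n) + g_{d}\,\left[x_{m}(n) * d(n)\right] \tag{1}$$

$$x_{R}(n) = g_{m}\,x_{m}(n) - g_{d}\,\left[x_{m}(n) * d(n)\right] \tag{2}$$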

where *x*_{m}(*n*) is a mono signal, and *d*(*n*) is the impulse response of a decorrelator. In addition, * denotes a convolution operation.

Next, the mono signal *x*_{m}(*n*) and the decorrelated signal *x*_{m}(*n*) * *d*(*n*) are weighted by scale factors *g*_{m} and *g*_{d}, which are defined as

$$g_{m} = \sqrt{\frac{1 + \text{ICC}}{2}} \tag{3}$$

$$g_{d} = \sqrt{\frac{1 - \text{ICC}}{2}} \tag{4}$$

Here, the two scale factors are determined by the ICC, which is defined as [4], [5]

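$$\text{ICC} = \frac{\sum_{n=0}^{N-1} x_{L}(n)\,x_{R}(n)}{\sqrt{\sum_{n=0}^{N-1} x_{L}^{2}(n)\,\sum_{n=0}^{N-1} x_{R}^{2}(n)}} \tag{5}$$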

where *N* is the number of samples used for the ICC computation. In this paper, the ICC is set to zero so that the left and right signals are uncorrelated and are therefore perceived as widely separated between the two loudspeakers.
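The relationship between the scale factors and the ICC can be illustrated with a minimal sketch. It assumes a power-preserving choice of gains, *g*_{m} = sqrt((1 + ICC)/2) and *g*_{d} = sqrt((1 − ICC)/2), and it replaces the decorrelator output *x*_{m}(*n*) * *d*(*n*) with independent noise of equal power; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def icc_gains(icc):
    """Gains for the mono and decorrelated paths (assumed: g_m^2 + g_d^2 = 1)."""
    g_m = math.sqrt((1.0 + icc) / 2.0)
    g_d = math.sqrt((1.0 - icc) / 2.0)
    return g_m, g_d


def measure_icc(x_left, x_right):
    """Normalized zero-lag cross-correlation over the N available samples."""
    num = sum(l * r for l, r in zip(x_left, x_right))
    den = math.sqrt(sum(l * l for l in x_left) * sum(r * r for r in x_right))
    return num / den if den > 0.0 else 0.0


def extend_stereo(x_mono, x_decorr, icc):
    """Mix the mono and decorrelated signals into left/right channels."""
    g_m, g_d = icc_gains(icc)
    left = [g_m * m + g_d * d for m, d in zip(x_mono, x_decorr)]
    right = [g_m * m - g_d * d for m, d in zip(x_mono, x_decorr)]
    return left, right


rng = random.Random(0)
x_mono = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Stand-in for the decorrelator output: independent noise with the same power.
x_decorr = [rng.gauss(0.0, 1.0) for _ in range(20000)]

left, right = extend_stereo(x_mono, x_decorr, icc=0.0)
icc_measured = measure_icc(left, right)  # close to the requested value of zero
```

With an ideal decorrelator, the ICC measured on the output equals the value used to set the gains, which is why a single fixed value cannot follow the time-varying ICC of real stereo material.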

However, since the ICC of real stereo signals varies over time, the stereo signals generated by the ICC-based stereo extension method differ from real stereo signals. As an alternative, a statistical method that uses an HMM has been proposed to obtain stereo signals by estimating the side signal for a given mono input signal [7]. The following subsection briefly describes the HMM-based stereo extension method.

### 2. HMM-Based Stereo Extension Method

Figure 3 shows the procedure of the conventional HMM-based stereo extension method that generates artificial stereo signals from mono signals [7]. As shown in the figure, this method is applied in the modified discrete cosine transform (MDCT) domain. That is, the mono signal is segmented into a sequence of consecutive frames, and the *t*-th frame of the mono signal, *x*_{m,t}(*n*), is transformed into the frequency domain using a 2*N*-point MDCT. The MDCT coefficients of the mono signal, *X*_{m,t}(*k*), are grouped into 15 sub-bands, where each sub-band includes *N*/8 MDCT coefficients and overlaps its neighbor by *N*/16 MDCT coefficients. Next, the sub-band energy, *E*_{m,t}(*b*), is extracted from *X*_{m,t}(*k*), and *X*_{m,t}(*k*) is normalized by *E*_{m,t}(*b*), where *b* is the sub-band index and *b* = 0, … , 14.
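The sub-band grouping described above can be sketched as follows. The sketch assumes that band *b* starts at coefficient *b*·*N*/16 and that normalization divides each band by the square root of its energy, so that every normalized band has unit energy; the text does not specify these details, and the function names are hypothetical.

```python
import math


def subband_edges(n_coeffs, n_bands=15):
    """Half-overlapping bands: width N/8, hop N/16, so 15 bands cover N coefficients."""
    width = n_coeffs // 8
    hop = n_coeffs // 16
    return [(b * hop, b * hop + width) for b in range(n_bands)]


def subband_energies(coeffs, edges):
    """Energy of the MDCT coefficients that fall into each band."""
    return [sum(c * c for c in coeffs[lo:hi]) for lo, hi in edges]


def normalize_bands(coeffs, edges, floor=1e-12):
    """Return the band energies and the per-band normalized coefficients."""
    energies = subband_energies(coeffs, edges)
    bands = []
    for (lo, hi), energy in zip(edges, energies):
        scale = math.sqrt(max(energy, floor))  # assumption: unit-energy bands
        bands.append([c / scale for c in coeffs[lo:hi]])
    return energies, bands


N = 1024
# Synthetic stand-in for one frame of MDCT coefficients.
mdct_frame = [math.sin(0.01 * k) + 0.5 for k in range(N)]
edges = subband_edges(N)
energies, normalized = normalize_bands(mdct_frame, edges)
```

For *N* = 1,024 the last band spans coefficients 896 to 1,023, which confirms that 15 half-overlapping bands of *N*/8 coefficients cover the whole spectrum.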

Basically, this method uses a mid/side stereo approach for the extension. That is, the incoming mono signal is regarded as the mid signal, and the side signal is estimated using an HMM. To this end, the sub-band energy of the side signal, ${\widehat{E}}_{\text{s},t}(b)$, is estimated from *E*_{m,t}(*b*) using the HMM under a minimum mean-squared error (MMSE) criterion [7].

Next, the MDCT coefficients of the side signal, ${\widehat{X}}_{\text{s},t}(k)$, are estimated based on both *X*_{m,t}(*k*) and ${\widehat{E}}_{\text{s},t}(b)$. Then, the side signal ${\widehat{x}}_{\text{s},t}(n)$ in the time domain is obtained by applying an *N*-point inverse MDCT (IMDCT). Finally, the stereo signal is obtained by adding ${\widehat{x}}_{\text{s},t}(n)$ to (or subtracting it from) the mono signal, *x*_{m,t}(*n*).

HMM-based stereo extension performs better for both speech and music signals than the ICC-based method. However, a single HMM with limited parameters is insufficient for modeling the nonlinear relationship between the mono and side signals [9]. To cope with this issue, recent approaches have used a DNN to estimate the side signal from the mono input signal. The following subsection summarizes the DNN-based stereo extension method.

### 3. DNN-Based Stereo Extension Method

Figure 4 illustrates a block diagram of the DNN-based mono-to-stereo extension method. Similar to the HMM-based method, the monaural signal is assumed to be the mid signal *x*_{m,t}(*n*) of the extended stereo signal. Then, the side signal *x*_{s,t}(*n*) is estimated by feeding features of *x*_{m,t}(*n*) forward through a DNN whose input and output layers correspond to the features of the mid and side signals, respectively. Specifically, the residual signal for *x*_{m,t}(*n*) is first obtained by performing *M*-th order linear prediction (LP) analysis [16] as

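$${r}_{\text{m},t}(n) = {x}_{\text{m},t}(n) - \sum_{l=1}^{M} {a}_{\text{m},t}(l)\,{x}_{\text{m},t}(n-l) \tag{6}$$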

where *a*_{m,t}(*l*) and *r*_{m,t}(*n*) are the LP coefficients and the residual signal of the mid signal, respectively. Then, the LP coefficients of the mid signal are converted into LSF coefficients [17]. The LSF coefficients of both the mid and side signals of the training set are then used to train the DNN model.

To train the DNN model, unsupervised pre-training is first conducted by stacking multiple restricted Boltzmann machines (RBMs) [18]. Specifically, the input layer connected to the first hidden layer forms a Gaussian–Bernoulli RBM, and a stack of Bernoulli–Bernoulli RBMs is placed behind the Gaussian–Bernoulli RBM. Next, supervised fine-tuning is conducted using the back-propagation method with an MMSE cost function whose targets are the reference LSF features of the side signals [19].

After the DNN has been trained, the LSF coefficients of the side signals are estimated from the DNN and are converted into LP coefficients. Next, the estimated side signal ${\widehat{x}}_{\text{s},t}(n)$ is reconstructed using the residual signals for the mid signals and the estimated LP coefficients as

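$${\widehat{x}}_{\text{s},t}(n) = {r}_{\text{m},t}(n) + \sum_{l=1}^{M} {\widehat{a}}_{\text{s},t}(l)\,{\widehat{x}}_{\text{s},t}(n-l) \tag{7}$$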

where
${\widehat{a}}_{\text{s},t}(l)$ is the *l*-th LP coefficient estimated from the DNN model. Finally, the stereophonic signals are obtained by adding and subtracting the mid and side signals.
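The LP analysis and synthesis steps of this method can be sketched as below. The sketch assumes the usual convention in which the prediction is a weighted sum of past samples, with zero initial conditions; the LSF conversion and the DNN are omitted, and the "estimated" side coefficients `a_side_hat` are hypothetical values chosen only for illustration.

```python
import math


def lp_residual(x, a):
    """r(n) = x(n) - sum_{l=1..M} a(l) x(n-l), with x(n) = 0 for n < 0."""
    order = len(a)
    r = []
    for n in range(len(x)):
        pred = sum(a[l - 1] * x[n - l] for l in range(1, order + 1) if n - l >= 0)
        r.append(x[n] - pred)
    return r


def lp_synthesize(r, a):
    """y(n) = r(n) + sum_{l=1..M} a(l) y(n-l): all-pole filtering of the residual."""
    order = len(a)
    y = []
    for n in range(len(r)):
        pred = sum(a[l - 1] * y[n - l] for l in range(1, order + 1) if n - l >= 0)
        y.append(r[n] + pred)
    return y


x_mid = [math.sin(0.3 * n) for n in range(64)]
a_mid = [0.5, -0.2]        # illustrative LP coefficients of the mid signal
a_side_hat = [0.3, -0.1]   # hypothetical coefficients a DNN might return

residual = lp_residual(x_mid, a_mid)
x_mid_rebuilt = lp_synthesize(residual, a_mid)        # recovers the mid signal
x_side_hat = lp_synthesize(residual, a_side_hat)      # differently shaped spectrum
```

Filtering the mid residual with the mid coefficients returns the mid signal exactly, so any difference between the mid and estimated side signals comes only from the estimated spectral envelope. This illustrates why a 30th-order envelope is a coarse description of the spatial information.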

The DNN-based method with LSF features obtained slightly better results than the HMM-based method. As described thus far, the conventional methods operate with full-band spectral features. However, it is known that the stereo characteristics of audio signals are localized in certain frequency bands [20], [21]. Subjective listening tests in [21] indicate that the differences between the left and right channels are hardly noticeable in the low frequency band below 250 Hz and in the high frequency band above 5 kHz. On the other hand, the differences are easily noticeable in the frequency band around 1 kHz, which enables the left channel audio to be differentiated from the right one [21]. The proposed multi-band stereo extension method is motivated by this frequency-dependent similarity and difference between the channels of stereo audio. The expectation is that modeling a separate DNN for each sub-band will further improve the performance of the full-band DNN-based stereo extension method.

## III. Proposed Stereo Extension Method

The DNN-based stereo extension method that incorporates multi-band processing is proposed in this section. Figure 5 depicts a block diagram of the proposed method. As shown in the figure, the proposed method estimates the side signal *x*_{s,t}(*n*) using multi-band DNNs, each of which acts as a mapping function between the mid and side signals of one sub-band. Similar to the conventional stereo extension methods based on the HMM and on the DNN with LSF features described in Section II, the proposed method consists of training and stereo extension stages. Each stage is detailed in the following subsections.

### 1. Multi-band DNN Training

The set of stereo audio signals is first prepared in a mid/side form to train the multi-band DNNs. Next, each pair of mid and side signals is divided into *B* sub-band signals, ${x}_{\text{m},t}^{b}(n)$ and ${x}_{\text{s},t}^{b}(n)$, respectively, using QMF analysis [15]. Then, both ${x}_{\text{m},t}^{b}(n)$ and ${x}_{\text{s},t}^{b}(n)$ are transformed into complex spectra, ${X}_{\text{m},t}^{b}(k)$ and ${X}_{\text{s},t}^{b}(k)$, using a short-time Fourier transform (STFT) with a *K*-point fast Fourier transform (FFT). Their respective log spectral magnitudes and unwrapped phases, $\mathrm{log}\left|{X}_{\text{m},t}^{b}(k)\right|,\text{\hspace{0.17em}}\mathrm{log}\left|{X}_{\text{s},t}^{b}(k)\right|,\text{\hspace{0.17em}}U\left[\angle {X}_{\text{m},t}^{b}(k)\right],$ and $U\left[\angle {X}_{\text{s},t}^{b}(k)\right]$, are then extracted. Here, *U*[·] indicates a phase unwrap function [22]. Prior to DNN training, these log spectral magnitudes and unwrapped phases are scaled to have values between zero and one as

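$${L}_{\text{m},t}^{b}(k) = \frac{\mathrm{log}\left|{X}_{\text{m},t}^{b}(k)\right|}{\alpha} \tag{8}$$

$${P}_{\text{m},t}^{b}(k) = \frac{U\left[\angle {X}_{\text{m},t}^{b}(k)\right]}{\beta} \tag{9}$$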

where *α* is the maximum log spectral magnitude, which can be defined as log(2^{15} × *K*) [23]. Moreover, *β* is the maximum unwrapped phase value, which is found empirically from the unwrapped phase values of the training dataset. Note that ${L}_{\text{s},t}^{b}(k)$ and ${P}_{\text{s},t}^{b}(k)$, which are the normalized versions of $\mathrm{log}\left|{X}_{\text{s},t}^{b}(k)\right|$ and $U\left[\angle {X}_{\text{s},t}^{b}(k)\right]$, are also obtained in a manner similar to (8) and (9). The scaled features of the mid and side signals constitute the feature vectors of the *b*-th sub-band DNN as

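$${f}_{\text{m},t}^{b} = \left[{L}_{\text{m},t}^{b}(0), \ldots, {L}_{\text{m},t}^{b}(K/2),\; {P}_{\text{m},t}^{b}(0), \ldots, {P}_{\text{m},t}^{b}(K/2)\right] \tag{10}$$

$${f}_{\text{s},t}^{b} = \left[{L}_{\text{s},t}^{b}(0), \ldots, {L}_{\text{s},t}^{b}(K/2),\; {P}_{\text{s},t}^{b}(0), \ldots, {P}_{\text{s},t}^{b}(K/2)\right] \tag{11}$$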

where the size of both ${f}_{\text{m},t}^{b}$ and ${f}_{\text{s},t}^{b}$ is *K* + 2. Next, the multi-band DNNs are trained for each *b*-th sub-band using a sequence of multiple feature vectors, ${F}_{\text{m},t}^{b}=[{f}_{\text{m},t-S}^{b}\cdots {f}_{\text{m},t}^{b}\cdots \text{\hspace{0.17em}\hspace{0.17em}}{f}_{\text{m},t+S}^{b}],$ and ${f}_{\text{s},t}^{b}$ as the input and output layers of the networks, respectively. Specifically, the multi-band DNNs are first initialized as a deep generative model by stacking multiple restricted Boltzmann machines (RBMs) [18]. Similar to the DNN-LSF method introduced in Section II.3, the input layer of linear variables is represented as a Gaussian–Bernoulli RBM, and a stack of Bernoulli–Bernoulli RBMs is placed behind the Gaussian–Bernoulli RBM. After the initialization of the multi-band DNNs has finished, supervised fine-tuning is conducted using the back-propagation method with the MMSE cost function between the reference and estimated values of ${f}_{\text{s},t}^{b}$. Note that the fine-tuning alternates between feed-forward decoding and back-propagation, monitored on the development dataset, until the mean-squared error falls to a predefined threshold.
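The feature construction for one sub-band can be sketched as follows. The sketch assumes that the scaling is a plain division by *α* and *β*, that only bins 0 to *K*/2 are kept (which gives the stated size of *K* + 2), and that edge frames are repeated when the context window runs past the start or end of the signal; `BETA` is a hypothetical value, and the QMF analysis and STFT are replaced by synthetic spectra.

```python
import cmath
import math


def unwrap_phase(phases):
    """Remove 2*pi jumps between neighboring bins."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2.0 * math.pi * round(delta / (2.0 * math.pi))
        out.append(out[-1] + delta)
    return out


def band_features(spectrum, alpha, beta, floor=1e-12):
    """Scaled log magnitudes and unwrapped phases of bins 0..K/2 (K + 2 values)."""
    half = spectrum[: len(spectrum) // 2 + 1]
    log_mag = [math.log(max(abs(c), floor)) / alpha for c in half]
    phase = [p / beta for p in unwrap_phase([cmath.phase(c) for c in half])]
    return log_mag + phase


def splice(frames, t, s):
    """Concatenate frames t-s..t+s; edge frames are repeated (an assumption)."""
    last = len(frames) - 1
    out = []
    for i in range(t - s, t + s + 1):
        out.extend(frames[min(max(i, 0), last)])
    return out


K = 64
ALPHA = math.log(2 ** 15 * K)  # maximum log magnitude, as defined in the text
BETA = 100.0                   # hypothetical maximum unwrapped phase

# Synthetic stand-in for the K-point spectra of one sub-band over 20 frames.
spectra = [[cmath.rect(1.0 + k + t, 0.3 * k) for k in range(K)] for t in range(20)]
frames = [band_features(spec, ALPHA, BETA) for spec in spectra]
dnn_input = splice(frames, t=0, s=5)  # (2S + 1) * (K + 2) values
```

Each sub-band network therefore sees only its own band over a short temporal context, in contrast to a single network fed with the concatenation of all sub-band features.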

### 2. Stereo Extension with Multi-band DNN

In the stereo extension stage, the input mono channel audio signal *x*_{m,t}(*n*) is divided into *B* sub-band signals ${x}_{\text{m},t}^{b}(n)$ via QMF analysis [15]. Each sub-band signal is transformed into a log spectral magnitude and an unwrapped phase. They are then scaled to the range of zero to one, thus obtaining ${f}_{\text{m},t}^{b}=\left\{{L}_{\text{m},t}^{b}(k),{P}_{\text{m},t}^{b}(k)\right\}.$ Then, a sequence of multiple feature vectors, ${F}_{\text{m},t}^{b}=[{f}_{\text{m},t-S}^{b}\cdots \text{\hspace{0.17em}\hspace{0.17em}}{f}_{\text{m},t}^{b}\cdots \text{\hspace{0.17em}\hspace{0.17em}}{f}_{\text{m},t+S}^{b}],$ is assigned to the input layer of the *b*-th sub-band DNN model. In other words, ${F}_{\text{m},t}^{b}$ is fed forward through the *b*-th sub-band DNN model to estimate the features for the side signal, ${\widehat{f}}_{\text{s},t}^{b}=\left\{{\widehat{L}}_{\text{s},t}^{b}(k),{\widehat{P}}_{\text{s},t}^{b}(k)\right\}$, at the last layer of the network. After decoding is finished, ${\widehat{L}}_{\text{s},t}^{b}(k)$ and ${\widehat{P}}_{\text{s},t}^{b}(k)$ are denormalized as

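$$\mathrm{log}\left|{\widehat{X}}_{\text{s},t}^{b}(k)\right| = \alpha\,{\widehat{L}}_{\text{s},t}^{b}(k) \tag{12}$$

$$U\left[\angle {\widehat{X}}_{\text{s},t}^{b}(k)\right] = \beta\,{\widehat{P}}_{\text{s},t}^{b}(k) \tag{13}$$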

Next, the exponential function is applied to $\mathrm{log}\left|{\widehat{X}}_{\text{s},t}^{b}(k)\right|$ and the phase wrap function is applied to $U\left[\angle {\widehat{X}}_{\text{s},t}^{b}(k)\right]$ to obtain ${\widehat{X}}_{\text{s},t}^{b}(k)=\left|{\widehat{X}}_{\text{s},t}^{b}(k)\right|\angle {\widehat{X}}_{\text{s},t}^{b}(k)$. ${\widehat{X}}_{\text{s},t}^{b}(k)$ is then applied to the inverse STFT, thus obtaining ${\widehat{x}}_{\text{s},t}^{b}(n).$ Consequently, the full-band side signal ${\widehat{x}}_{\text{s},t}(n)$ is estimated by conducting QMF synthesis [15] on ${\widehat{x}}_{\text{s},t}^{b}(n)$ over all *B* sub-bands. Finally, stereo extension is conducted by converting the mid/side format into a left/right format with ${\widehat{x}}_{\text{s},t}(n)$ as

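$${\widehat{x}}_{\text{L},t}(n) = {x}_{\text{m},t}(n) + {\widehat{x}}_{\text{s},t}(n) \tag{14}$$

$${\widehat{x}}_{\text{R},t}(n) = {x}_{\text{m},t}(n) - {\widehat{x}}_{\text{s},t}(n) \tag{15}$$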
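The decoding steps of the stereo extension stage can be sketched as below. The sketch assumes that denormalization is a plain multiplication by *α* and *β*, and it takes the mono input directly as the mid signal, with the left and right channels formed by a plain sum and difference. The inverse STFT and QMF synthesis are omitted, and the function names and the value of `BETA` are hypothetical.

```python
import cmath
import math


def decode_bins(l_hat, p_hat, alpha, beta):
    """Denormalize, then rebuild complex bins from log magnitude and phase.

    cmath.rect accepts any angle, so the phase wrap is implicit.
    """
    return [cmath.rect(math.exp(l * alpha), p * beta) for l, p in zip(l_hat, p_hat)]


def mid_side_to_left_right(x_mid, x_side):
    """Left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(x_mid, x_side)]
    right = [m - s for m, s in zip(x_mid, x_side)]
    return left, right


ALPHA = math.log(2 ** 15 * 64)
BETA = 100.0  # hypothetical maximum unwrapped phase

# Round trip for a single bin: scale, then decode.
true_bin = 3.0 + 4.0j
l_scaled = math.log(abs(true_bin)) / ALPHA
p_scaled = cmath.phase(true_bin) / BETA
decoded_bin = decode_bins([l_scaled], [p_scaled], ALPHA, BETA)[0]

left, right = mid_side_to_left_right([1.0, 2.0], [0.5, -1.0])
```

Because the mid signal is passed through unchanged, averaging the two output channels returns the original mono input; only the estimated side signal determines the perceived width.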

## IV. Performance Evaluation

In this section, the performance of the proposed stereo extension method is evaluated in terms of both objective and subjective quality by measuring the log spectral distortion (LSD) [24] and by conducting a multiple stimuli with hidden reference and anchor (MUSHRA) test [25]. The proposed method is first compared with conventional full-band stereo extension methods, including ICC [5], HMM [7], and DNN with LSF features (DNN-LSF) [9]. Then, the proposed method is compared with a multi-band DNN-based audio upmixing (DNN-AU) approach [13]. To compare the proposed method with a multi-band HMM method, the full-band HMM-based method in [7] is modified into a multi-band HMM-based method by replacing the sub-band DNNs shown in Fig. 5 with sub-band HMMs. In other words, an MDCT is applied to each sub-band, and the MDCT coefficients of each sub-band are used to train an HMM for that sub-band; this method is hereafter referred to as MB-HMM.

### 1. Experimental Setup

To prepare the stereo extension methods, 90 min of speech and 2 h of music data recorded in a stereo format were used. The speech databases used in training consisted of 20 min of Sound Quality Assessment Material (SQAM) [25], 50 min of the ETRI SWB Korean speech corpus [26], and 20 min of the TSP speech DB [27]. The music databases used in the training consisted of 40 min of SQAM, 20 min of orchestra, 30 min of popular music, and 30 min of audio-form user-created content (UCC). The total 3.5-h set was then split into training, development, and test sets at a ratio of 7:2:1. There was no overlap among the split sets, and every database clip was down-sampled to 32 kHz with 16-bit resolution.

The audio signal was segmented into a sequence of consecutive frames. Each frame was 32 ms long, and adjacent frames overlapped by 16 ms. Thus, a 2,048-point MDCT (*N* = 1,024 in Section II.2) was used to transform the time-domain signal into the frequency domain. These MDCT coefficients were supplied to the HMM-based stereo extension method as feature vectors. To train a DNN for DNN-LSF, the LSF feature extraction method was applied to each audio frame, where the order of LSF was set to *M* = 30. The DNN had an input layer with 330 nodes, five hidden layers with 2,048 nodes each, and an output layer with 30 nodes.

The multi-band stereo extension methods in the experiments commonly employed a 32-channel QMF (*B* = 32 in Section III). To implement the DNN-AU in [13], two DNNs were constructed to estimate the left and right channel signals, respectively, from the mono signal. As mentioned in Section I, the feature vectors for the DNNs in DNN-AU were constructed by concatenating all the sub-band spectral features. The size of the combined feature vector for a frame was 1,056 because each of the 32 sub-bands contributed 33 spectral magnitudes. The combined feature vector was then spliced across 11 neighboring frames. Thus, each DNN in DNN-AU had an input layer with 11,616 nodes and five hidden layers with 2,048 nodes each.

In the proposed stereo extension method, each sub-band DNN was trained using feature vectors that were obtained by applying a 64-point FFT to each channel signal after QMF analysis. This 64-dimensional spectral feature vector was spliced across 11 neighboring frames (*S* = 5 in Section III), and the spliced vector was used as the input layer of each DNN. In addition, each DNN had three hidden layers with 256 nodes each. The learning rate and number of iterations were set to 0.008 and 100, respectively, for training the DNNs for DNN-LSF, DNN-AU, and the proposed method.
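The sizes quoted in this setup follow from a few lines of arithmetic, reproduced here as a check. The constant names are illustrative. The 330-node input of DNN-LSF is consistent with an 11-frame context, although the text states the 11-frame splicing explicitly only for DNN-AU and the proposed method.

```python
SAMPLE_RATE_HZ = 32_000

frame_len = SAMPLE_RATE_HZ * 32 // 1000   # 32 ms frame  -> 1,024 samples
frame_hop = SAMPLE_RATE_HZ * 16 // 1000   # 16 ms overlap -> 512 samples

context = 2 * 5 + 1                       # S = 5 -> 11 spliced frames

lsf_input_nodes = 30 * context            # M = 30 LSFs per frame -> 330

bins_per_band = 64 // 2 + 1               # 64-point FFT -> 33 magnitudes
au_frame_size = 32 * bins_per_band        # 32 QMF bands -> 1,056
au_input_nodes = au_frame_size * context  # -> 11,616

total_minutes = 90 + 120                  # speech + music = 210 min (3.5 h)
split_minutes = [total_minutes * r // 10 for r in (7, 2, 1)]  # 147, 42, 21
```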

### 2. Quality Evaluation

The artificial stereo extension methods, including the proposed method, were compared in terms of their objective and subjective qualities. To measure the objective quality, the LSD between the true stereo signal and its extended counterpart was computed as defined in [24].

Here, $\left|{X}_{c,t}(k)\right|$ and $\left|{\widehat{X}}_{c,t}(k)\right|$ are the *k*-th spectral magnitudes of *x*_{c,t}(*n*) and ${\widehat{x}}_{c,t}(n)$ for the *c*-th channel, respectively. Moreover, *T* indicates the total number of frames. Figure 6 compares the average LSD, measured against the reference stereo spectra, of the proposed multi-band DNN-based method with those of the conventional methods, namely ICC, HMM, DNN-LSF, MB-HMM, and DNN-AU. As shown in the figure, the proposed stereo extension method has a lower LSD value than the conventional methods.
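For readers who want to reproduce the objective measure, the following is a minimal sketch of one common LSD definition for a single channel: the root-mean-square difference between the reference and estimated log spectra in dB, averaged over *T* frames. The exact formula of [24] is not reproduced in this text, so this form, the dB scaling, and the function name are assumptions.

```python
import math


def log_spectral_distortion(ref_mags, est_mags, floor=1e-12):
    """Mean over frames of the RMS log-spectral difference in dB.

    ref_mags and est_mags are lists of frames; each frame is a list of
    spectral magnitudes |X(k)| for one channel.
    """
    total = 0.0
    for ref, est in zip(ref_mags, est_mags):
        squared = 0.0
        for r, e in zip(ref, est):
            diff = 20.0 * math.log10(max(r, floor)) - 20.0 * math.log10(max(e, floor))
            squared += diff * diff
        total += math.sqrt(squared / len(ref))
    return total / len(ref_mags)


reference = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scaled = [[10.0 * v for v in frame] for frame in reference]

lsd_identical = log_spectral_distortion(reference, reference)  # 0 dB
lsd_scaled = log_spectral_distortion(reference, scaled)        # 20 dB gain error
```

Under this definition, a uniform tenfold magnitude error yields 20 dB, so lower values indicate spectra closer to those of the true stereo channels.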

For the subjective evaluation, a MUSHRA test [25] was conducted with ten listeners (seven males and three females) who had no auditory diseases. The classes of the MUSHRA test were as follows: 1) a hidden reference (true stereo audio signal); 2) an anchor signal processed with a low-pass filter of 7 kHz and 3) one processed with a low-pass filter of 14 kHz; 4) a mono audio signal; 5) artificial stereo audio signals extended by ICC [5]; 6) HMM [7]; 7) MB-HMM; 8) DNN-LSF [9]; 9) DNN-AU [13]; and 10) the proposed method using multi-band DNN. All audio clips were hidden and randomly selected.

Figure 7 shows the results of the MUSHRA listening test. As shown in the figure, the multi-band approaches were better than full-band approaches. That is, the average MUSHRA score of the multi-band HMM (MB-HMM) was higher than that of the full-band HMM. The proposed multi-band extension method and DNN-AU produced higher average MUSHRA scores than DNN-LSF (that is, the full-band extension method). However, DNN-AU had an average MUSHRA score similar to MB-HMM. A comparison of the proposed method with other sub-band methods, such as MB-HMM and DNN-AU, showed that the proposed method provided significantly higher average MUSHRA scores.

From the results of the objective and subjective evaluations, it is concluded that the proposed multi-band DNN-based stereo extension method can extend mono audio into stereo with a higher quality than conventional methods, including full-band HMM, sub-band HMM, and full-band DNN methods.

## V. Conclusion

In this paper, a stereo extension method that applies multi-band DNNs was proposed. The method utilizes QMF analysis to train a DNN for each sub-band so that a more realistic side signal can be estimated for the extension. In the extension stage, the sub-band signals of an input mono signal are decoded by the DNNs to estimate the corresponding side signal of each sub-band. After the sub-band side signals are merged by QMF synthesis, artificial stereo signals are finally obtained by adding the estimated side signal to, or subtracting it from, the mono signal. Objective and subjective evaluations were conducted to demonstrate the performance of the proposed method. The results of the LSD and MUSHRA evaluations showed that the proposed stereo extension method significantly outperformed the conventional stereo extension methods.