Assessing the generalization gap of learning-based speech enhancement systems in noisy and reverberant environments
Deep learning-based speech enhancement algorithms have become extremely popular in recent years due to their superior performance over traditional learning-free approaches. However, the performance of learning-based systems typically degrades in acoustic conditions that were not included in the training stage. This is particularly prominent in speech processing due to the high variability of noisy and reverberant mixtures, which is caused by the spectro-temporal characteristics of the target speaker and noise sources, the properties of the room, and the signal-to-noise ratio (SNR).
The generalization of learning-based speech enhancement systems is usually evaluated using an arbitrarily selected speech or noise database that differs from the one used during training. While this provides some information on generalization, the results are heavily influenced by the selected databases. Moreover, when a new database changes the characteristics of the test data, the difficulty of the speech enhancement task can also change. For example, noises with spectro-temporal characteristics similar to speech are more challenging than stationary environmental noises. Therefore, the performance difference between training and testing is also influenced by the change in task difficulty.
In the present work, we propose a novel generalization assessment framework to more accurately estimate the generalization performance of learning-based speech enhancement systems. To disentangle the effect of testing on new data from the change in task difficulty, we train a reference model on the test condition to provide an upper performance limit that reflects the difficulty of that condition. The relative performance difference to the reference model is termed the generalization gap and can be expressed as a percentage. Moreover, to reduce the influence of specific training and testing databases, we repeat the evaluation in a cross-validation fashion using multiple speech, noise and binaural room impulse response (BRIR) databases. We use this framework to evaluate the generalization of a standard feedforward neural network (FFNN) and a state-of-the-art Conv-TasNet under speech, noise and room mismatches. We find that while the Conv-TasNet shows higher performance in matched conditions, the FFNN is more robust to mismatched conditions, even when trained on four different speech, noise and BRIR databases. We also show that speech is the acoustic dimension affecting generalization the most, followed by room and noise, which suggests that diversifying the speech material during training is most important. The present framework quantifies the generalization performance as a percentage and thus facilitates comparison across studies.
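As a minimal illustrative sketch, the generalization gap described above could be computed as the relative shortfall of a model evaluated on a mismatched condition with respect to a reference model trained on that same condition; the function and argument names below are hypothetical and assume a metric where higher scores are better:

```python
def generalization_gap(score_mismatched: float, score_reference: float) -> float:
    """Generalization gap in percent (illustrative sketch, not the paper's code).

    score_mismatched: score of the model trained on a different condition
                      and evaluated on the test condition.
    score_reference:  score of the reference model trained and evaluated
                      on the test condition (upper performance limit).
    """
    # Relative performance difference to the reference model, in percent.
    return 100.0 * (score_reference - score_mismatched) / score_reference


# Example: a mismatched model scoring 0.8 against a reference scoring 1.0
# exhibits a 20% generalization gap.
gap = generalization_gap(0.8, 1.0)
```

Expressing the gap relative to a condition-matched reference, rather than as a raw score difference, is what separates the effect of unseen data from the change in task difficulty.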