This paper proposes a well-motivated nonuniform speaker normalization model that affinely relates the formant frequencies of speakers enunciating the same sound. Under the proposed affine model, the universal-warping function required for normalization is shown to have the same parametric form as the mel scale formula. The parameters of this universal-warping function are estimated from vowel formant data and are shown to be close to those of the commonly used mel scale formula. This reveals an interesting connection between nonuniform speaker normalization and the psychoacoustics-based mel scale. In addition, the affine model fits the vowel formant data better than commonly used ad hoc normalization models. The work is motivated by a desire to improve the performance of speaker-independent speech recognition systems, where speaker normalization is conventionally done by assuming a linear-scaling relationship between the spectra of speakers. The proposed affine relation is extended to describe the relationship between the spectra of speakers enunciating the same sound. On a telephone-based connected digit recognition task, the proposed model provides improved recognition performance over the linear-scaling model.
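The connection between an affine formant relation and a mel-like warping can be illustrated numerically. The sketch below is not the paper's estimation procedure. It assumes the affine relation can be written as (F_A + c) = alpha * (F_B + c), and it uses the widely quoted mel formula 2595 * log10(1 + f/700) rather than the parameters estimated in the paper. The function names, the formant values, and the value of alpha are hypothetical and chosen only for illustration.

```python
import math

# Widely quoted mel-scale constants (assumed here; the paper estimates
# its own warping parameters from vowel formant data).
MEL_SCALE = 2595.0
MEL_OFFSET_HZ = 700.0


def mel(f_hz):
    """Warp a frequency in Hz with the common mel formula."""
    return MEL_SCALE * math.log10(1.0 + f_hz / MEL_OFFSET_HZ)


def affine_map(f_hz, alpha, offset_hz=MEL_OFFSET_HZ):
    """Hypothetical affine speaker relation: (f_a + c) = alpha * (f_b + c)."""
    return alpha * (f_hz + offset_hz) - offset_hz


# Illustrative (made-up) formant frequencies for speaker B, in Hz.
formants_b = [300.0, 1200.0, 2500.0]
alpha = 1.15  # hypothetical speaker-dependent factor

# Formants of speaker A under the assumed affine relation.
formants_a = [affine_map(f, alpha) for f in formants_b]

# In the warped domain, every formant moves by the same amount.
shifts = [mel(fa) - mel(fb) for fa, fb in zip(formants_a, formants_b)]
expected_shift = MEL_SCALE * math.log10(alpha)

for fb, fa, s in zip(formants_b, formants_a, shifts):
    print(f"{fb:7.1f} Hz -> {fa:7.1f} Hz, warped shift = {s:.6f}")
print(f"expected constant shift = {expected_shift:.6f}")
```

Under these assumptions, the warped shift is identical for all three formants, so the speaker difference reduces to a single translation in the warped domain. A purely linear-scaling model corresponds to the special case where the offset c is zero, for which the matching warping is a plain logarithm.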
All Science Journal Classification (ASJC) codes
- Arts and Humanities (miscellaneous)
- Acoustics and Ultrasonics