Voice recognition

Voice combines physiological and behavioural biometric traits. The physiological features of an individual’s voice depend on the shape and size of the organs (vocal tract, mouth, nasal cavities, and lips) used to produce the sound.

Text-dependent voice recognition systems are based on the utterance of a fixed, predetermined phrase. Text-independent voice recognition systems recognize the speaker regardless of what is said. A text-independent system is more difficult to design than a text-dependent system but offers more protection against fraud.
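The distinction between the two modes can be illustrated with a minimal sketch. It assumes that each utterance has already been reduced to a fixed-length feature vector (a "voiceprint") and that two voiceprints are compared by cosine similarity against a threshold. The feature values, the threshold of 0.8, and the function names are hypothetical and chosen only for illustration. Real text-dependent systems usually model phrase-specific acoustics rather than comparing transcribed text, so the phrase check here is a deliberate simplification.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_text_independent(enrolled, sample, threshold=0.8):
    """Accept if the voiceprints are similar enough, whatever was said."""
    return cosine_similarity(enrolled, sample) >= threshold


def verify_text_dependent(enrolled, sample, expected_phrase, spoken_phrase,
                          threshold=0.8):
    """Accept only if the fixed phrase was spoken AND the voiceprints match.

    Simplification: the phrase is compared as text; this is an assumption,
    not how a production system would necessarily work.
    """
    return (spoken_phrase == expected_phrase
            and verify_text_independent(enrolled, sample, threshold))


# Hypothetical voiceprints (made-up numbers, not real measurements).
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.8, 0.2, 0.5]
impostor = [0.1, 0.9, 0.2]

print(verify_text_independent(enrolled, same_speaker))
print(verify_text_independent(enrolled, impostor))
print(verify_text_dependent(enrolled, same_speaker, "open sesame", "hello"))
```

In this sketch the text-independent check accepts the matching speaker for any utterance, while the text-dependent check additionally rejects a matching voice that speaks the wrong phrase.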

The physiological characteristics of human speech are invariant for an individual, but the behavioural characteristics of a person's speech change over time with age, medical conditions (such as a common cold), and emotional state. Voice is also not very distinctive and may not be appropriate for large-scale identification.

A disadvantage of voice-based recognition is that speech features are sensitive to factors such as background noise. Speaker recognition is most appropriate in phone-based applications, but the voice signal over the phone is typically degraded in quality by the microphone and the communication channel.
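The effect of background noise is commonly quantified by the signal-to-noise ratio (SNR). The following sketch, using only the Python standard library, adds white Gaussian noise to a clean signal at a chosen SNR and then measures the SNR that results. The 200 Hz test tone, the 8 kHz sampling rate (typical of narrowband telephony), and the 10 dB target are assumptions made for illustration; real background noise is rarely white, and the function names are hypothetical.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise(samples, snr_db, rng):
    """Return the samples plus white Gaussian noise at the requested SNR (dB)."""
    noise_power = signal_power(samples) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """SNR in dB, computed from the clean signal and its noisy version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(noise))


rng = random.Random(0)  # fixed seed so the run is repeatable

# One second of a 200 Hz tone sampled at 8 kHz (assumed telephone-rate audio).
sample_rate = 8000
clean = [math.sin(2 * math.pi * 200 * t / sample_rate)
         for t in range(sample_rate)]

noisy = add_noise(clean, snr_db=10, rng=rng)
print(round(measured_snr_db(clean, noisy), 1))
```

Lowering `snr_db` simulates a noisier environment. Features extracted from the noisy signal drift further from those of the clean signal, which is one reason recognition accuracy drops in noisy conditions.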

The earliest databases (from before 2005), such as ELSDSR, mainly contain English spoken by non-native speakers. They are based on sentence-reading sessions and provide relatively extensive speech samples suitable for learning person-specific speech characteristics. The Speech Recognition Wiki, written by participants of the Dataprocessing Seminar WS 14/15 at TU-München, covers the recognition of human speech more generally.