Audio-visual speech processing system for Polish applicable to human-computer interaction

Tomasz Jadczyk


This paper describes an audio-visual speech recognition system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems, with three main areas: audio feature extraction, visual feature extraction and, subsequently, audio-visual speech integration. We present MFCC features for the audio stream with the standard HMM modeling technique, then describe appearance-based and shape-based visual features. Subsequently we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult acoustic environments. With an Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM) we improve system accuracy, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio drops to 0 dB.
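As an illustration of the two integration techniques named above, the sketch below contrasts feature concatenation (one joint observation vector) with model fusion via a multistream HMM (per-stream emission log-likelihoods combined with stream weights). This is a minimal sketch, not the paper's implementation; all function and variable names are hypothetical.

```python
import math

def concat_features(audio_frame, visual_frame):
    """Feature concatenation: join per-frame audio (e.g. MFCC) and visual
    (e.g. AAM) feature vectors into a single observation vector."""
    return list(audio_frame) + list(visual_frame)

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def multistream_loglik(audio_obs, visual_obs, audio_pdf, visual_pdf, lam_audio):
    """Model fusion: a multistream HMM state scores each stream separately
    and combines the log-likelihoods with stream weights; lam_audio is
    typically lowered as the acoustic SNR degrades (e.g. toward 0 dB)."""
    la = diag_gauss_loglik(audio_obs, *audio_pdf)   # audio_pdf = (mean, var)
    lv = diag_gauss_loglik(visual_obs, *visual_pdf)
    return lam_audio * la + (1.0 - lam_audio) * lv
```

In a decoder, the combined score would replace the single-stream emission log-probability inside the Viterbi recursion; the stream weights govern how much the recognizer trusts the acoustic evidence relative to the lip-region evidence.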


audio-visual speech recognition, visual feature extraction, human-computer interaction




Adjoudani A., Benoit C.: On the integration of auditory and visual parameters in an HMM-based ASR. In: Speechreading by humans and machines, pp. 461–471. Springer, 1996.

Barker J., Marxer R., Vincent E., Watanabe S.: The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 504–511. 2015.

Bilmes J., Bartels C.: Graphical model architectures for speech recognition. In: Signal Processing Magazine, IEEE, vol. 22(5), pp. 89–100, 2005. ISSN 1053-5888.

Castrillón Santana M., Déniz Suárez O., Hernández Tejera M., Guerra Artal C.: ENCARA2: Real-time Detection of Multiple Faces at Different Resolutions in Video Streams. In: Journal of Visual Communication and Image Representation, pp. 130–140, 2007.

Chan M.T., Zhang Y., Huang T.S.: Real-time lip tracking and bimodal continuous speech recognition. In: Multimedia Signal Processing, 1998 IEEE Second Workshop on, pp. 65–70. IEEE, 1998.

Cootes T., Edwards G., Taylor C.: Active appearance models. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23(6), pp. 681–685, 2001. ISSN 0162-8828.

Cootes T.F., Taylor C.J., Cooper D.H., Graham J.: Active shape models: their training and application. In: Computer vision and image understanding, vol. 61(1), pp. 38–59, 1995.

Cox S.J., Harvey R., Lan Y., Newman J.L., Theobald B.J.: The challenge of multispeaker lip-reading. In: AVSP, pp. 179–184. Citeseer, 2008.

Czap L.: Lip representation by image ellipse. In: The Proceedings of the 6th International Conference on Spoken Language Processing (Volume IV). 2000.

O'Donovan A., Duraiswami R., Neumann J.: Microphone arrays as generalized cameras for integrated audio visual processing. In: Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1–8. IEEE, 2007.

Dupont S., Luettin J.: Audio-visual speech modeling for continuous speech recognition. In: Multimedia, IEEE Transactions on, vol. 2(3), pp. 141–151, 2000. ISSN 1520-9210.

Gałka J., Ziółko M.: Wavelet parametrization for speech recognition. In: Proceedings of an ISCA tutorial and research workshop on non-linear speech processing NOLISP 2009, VIC. 2009.

Gopinath R.A.: Maximum likelihood modeling with Gaussian distributions for classification. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 2, pp. 661–664. IEEE, 1998.

Gowdy J., Subramanya A., Bartels C., Bilmes J.: DBN based multi-stream models for audio-visual speech recognition. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, vol. 1, pp. I–993–6 vol. 1. 2004. ISSN 1520-6149. URL https://doi.org/10.1109/ICASSP.2004.1326155.

Gurbuz S., Tufekci Z., Patterson E., Gowdy J.N.: Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In: Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on, vol. 1, pp. 177–180. IEEE, 2001.

Harte N., Gillen E.: TCD-TIMIT: An audio-visual corpus of continuous speech. In: IEEE Transactions on Multimedia, vol. 17(5), pp. 603–615, 2015.

Hazen T.: Visual model structures and synchrony constraints for audio-visual speech recognition. In: Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14(3), pp. 1082–1089, 2006. ISSN 1558-7916.

Hazen T.J., Saenko K., La C.H., Glass J.R.: A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In: Proceedings of the 6th international conference on Multimodal interfaces, pp. 235–242. ACM, 2004.

Hermansky H.: Perceptual linear predictive (PLP) analysis of speech. In: The Journal of the Acoustical Society of America, vol. 87(4), pp. 1738–1752, 1990.

Hernando J.: Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition. In: Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 2, pp. 1267–1270. IEEE, 1997.

Igras M., Ziółko B., Jadczyk T.: Audiovisual database of Polish speech recordings. In: Studia Informatica, vol. 33(2B), pp. 163–172, 2013.

Frankel J., Wester M., King S.: Articulatory feature recognition using dynamic Bayesian networks. In: Computer Speech & Language, vol. 21(4), pp. 620–640, 2007.

Kubanek M., Bobulski J., Adrjanowicz L.: Lip tracking method for the system of audio-visual Polish speech recognition. In: Artificial Intelligence and Soft Computing, pp. 535–542. Springer, 2012.

Lan Y., Theobald B.J., Harvey R., Ong E.J., Bowden R.: Improving visual features for lip-reading. In: AVSP, pp. 7–3. 2010.

Luettin J., Thacker N.A.: Speechreading using probabilistic models. In: Computer Vision and Image Understanding, vol. 65(2), pp. 163–178, 1997.

Marcheret E., Libal V., Potamianos G.: Dynamic Stream Weight Modeling for Audio-Visual Speech Recognition. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4, pp. IV-945–IV-948. 2007. ISSN 1520-6149.

Matthews I., Potamianos G., Neti C., Luettin J.: A comparison of model and transform-based visual features for audio-visual LVCSR. In: IEEE International Conference on Multimedia and Expo (ICME), p. 210. IEEE, 2001.

McCowan I., Gatica-Perez D., Bengio S., Lathoud G., Barnard M., Zhang D.: Automatic analysis of multimodal group actions in meetings. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27(3), pp. 305–317, 2005.

McGurk H., MacDonald J.: Hearing lips and seeing voices. In: Nature, vol. 264(5588), pp. 746–748, 1976.

Messer K., Kittler J., Sadeghi M., Marcel S., Marcel C., Bengio S., Cardinaux F., Sanderson C., Czyz J., Vandendorpe L., et al.: Face verification competition on the XM2VTS database. In: Audio-and Video-Based Biometric Person Authentication, pp. 964–974. Springer, 2003.

Minka T., Winn J., Guiver J., Webster S., Zaykov Y., Yangel B., Spengler A., Bronskill J.: Infer.NET 2.6, 2014. Microsoft Research Cambridge.

Missaoui O., Frigui H.: Optimal feature weighting for the Continuous HMM. In: Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pp. 1–4. IEEE, 2008.

Mroueh Y., Marcheret E., Goel V.: Deep Multimodal Learning for Audio-Visual Speech Recognition. In: arXiv preprint arXiv:1501.05396, 2015.

Murphy K.P.: Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California, Berkeley, 2002.

Nakamura S., Ito H., Shikano K.: Stream weight optimization of speech and lip image sequence for audio-visual speech recognition, 2000.

Neti C., Potamianos G., Luettin J., Matthews I., Glotin H., Vergyri D., Sison J., Mashari A.: Audio visual speech recognition. Tech. rep., IDIAP, 2000.

Newman J.L., Cox S.J.: Automatic visual-only language identification: A preliminary study. In: Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pp. 4345–4348. IEEE, 2009.

Ngiam J., Khosla A., Kim M., Nam J., Lee H., Ng A.Y.: Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. 2011.

Noda K., Yamaguchi Y., Nakadai K., Okuno H.G., Ogata T.: Audio-visual speech recognition using deep learning. In: Applied Intelligence, vol. 42(4), pp. 722–737, 2015.

Palecek K., Chaloupka J.: Audio-visual speech recognition in noisy audio environments. In: Telecommunications and Signal Processing (TSP), 2013 36th International Conference on, pp. 484–487. IEEE, 2013.

Papandreou G., Maragos P.: Adaptive and constrained algorithms for inverse compositional active appearance model fitting. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. IEEE, 2008.

Petajan E.D.: Automatic lipreading to enhance speech recognition (speech reading). Ph.D. thesis, University of Illinois at Urbana-Champaign, 1984.

Potamianos G., Graf H.P.: Discriminative training of HMM stream exponents for audio-visual speech recognition. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 6, pp. 3733–3736. IEEE, 1998.

Potamianos G., Neti C.: Improved ROI and within frame discriminant features for lipreading. In: Image Processing, 2001. Proceedings. 2001 International Conference on, vol. 3, pp. 250–253. IEEE, 2001.

Potamianos G., Neti C., Gravier G., Garg A., Senior A.: Recent advances in the automatic recognition of audiovisual speech. In: Proceedings of the IEEE, vol. 91(9), pp. 1306–1326, 2003. ISSN 0018-9219. URL https://doi.org/10.1109/JPROC.2003.817150.

Rabiner L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, vol. 77(2), pp. 257–286, 1989.

Rosenblum L.D., Saldana H.M.: Time-varying information for visual speech perception. In: Hearing by eye II, pp. 61–81, 1998.

Saenko K., Livescu K.: An asynchronous DBN for audio-visual speech recognition. In: Spoken Language Technology Workshop, 2006. IEEE, pp. 154–157. 2006.

Schwartz J.L., Robert-Ribes J., Escudier P.: Ten years after Summerfield: a taxonomy of models for audio-visual fusion in speech perception. In: Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech, pp. 85–108, 1998.

Shivappa S.T., Trivedi M.M., Rao B.D.: Audiovisual information fusion in human–computer interfaces and intelligent environments: A survey. In: Proceedings of the IEEE, vol. 98(10), pp. 1692–1715, 2010.

Summerfield A.Q.: Some preliminaries to a comprehensive account of audio-visual speech perception. In: B. Dodd, R. Campbell, eds., Hearing by eye: The psychology of lip-reading. Lawrence Erlbaum Associates, Inc, 1987.

Teissier P., Robert-Ribes J., Schwartz J.L., Guérin-Dugué A.: Comparing models for audiovisual fusion in a noisy-vowel recognition task. In: Speech and Audio Processing, IEEE Transactions on, vol. 7(6), pp. 629–642, 1999.

Tremain T.E.: The government standard linear predictive coding algorithm: LPC-10. In: Speech Technology, vol. 1(2), pp. 40–49, 1982.

Viola P., Jones M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1, pp. I–511. IEEE, 2001.

Young S., Evermann G., Gales M., Hain T., Kershaw D., Liu X.A., Moore G., Odell J., Ollason D., Povey D., et al.: The HTK book (for HTK version 3.4). Cambridge University Engineering Department, 2006.

Ziółko M., Gałka J., Ziółko B., Jadczyk T., Skurzok D., Masior M.: Automatic speech recognition system dedicated for Polish. In: Proceedings of Interspeech, Florence, 2011.

Ziółko M., Ziółko B., Skurzok D.: Ortfon2 - tool for orthographic to phonetic transcription, 2015. 7th Language and Technology Conference, Poznań.


