Pingchuan Ma 1      Stavros Petridis 1,2      Maja Pantic 1,2

1 Imperial College London      2 Meta AI      [Paper] [Code] [Model]


Title: Visual Speech Recognition for Multiple Languages in the Wild

Abstract: Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. Here we demonstrate that designing better models is equally as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.


Published: 24 October 2022

Visual speech recognition for multiple languages in the wild

  • Pingchuan Ma   ORCID: orcid.org/0000-0003-3752-0803 1 ,
  • Stavros Petridis 1 , 2 &
  • Maja Pantic 1 , 2  

Nature Machine Intelligence volume 4, pages 930–939 (2022)


Subjects:

  • Computational biology and bioinformatics
  • Computer science
  • Human behaviour

A preprint version of the article is available at arXiv.



Data availability

The datasets used in the current study are available from the original authors on the LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html), LRS3 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html), CMLR (https://www.vipazoo.cn/CMLR.html), Multilingual TEDx (http://www.openslr.org/100) and CMU-MOSEAS (http://immortal.multicomp.cs.cmu.edu/cache/multilingual) repositories. Qualitative results and the list of cleaned videos for the training and test sets of CMU-MOSEAS and Multilingual TEDx are available on the authors’ GitHub repository (https://mpc001.github.io/lipreader.html).

Code availability

Pre-trained networks and testing code are available on a GitHub repository (https://mpc001.github.io/lipreader.html) or at Zenodo (ref. 66) under an Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence.

Potamianos, G., Neti, C., Gravier, G., Garg, A. & Senior, A. W. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91 , 1306–1326 (2003).


Dupont, S. & Luettin, J. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2 , 141–151 (2000).

Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Lip reading sentences in the wild. In Proc. 30th IEEE / CVF Conference on Computer Vision and Pattern Recognition 3444–3453 (IEEE, 2017).

Afouras, T., Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018); https://doi.org/10.1109/TPAMI.2018.2889052

Shillingford, B. et al. Large-scale visual speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 4135–4139 (ISCA, 2019).

Serdyuk, D., Braga, O. & Siohan, O. Audio-visual speech recognition is worth 32 × 32 × 8 voxels. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop 796–802 (IEEE, 2021).

Zhang, X. et al. Understanding pictograph with facial features: end-to-end sentence-level lip reading of Chinese. In Proc. 33rd AAAI Conference on Artificial Intelligence 9211–9218 (AAAI, 2019).

Zhao, Y., Xu, R. & Song, M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. In Proc. 1st ACM International Conference on Multimedia in Asia 1–6 (ACM, 2019).

Ma, S., Wang, S. & Lin, X. A transformer-based model for sentence-level Chinese Mandarin lipreading. In Proc. 5th IEEE International Conference on Data Science in Cyberspace 78–81 (IEEE, 2020).

Ma, P., Petridis, S. & Pantic, M. End-to-end audio-visual speech recognition with conformers. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 7613–7617 (IEEE, 2021).

Gulati, A. et al. Conformer: convolution-augmented transformer for speech recognition. In Proc. 21st Annual Conference of International Speech Communication Association 5036–5040 (ISCA, 2020).

Makino, T. et al. Recurrent neural network transducer for audio-visual speech recognition. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop 905–912 (IEEE, 2019).

McGurk, H. & MacDonald, J. Hearing lips and seeing voices. Nature 264 , 746–748 (1976).

Sumby, W. H. & Pollack, I. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26 , 212–215 (1954).

Petridis, S., Stafylakis, T., Ma, P., Tzimiropoulos, G. & Pantic, M. Audio-visual speech recognition with a hybrid CTC/attention architecture. In Proc. IEEE Spoken Language Technology Workshop 513–520 (IEEE, 2018).

Yu, J. et al. Audio-visual recognition of overlapped speech for the LRS2 dataset. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 6984–6988 (IEEE, 2020).

Yu, W., Zeiler, S. & Kolossa, D. Fusing information streams in end-to-end audio-visual speech recognition. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 3430–3434 (IEEE, 2021).

Sterpu, G., Saam, C. & Harte, N. How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE / ACM Trans. Audio Speech Language Process. 28 , 1052–1064 (2020).


Afouras, T., Chung, J. S. & Zisserman, A. The conversation: deep audio-visual speech enhancement. In Proc. 19th Annual Conference of International Speech Communication Association 3244–3248 (ISCA, 2018).

Ephrat, A. et al. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. Graph. 37 , 112:1–112:11 (2018).

Yoshimura, T., Hayashi, T., Takeda, K. & Watanabe, S. End-to-end automatic speech recognition integrated with CTC-based voice activity detection. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 6999–7003 (IEEE, 2020).

Kim, Y. J. et al. Look who’s talking: active speaker detection in the wild. In Proc. 22nd Annual Conference of International Speech Communication Association 3675–3679 (ISCA, 2021).

Chung, J. S., Huh, J., Nagrani, A., Afouras, T. & Zisserman, A. Spot the conversation: speaker diarisation in the wild. In Proc. 21st Annual Conference of International Speech Communication Association 299–303 (ISCA, 2020).

Denby, B. et al. Silent speech interfaces. Speech Commun. 52 , 270–287 (2010).

Haliassos, A., Vougioukas, K., Petridis, S. & Pantic, M. Lips don’t lie: a generalisable and robust approach to face forgery detection. In Proc. 34th IEEE / CVF Conference on Computer Vision and Pattern Recognition 5039–5049 (IEEE, 2021).

Mira, R. et al. End-to-end video-to-speech synthesis using generative adversarial networks. IEEE Trans. Cybern. 1–13 (2022).

Prajwal, K., Mukhopadhyay, R., Namboodiri, V. P. & Jawahar, C. Learning individual speaking styles for accurate lip to speech synthesis. In Proc. 33rd IEEE / CVF Conference on Computer Vision and Pattern Recognition 13796–13805 (IEEE, 2020).

Dungan, L., Karaali, A. & Harte, N. The impact of reduced video quality on visual speech recognition. In Proc. 25th IEEE International Conference on Image Processing 2560–2564 (IEEE, 2018).

Bear, H. L., Harvey, R., Theobald, B.-J. & Lan, Y. Resolution limits on visual speech recognition. In Proc. 21st IEEE International Conference on Image Processing 1371–1375 (IEEE, 2014).

Geirhos, R. et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. 7th International Conference on Learning Representations (OpenReview, 2019).

Cheng, S. et al. Towards pose-invariant lip-reading. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 4357–4361 (IEEE, 2020).

Wand, M. & Schmidhuber, J. Improving speaker-independent lipreading with domain-adversarial training. In Proc. 18th Annual Conference of International Speech Communication Association 3662–3666 (ISCA, 2017).

Petridis, S., Wang, Y., Li, Z. & Pantic, M. End-to-end multi-view lipreading. In Proc. 28th British Machine Vision Conference (BMVA, 2017); https://doi.org/10.5244/C.31.161

Bicevskis, K. et al. Effects of mouthing and interlocutor presence on movements of visible vs. non-visible articulators. Can. Acoust. 44 , 17–24 (2016).

Šimko, J., Beňuš, Š. & Vainio, M. Hyperarticulation in Lombard speech: global coordination of the jaw, lips and the tongue. J. Acoust. Soc. Am. 139 , 151–162 (2016).

Ma, P., Petridis, S. & Pantic, M. Investigating the Lombard effect influence on end-to-end audio-visual speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 4090–4094 (ISCA, 2019).

Petridis, S., Shen, J., Cetin, D. & Pantic, M. Visual-only recognition of normal, whispered and silent speech. In Proc. 43rd IEEE International Conference on Acoustics , Speech and Signal Processing 6219–6223 (IEEE, 2018).

Heracleous, P., Ishi, C. T., Sato, M., Ishiguro, H. & Hagita, N. Analysis of the visual Lombard effect and automatic recognition experiments. Comput. Speech Language 27 , 288–300 (2013).

Efforts to acknowledge the risks of new A.I. technology. New York Times (22 October 2018); https://www.nytimes.com/2018/10/22/business/efforts-to-acknowledge-the-risks-of-new-ai-technology.html

Feathers, T. Tech Companies Are Training AI to Read Your Lips https://www.vice.com/en/article/bvzvdw/tech-companies-are-training-ai-to-read-your-lips (2021).

Liopa. https://liopa.ai . Accessed 24 November 2021.

Crawford, S. Facial recognition laws are (literally) all over the map. Wired (16 December 2019); https://www.wired.com/story/facial-recognition-laws-are-literally-all-over-the-map/

Flynn, S. 13 cities where police are banned from using facial recognition tech. Innovation & Tech Today (18 November 2020); https://innotechtoday.com/13-cities-where-police-are-banned-from-using-facial-recognition-tech/

An update on our use of face recognition. FaceBook (2 November 2021); https://about.fb.com/news/2021/11/update-on-use-of-face-recognition/

Metz, R. Amazon will block police indefinitely from using its facial-recognition software. CNN (18 May 2021); https://edition.cnn.com/2021/05/18/tech/amazon-police-facial-recognition-ban/index.html

Greene, J. Microsoft won’t sell police its facial-recognition technology, following similar moves by Amazon and IBM. Washington Post (11 June 2020) https://www.washingtonpost.com/technology/2020/06/11/microsoft-facial-recognition

Afouras, T., Chung, J. S. & Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. Preprint at https://arxiv.org/abs/1809.00496 (2018).

Zadeh, A. B. et al. CMU-MOSEAS: a multimodal language dataset for Spanish, Portuguese, German and French. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing 1801–1812 (ACL, 2020).

Salesky, E. et al. The multilingual TEDx corpus for speech recognition and translation. In Proc. 22nd Annual Conference of International Speech Communication Association 3655–3659 (ISCA, 2021).

Valk, J. & Alumäe, T. VoxLingua107: a dataset for spoken language recognition. In Proc. IEEE Spoken Language Technology Workshop 652–658 (IEEE, 2021).

Deng, J. et al. RetinaFace: single-stage dense face localisation in the wild. In Proc. 33rd IEEE / CVF Conference on Computer Vision and Pattern Recognition 5203–5212 (IEEE, 2020).

Bulat, A. & Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In Proc. 16th IEEE / CVF International Conference on Computer Vision 1021–1030 (IEEE, 2017).

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. 3rd International Conference on Learning Representations (OpenReview, 2015).

Assael, Y., Shillingford, B., Whiteson, S. & De Freitas, N. LipNet: end-to-end sentence-level lipreading. Preprint at https://arxiv.org/abs/1611.01599 (2016).

Ma, P., Martinez, B., Petridis, S. & Pantic, M. Towards practical lipreading with distilled and efficient models. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 7608–7612 (IEEE, 2021).

Park, D. S. et al. SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 2613–2617 (ISCA, 2019).

Liu, C. et al. Improving RNN transducer based ASR with auxiliary tasks. In Proc. IEEE Spoken Language Technology Workshop 172–179 (IEEE, 2021).

Toshniwal, S., Tang, H., Lu, L. & Livescu, K. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In Proc. 18th Annual Conference of International Speech Communication Association 3532–3536 (ISCA, 2017).

Lee, J. & Watanabe, S. Intermediate loss regularization for CTC-based speech recognition. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 6224–6228 (IEEE, 2021).

Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A. & Bengio, Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proc. 20th Annual Conference of International Speech Communication Association 161–165 (ISCA, 2019).

Shukla, A., Petridis, S. & Pantic, M. Learning speech representations from raw audio by joint audiovisual self-supervision. In Proc. 37th International Conference on Machine Learning Workshop (PMLR, 2020).

Ma, P., Mira, R., Petridis, S., Schuller, B. W. & Pantic, M. LiRA: learning visual speech representations from audio through self-supervision. In Proc. 22nd Annual Conference of International Speech Communication Association 3011–3015 (ISCA, 2021).

Serdyuk, D., Braga, O. & Siohan, O. Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video. In Proc. 23rd Annual Conference of International Speech Communication Association 2833–2837 (ISCA, 2022).

Watanabe, S. et al. ESPnet: End-to-end speech processing toolkit. In Proc. 19th Annual Conference of International Speech Communication Association 2207–2211 (ISCA, 2018).

Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. 2nd International Conference on Learning Representations (OpenReview, 2014).

Ma, P., Petridis, S. & Pantic, M. mpc001/Visual_Speech_Recognition_for_Multiple_Languages: visual speech recognition for multiple languages. Zenodo https://doi.org/10.5281/zenodo.7065080 (2022).

Afouras, T., Chung, J. S. & Zisserman, A. ASR is all you need: cross-modal distillation for lip reading. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 2143–2147 (IEEE, 2020).

Ren, S., Du, Y., Lv, J., Han, G. & He, S. Learning from the master: distilling cross-modal advanced knowledge for lip reading. In Proc. 34th IEEE / CVF Conference on Computer Vision and Pattern Recognition 13325–13333 (IEEE, 2021).

Zhao, Y. et al. Hearing lips: improving lip reading by distilling speech recognizers. In Proc. 34th AAAI Conference on Artificial Intelligence 6917–6924 (AAAI, 2020).


Acknowledgements

All training, testing and ablation studies were conducted at Imperial College London.

Author information

Authors and affiliations

Imperial College London, London, UK

Pingchuan Ma, Stavros Petridis & Maja Pantic

Meta AI, London, UK

Stavros Petridis & Maja Pantic


Contributions

The code was written by P.M., and the experiments were conducted by P.M. and S.P. The manuscript was written by P.M., S.P. and M.P. M.P. supervised the entire project.

Corresponding author

Correspondence to Pingchuan Ma .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Joon Son Chung, Olivier Siohan and Mingli Song for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary text, Fig. 1, Tables 1–28 and references.

Supplementary Video 1

A demo of visual speech recognition for multiple languages.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Ma, P., Petridis, S. & Pantic, M. Visual speech recognition for multiple languages in the wild. Nat Mach Intell 4 , 930–939 (2022). https://doi.org/10.1038/s42256-022-00550-z

Download citation

Received : 22 February 2022

Accepted : 13 September 2022

Published : 24 October 2022

Issue Date : November 2022

DOI : https://doi.org/10.1038/s42256-022-00550-z


This article is cited by

Continuous lipreading based on acoustic temporal alignments

  • David Gimeno-Gómez
  • Carlos-D. Martínez-Hinarejos

EURASIP Journal on Audio, Speech, and Music Processing (2024)

Audio-guided self-supervised learning for disentangled visual speech representations

  • Shuang Yang

Frontiers of Computer Science (2024)

Research of ReLU output device in ternary optical computer based on parallel fully connected layer

  • Huaqiong Ma

The Journal of Supercomputing (2024)

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Complex & Intelligent Systems (2024)

3D facial animation driven by speech-video dual-modal signals

  • Zhouzhou Liao


pyVSR - Python toolkit for Visual Speech Recognition

Global information

  • Repository: https://github.com/georgesterpu/pyVSR

Description

pyVSR is a Python toolkit aimed at running Visual Speech Recognition (VSR) experiments in a traditional framework (e.g. handcrafted visual features, Hidden Markov Models for pattern recognition).

The main goal of pyVSR is to easily reproduce VSR experiments in order to have a baseline result on most publicly available audio-visual datasets.

What can you do with pyVSR:

  • speaker-dependent protocol
  • speaker-independent protocol
  • single person
  • Automatic ROI extraction
  • Configurable window size
  • Fourth order accurate derivatives
  • Sample rate interpolation
  • Storage in HDF5 format
  • Do NOT require manually annotated landmarks
  • Face, lips, and chin models supported
  • Parameters obtainable either through fitting or projection
  • Implementation based on Menpo
  • OpenFace wrapper
  • easy HTK wrapper for Python
  • optional bigram language model
  • multi-threaded support (both for training and decoding at full CPU Power)
  • pyVSR has a simple, modular, object-oriented architecture


Deep Audio-Visual Speech Recognition

6 Sep 2018 · Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
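To make the two objectives mentioned in the abstract concrete, the snippet below is a minimal, hedged illustration of a CTC loss computed on per-frame encoder outputs versus a sequence-to-sequence (teacher-forced cross-entropy) loss computed on decoder outputs. All shapes, the vocabulary size and the random tensors are placeholders, not the paper's configuration.

```python
# Sketch: CTC objective on encoder outputs vs. seq2seq (cross-entropy) objective on decoder outputs.
import torch
import torch.nn as nn

T, N, V, U = 80, 2, 40, 12              # frames, batch, vocab size (blank = 0), target length
enc_logits = torch.randn(T, N, V)       # stand-in for per-frame transformer encoder outputs
dec_logits = torch.randn(N, U, V)       # stand-in for per-token transformer decoder outputs
targets = torch.randint(1, V, (N, U))   # dummy character targets (no blanks)

# CTC branch: marginalizes over all monotonic, blank-augmented alignments.
ctc_loss = nn.CTCLoss(blank=0)(
    enc_logits.log_softmax(-1), targets,
    input_lengths=torch.full((N,), T, dtype=torch.long),
    target_lengths=torch.full((N,), U, dtype=torch.long),
)

# Seq2seq branch: teacher-forced cross-entropy on the decoder predictions.
ce_loss = nn.CrossEntropyLoss()(dec_logits.reshape(-1, V), targets.reshape(-1))
```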


PyTorch implementation of "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring" (CVPR2023) and "Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition" (Interspeech 2022)

ms-dot-k/AVSR

Audio-Visual Speech Recognition (AVSR) - AVRelScore, VCAFE

This repository contains the PyTorch implementation of the following papers:

  • Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring (CVPR 2023) - AVRelScore. Joanna Hong*, Minsu Kim*, Jeongsoo Choi, and Yong Man Ro (*Equal contribution) [Paper] [Demo Video]
  • Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition (Interspeech 2022) - VCAFE. Joanna Hong*, Minsu Kim*, and Yong Man Ro (*Equal contribution) [Paper]


Requirements

  • pytorch 1.8 ~ 1.9
  • torchvision
  • tensorboard
  • scikit-image
  • opencv-python
  • albumentations

Preparation

Dataset Download

The LRS2/LRS3 datasets can be downloaded from the link below.

  • https://www.robots.ox.ac.uk/~vgg/data/lip_reading/

Landmark Download

For data preprocessing, download the landmarks for LRS2 and LRS3 from the repository (landmarks for the "VSR for multiple languages" models).

Occlusion Data Download

For visual corruption modeling, download coco_object.7z from the repository .

Unzip and put the files at

Babble Noise Download

For audio corruption modeling, download babble noise file from here .

put the file at

Pre-trained Frontends

For initializing visual frontend and audio frontend, please download the pre-trained models from the repository . (resnet18_dctcn_audio/resnet18_dctcn_video)

Put the .tar file at

Preprocessing

After downloading the dataset and landmarks, we 1) align and crop the lip-centered video, 2) extract the audio and 3) obtain the aligned landmarks. We assume the data directory is structured as

Run preprocessing with the following commands:

Training the Model

You can choose the model architecture with the architecture parameter; there are three options: AVRelScore, VCAFE and Conformer. To train the model, run the following command (an illustrative invocation is sketched after the parameter list below):

Descriptions of training parameters are as follows:

  • --data_path : Preprocessed Dataset location (LRS2 or LRS3)
  • --data_type : Choose to train on LRS2 or LRS3
  • --split_file : train and validation file lists (you can do curriculum learning by changing the split_file; 0_100.txt consists of files with between 0 and 100 frames, and training directly on 0_600.txt is also not too bad)
  • --checkpoint_dir : directory for saving checkpoints
  • --checkpoint : saved checkpoint where the training is resumed from
  • --model_conf : model_configuration
  • --wandb_project : if want to use wandb, please set the project name here.
  • --batch_size : batch size
  • --update_frequency : update frequency; if you use a small batch_size, increase update_frequency. The effective training batch size = batch_size * update_frequency
  • --epochs : number of epochs
  • --tot_iters : if set, the train is finished at the total iterations set
  • --eval_step : every step for performing evaluation
  • --fast_validate : if set, validation is performed for a subset of validation data
  • --visual_corruption : if set, we apply visual corruption modeling during training
  • --architecture : choose which architecture will be trained. (options: AVRelScore, VCAFE, Conformer)
  • --gpu : gpu number for training
  • --distributed : if set, distributed training is performed
  • Refer to train.py for the other training parameters
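As an illustration only, the flags documented above might be combined into a training run as follows. Every path, file name and numeric value here is a placeholder, not a setting prescribed by the authors; the invocation is wrapped in Python's subprocess for consistency with the other snippets on this page.

```python
# Hypothetical sketch of a train.py invocation using the documented flags; values are placeholders.
import subprocess

subprocess.run([
    "python", "train.py",
    "--data_path", "/path/to/preprocessed/LRS2",
    "--data_type", "LRS2",
    "--split_file", "0_100.txt",            # curriculum learning: start with short clips
    "--checkpoint_dir", "./checkpoints/avrelscore_lrs2",
    "--architecture", "AVRelScore",
    "--batch_size", "4",
    "--update_frequency", "8",              # effective batch size = 4 * 8
    "--epochs", "50",
    "--eval_step", "5000",
    "--visual_corruption",                  # apply visual corruption modeling during training
    "--gpu", "0",
], check=True)
```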

Check the training logs

TensorBoard shows the training and validation losses and the evaluation metrics. If you set wandb_project, you can also check the logs on wandb.

Testing the Model

To test the model, run the following command (an illustrative invocation is sketched after the parameter list below):

Descriptions of testing parameters are as follows:

  • --split_file : set to test.ref (./src/data/LRS2/test.ref or ./src/data/LRS3/test.ref)
  • --checkpoint : model for testing
  • --rnnlm : language model checkpoint
  • --rnnlm_conf : language model configuration
  • --beam_size : beam size
  • --ctc_weight : ctc weight for joint decoding
  • --lm_weight : language model weight for decoding
  • Refer to test.py for the other parameters
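Similarly, an illustrative test invocation using the documented flags might look like the following; the checkpoint and language-model paths are placeholders.

```python
# Hypothetical sketch of a test.py invocation; all paths are placeholders.
import subprocess

subprocess.run([
    "python", "test.py",
    "--split_file", "./src/data/LRS2/test.ref",
    "--checkpoint", "./checkpoints/avrelscore_lrs2/best.ckpt",  # placeholder checkpoint name
    "--rnnlm", "/path/to/language_model.pth",                   # placeholder LM checkpoint
    "--rnnlm_conf", "/path/to/lm.yaml",                         # placeholder LM configuration
    "--beam_size", "40",
    "--ctc_weight", "0.1",
    "--lm_weight", "0.5",
], check=True)
```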

Pre-trained model checkpoints

We release the pre-trained AVSR models (VCAFE and AVRelScore) trained on the LRS2 and LRS3 databases. (The WERs below can be obtained with beam_size: 40, ctc_weight: 0.1 and lm_weight: 0.5.)
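For context, the ctc_weight and lm_weight parameters above control the usual joint CTC/attention decoding score with language-model shallow fusion. The formulation below follows the standard hybrid CTC/attention recipe and is an illustration, not code copied from this repository.

```python
# Sketch: combined hypothesis score used during joint CTC/attention decoding with an external LM.
def joint_score(logp_att: float, logp_ctc: float, logp_lm: float,
                ctc_weight: float = 0.1, lm_weight: float = 0.5) -> float:
    return (1.0 - ctc_weight) * logp_att + ctc_weight * logp_ctc + lm_weight * logp_lm
```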

You can find the pre-trained Language Model in the following repository . Put the language model at

Testing under Audio-Visual Noise Condition

Please refer to the following repository for making the audio-visual corrupted dataset.

Acknowledgment

The code is based on the following two repositories: ESPnet and VSR for Multiple Languages.

If you find this work useful in your research, please cite the papers:



October 10, 2023

Real-time Audio-visual Speech Recognition

by Team PyTorch

Audio-Visual Speech Recognition (AV-ASR, or AVSR) is the task of transcribing text from audio and visual streams, which has recently attracted a lot of research attention due to its robustness to noise. The vast majority of work to date has focused on developing AV-ASR models for non-streaming recognition; studies on streaming AV-ASR are very limited.

We have developed a compact real-time speech recognition system based on TorchAudio, a library for audio and signal processing with PyTorch . It can run locally on a laptop with high accuracy without accessing the cloud. Today, we are releasing the real-time AV-ASR recipe under a permissive open license (BSD-2-Clause license), enabling a broad set of applications and fostering further research on audio-visual models for speech recognition.

This work is part of our approach to AV-ASR research . A promising aspect of this approach is its ability to automatically annotate large-scale audio-visual datasets, which enables the training of more accurate and robust speech recognition systems. Furthermore, this technology has the potential to run on smart devices since it achieves the latency and memory efficiency that such devices require for inference.

In the future, speech recognition systems are expected to power applications in numerous domains. One of the primary applications of AV-ASR is to enhance the performance of ASR in noisy environments. Since visual streams are not affected by acoustic noise, integrating them into an audio-visual speech recognition model can compensate for the performance drop of ASR models. Our AV-ASR system has the potential to serve multiple purposes beyond speech recognition, such as text summarization, translation and even text-to-speech conversion. Moreover, the exclusive use of VSR can be useful in certain scenarios, e.g. where speaking is not allowed, in meetings, and where privacy in public conversations is desired.

Fig. 1: The pipeline for the audio-visual speech recognition system

Our real-time AV-ASR system is presented in Fig. 1. It consists of three components: a data collection module, a pre-processing module and an end-to-end model. The data collection module comprises hardware devices, such as a microphone and camera, and its role is to collect information from the real world. Once the information is collected, the pre-processing module locates and crops out the face. Next, we feed the raw audio stream and the pre-processed video stream into our end-to-end model for inference.

Data collection

We use torchaudio.io.StreamReader to capture audio/video from a streaming device input, e.g. the microphone and camera on a laptop; a minimal capture sketch is shown below. Once the raw video and audio streams are collected, the pre-processing module locates and crops faces. It should be noted that data is immediately deleted during the streaming process.
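The following is a minimal sketch (not the released recipe) of device capture with torchaudio.io.StreamReader. The source string and container format are platform-dependent assumptions (here, "0:0" with the avfoundation backend for a macOS webcam and microphone), and the chunk sizes are illustrative.

```python
# Sketch: capture roughly one-second, synchronized audio/video chunks from a laptop device.
# The src/format values below are assumptions and depend on the OS and installed FFmpeg backends.
from torchaudio.io import StreamReader

streamer = StreamReader(src="0:0", format="avfoundation")  # e.g. first camera + first microphone on macOS
streamer.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)  # ~1 s of 16 kHz audio
streamer.add_basic_video_stream(frames_per_chunk=25, frame_rate=25,
                                width=340, height=340, format="rgb24")      # ~1 s of 25 fps video

for audio_chunk, video_chunk in streamer.stream(timeout=-1, backoff=1.0):
    # audio_chunk: (frames, channels) float tensor; video_chunk: (frames, 3, H, W) uint8 tensor
    pass  # hand the chunks to the pre-processing module and the model
```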

Pre-processing

Before feeding the raw stream into our model, each video sequence has to undergo a specific pre-processing procedure. This involves three critical steps. The first step is to perform face detection. Following that, each individual frame is aligned to a referenced frame, commonly known as the mean face, in order to normalize rotation and size differences across frames. The final step in the pre-processing module is to crop the face region from the aligned face image. We would like to clearly note that our model is fed with raw audio waveforms and pixels of the face, without any further preprocessing like face parsing or landmark detection. An example of the pre-processing procedure is illustrated in Table 1.

Table 1: Preprocessing pipeline.
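As a rough illustration of the three steps above (detect, align to a reference, crop), the snippet below warps each frame so that the detected face box lands on a fixed reference box and returns the resulting crop. The detect_face_box function and the reference geometry are hypothetical placeholders, not the released pipeline (the RTF measurements later in this post use the Ultra-Lightweight Face Detection Slim 320 model for the detection step).

```python
# Sketch: align a frame to a reference face position using only a face bounding box, then crop.
import cv2
import numpy as np

REF_CORNERS = np.float32([[16, 16], [112, 16], [112, 112]])  # assumed reference box corners (TL, TR, BR)

def preprocess_frame(frame, detect_face_box, out_size=128):
    """detect_face_box is a hypothetical detector returning (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = detect_face_box(frame)
    src = np.float32([[x1, y1], [x2, y1], [x2, y2]])      # matching corners of the detected box
    M = cv2.getAffineTransform(src, REF_CORNERS)          # warp the detected box onto the reference box
    return cv2.warpAffine(frame, M, (out_size, out_size)) # size-normalized face crop
```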

Fig. 2: The architecture for the audio-visual speech recognition system

We consider two configurations: Small, with 12 Emformer blocks and 34.9M parameters, and Large, with 28 blocks and 383.3M parameters. Each AV-ASR model is composed of front-end encoders, a fusion module, an Emformer encoder and a transducer model. To be specific, we use convolutional front-ends to extract features from the raw audio waveforms and facial images. The features are concatenated to form 1024-d features, which are then passed through a two-layer multilayer perceptron and an Emformer transducer model; a minimal sketch of the fusion step is shown below. The entire network is trained using the RNN-T loss. The architecture of the proposed AV-ASR model is illustrated in Fig. 2.
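The fusion step described above can be sketched as follows. The 1024-d concatenation comes from the text; the per-modality feature sizes, hidden width and output dimension are assumptions for illustration only.

```python
# Sketch: concatenate per-frame audio and visual features and fuse them with a two-layer MLP
# before the Emformer transducer. Dimensions other than the 1024-d concatenation are assumed.
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, audio_dim=512, video_dim=512, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 1024),  # 1024-d concatenated features
            nn.ReLU(),
            nn.Linear(1024, out_dim),                # fed to the Emformer transducer
        )

    def forward(self, audio_feats, video_feats):
        # both inputs: (batch, time, dim), assumed to be frame-synchronized
        return self.mlp(torch.cat([audio_feats, video_feats], dim=-1))
```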

Datasets. We follow Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels and use publicly available audio-visual datasets, including LRS3, VoxCeleb2 and AVSpeech, for training. We do not use mouth ROIs, facial landmarks or attributes at either the training or testing stage.

Comparisons with the state of the art. Non-streaming evaluation results on LRS3 are presented in Table 2. Our audio-visual model, with an algorithmic latency of 800 ms (160 ms + 1,280 ms × 0.5), yields a WER of 1.3%, which is on par with those achieved by state-of-the-art offline models such as AV-HuBERT, RAVEn and Auto-AVSR.

Table 2: Non-streaming evaluation results for audio-visual models on the LRS3 dataset.

Noisy experiments. During training, 16 different noise types are randomly injected into the audio waveforms: 13 types from the DEMAND database ('DLIVING', 'DKITCHEN', 'OMEETING', 'OOFFICE', 'PCAFETER', 'PRESTO', 'PSTATION', 'STRAFFIC', 'SPSQUARE', 'SCAFE', 'TMETRO', 'TBUS' and 'TCAR'), two further types from the Speech Commands database (white and pink noise) and one more from the NOISEX-92 database (babble noise). SNR levels are drawn uniformly from [clean, 7.5 dB, 2.5 dB, -2.5 dB, -7.5 dB]; a sketch of this noise-injection step follows Table 3. Results of the ASR and AV-ASR models when tested with babble noise are shown in Table 3. With increasing noise level, the performance advantage of our audio-visual model over our audio-only model grows, indicating that incorporating visual data improves noise robustness.

Table 3: Streaming evaluation WER (%) results at various signal-to-noise ratios for our audio-only (A) and audio-visual (A+V) models on the LRS3 dataset under 0.80-second latency constraints.
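A sketch of the SNR-controlled noise injection described above: the noise waveform is scaled so that mixing it with the clean audio gives a target signal-to-noise ratio drawn uniformly from the listed levels. This is an illustration of the procedure, not the training code.

```python
# Sketch: mix a noise waveform into clean audio at a randomly chosen SNR level.
import random
import torch

SNR_LEVELS_DB = [None, 7.5, 2.5, -2.5, -7.5]  # None = keep the clean waveform

def add_noise(clean: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    snr_db = random.choice(SNR_LEVELS_DB)
    if snr_db is None:
        return clean
    noise = noise[: clean.numel()]                       # assume the noise clip is long enough
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```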

Real-time factor. The real-time factor (RTF) is an important measure of a system’s ability to process real-time tasks efficiently. An RTF value of less than 1 indicates that the system meets real-time requirements. We measure RTF using a laptop with an Intel® Core™ i7-12700 CPU running at 2.70 GHz and an NVIDIA GeForce RTX 3070 Ti GPU. To the best of our knowledge, this is the first AV-ASR model that reports RTFs on the LRS3 benchmark. The Small model achieves a WER of 2.6% and an RTF of 0.87 on CPU (Table 4), demonstrating its potential for real-time on-device inference applications.

Table 4: Impact of AV-ASR model size and device on WER and RTF. Note that the RTF calculation includes the pre-processing step, wherein the Ultra-Lightweight Face Detection Slim 320 model is used to generate face bounding boxes.
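For reference, RTF as used above is simply the processing time divided by the duration of the processed input; values below 1 mean the system keeps up with real time. A trivial helper for illustration:

```python
# Sketch: real-time factor = processing time / input duration (RTF < 1 means real-time capable).
def real_time_factor(processing_seconds: float, input_seconds: float) -> float:
    return processing_seconds / input_seconds

# e.g. processing a 10 s clip in 8.7 s gives RTF = 0.87
```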

Learn more about the system from the published works below:

  • Shi, Yangyang, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition.” In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783-6787. IEEE, 2021.
  • Ma, Pingchuan, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. “Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels.” In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.

