Universal Speech Model (USM) – A High-End Speech Model by Google AI

Google researchers recently released an update to their Universal Speech Model (USM), which performs automatic speech recognition in more than 100 languages and is a step towards supporting 1,000 languages. According to the researchers, the model outperforms OpenAI's Whisper on automatic speech recognition. It is also used to produce better YouTube captions.

In a breakthrough for automatic speech recognition, researchers from Google have developed a model that can recognize under-represented languages with impressive accuracy. The paper, titled “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages”, shows that a large unlabelled multilingual dataset can be used to pre-train the model's encoder, which is then fine-tuned on a smaller set of labelled data to enable recognition of under-represented languages.

The researchers demonstrated the effectiveness of the pre-trained encoder by fine-tuning it on multilingual speech data from YouTube Captions. Despite the limited amount of supervised data available, the model achieved a word error rate (WER) of less than 30% on average across 73 languages, a milestone that, according to the researchers, had never been achieved before. USM was also compared with Whisper (large-v2), which was trained on more than 400,000 hours of labelled data: on the 18 languages used for the comparison, USM had, on average, a 32.7% relative lower WER.
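
For readers unfamiliar with the metrics, here is a minimal Python sketch of how a word error rate and a "relative lower WER" figure are typically computed. It is not the researchers' evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the text is split on whitespace with no normalization, and the sample sentences and numbers are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (Levenshtein, word level).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcript pair: one substitution and one deletion.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")

    # Made-up WER values, not figures from the paper.
    print(f"Relative reduction: {relative_wer_reduction(0.30, 0.20):.1%}")
```

Note that a relative reduction is a ratio of error rates, not a difference in percentage points: in the made-up example, going from 30% to 20% WER is a 10-point drop but a relative reduction of about one third.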

The development of the USM model is a significant step towards the 1,000 Languages Initiative, which aims to build a machine learning model that can support the world’s thousand most-spoken languages for better global inclusivity. The initiative faces the challenge of supporting languages with few speakers or limited available data, and the USM model’s success in recognizing under-represented languages is a crucial breakthrough in addressing this challenge.
