Learn about Microsoft’s latest research in artificial intelligence for text-to-speech

by time news

Microsoft has shown off its latest research into text-to-speech artificial intelligence: a model called VALL-E that can simulate a person’s voice from just a three-second audio sample, Ars Technica reports.


The resulting speech can match not only the speaker’s timbre but also their emotional tone, and even the acoustics of a room. It could one day power custom or high-end text-to-speech applications, though, like deepfakes, it carries a risk of abuse.

VALL-E is what Microsoft calls a “neural codec language model.” It builds on EnCodec, Meta’s AI-powered audio compression codec, and generates audio from text input together with a short sample from the target speaker.
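The core idea behind a neural codec language model is that audio is turned into discrete tokens, which a language model can then predict the way it would predict words. The sketch below is a toy illustration of that idea only: EnCodec actually uses a learned residual vector quantizer, while here a simple uniform quantizer (a hypothetical `quantize`/`dequantize` pair, not from the paper) stands in.

```python
# Toy illustration of the codec idea behind VALL-E: audio samples become
# discrete codes ("tokens"), and decoding maps tokens back to audio.
# This uniform quantizer is a stand-in for EnCodec's learned codec.

def quantize(samples, levels=256):
    """Map each sample in [-1.0, 1.0] to a discrete code in [0, levels - 1]."""
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))            # clamp to the valid range
        codes.append(round((s + 1.0) / 2.0 * (levels - 1)))
    return codes

def dequantize(codes, levels=256):
    """Invert the mapping: codes back to approximate audio samples."""
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]

wave = [0.0, 0.5, -0.5, 1.0]
codes = quantize(wave)        # discrete tokens a language model could predict
approx = dequantize(codes)    # tokens decoded back to (approximate) audio
```

Once audio lives in this token space, “speaking” a sentence reduces to predicting the next token, which is exactly the framing that lets a language-model architecture generate speech.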


In a paper, the researchers describe how they trained VALL-E on 60,000 hours of English speech from more than 7,000 speakers in Meta’s LibriLight audio library. VALL-E uses what it learned from that training data to infer how the target speaker would sound reading the requested text.
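That inference step can be pictured as conditional generation: the model emits acoustic tokens conditioned on both the text and a short acoustic prompt carrying the speaker’s identity. The sketch below is purely illustrative and not the paper’s method; a deterministic toy function (`synthesize`, a hypothetical name) stands in for the trained autoregressive network.

```python
# Toy illustration of zero-shot conditioning: output tokens depend on both
# the text and the speaker prompt, so the same text spoken from different
# prompts yields different "voices". A trivial formula replaces the real model.

def synthesize(phonemes, prompt_codes, length=8, vocab=256):
    """Emit `length` acoustic tokens conditioned on text plus a speaker prompt."""
    text_id = sum(ord(c) for c in phonemes) % vocab       # crude text summary
    speaker_id = sum(prompt_codes) % vocab                # crude speaker identity
    out = []
    for i in range(length):
        # Each step mixes text content with speaker identity deterministically.
        out.append((text_id + speaker_id * (i + 1)) % vocab)
    return out

same_text_speaker_a = synthesize("hello", [1, 2, 3])
same_text_speaker_b = synthesize("hello", [9, 9, 9])
```

The point of the toy: the two outputs differ even though the text is identical, mirroring how VALL-E carries the prompt speaker’s identity into the generated speech.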


The team demonstrates how this works on the VALL-E GitHub page. For each phrase they want the AI to “speak,” they provide a three-second prompt from the speaker to be imitated, a “ground truth” recording of the same speaker saying a different phrase for comparison, a “baseline” from conventional text-to-speech synthesis, and finally a VALL-E sample.


The results are mixed: some sound machine-like, others surprisingly realistic. VALL-E retains the emotional tone of the original samples and faithfully matches the acoustic environment, so if a speaker recorded their voice in an echoey hall, the VALL-E output also sounds as if it came from the same place.


To improve the model, Microsoft plans to scale up its training data to improve performance across prosody, speaking style, and speaker similarity. It is also exploring ways to reduce words that come out unclear or are missed entirely.


Microsoft has chosen not to open-source the code, perhaps because of the inherent risks of an artificial intelligence that can put words in someone’s mouth.


It added that it would follow its “Microsoft AI Principles” in any further development. “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the company wrote in the “Broader Impacts” section of its conclusion.
