The voice and speech synthesis industry is growing rapidly, and the market is already worth more than $1 billion. The applications are wide and obvious: voice assistants, talking robots, radio and television programs, dubbing books and films with the voices of famous people, restoring the voices of people who have died or lost the ability to speak, and so on. Over the past two years, during the pandemic, development in this area has noticeably intensified.
Voice synthesis technologies have existed for a long time, but until the early 2010s synthesized voices sounded mechanical. As technology and AI advanced, it became possible to literally break the human voice down “into atoms”, capture all its characteristics and nuances, and create a voice that belongs to no real person yet sounds entirely human, as well as to synthesize the voices of specific people.
Specialists in the synthesis (cloning) of the human voice explain that teaching a computer to speak like a person is not at all easy, because the human voice has many different characteristics. “To analyze the human voice, you need to know a lot about acoustics and the principles of speech sound, and you need to understand the physiological aspects,” explains Klaus Scherer, emeritus professor of emotion psychology at the University of Geneva. “So this process always involves different disciplines, and it is demanding in the sense that you have to master a great deal in order to achieve something worthwhile.”
When cloning the voice of a particular person, specialists start with samples of that person's speech.
To clone the voice of a living person, the speaker is asked to read aloud a large number of very different texts, chosen so that the reading brings out different emotions, changes of intonation, pauses and so on.
In total, about an hour of such reading is recorded, of which 10–15 minutes are used for the cloning process.
These recordings are fed into a neural network, which then generates a voice that takes all of these nuances into account. The whole process takes less than a week. The output is a voice that is almost indistinguishable from the original and can pronounce any text entered into the program. This means the resulting voice can be used for reading audiobooks, presenting news and announcements, an alarm clock app that wakes a person with this voice, voicing video games and any text content, and much more.
If the voice of a deceased person is being cloned, the procedure is the same, except that existing recordings are used. For example, the cloned voice of the famous American chef, writer and TV presenter Anthony Bourdain, who died by suicide in 2018, was used in Roadrunner, a documentary about him released last summer. To recreate Bourdain’s voice, director Morgan Neville collected tens of thousands of hours of video and audio recordings. Based on this material, the chef’s voice was recreated and used for several phrases in the film.
The premiere drew a mixed reaction: some considered it unethical to use Bourdain’s voice to say things he never said during his lifetime.
However, although Anthony Bourdain never spoke these phrases aloud, he did write them.
Recent industry successes include giving renowned actor Val Kilmer his voice back. In 2015 the actor was diagnosed with throat cancer, and after two years of chemotherapy and a tracheostomy he had almost lost his voice. Last summer, Sonantic recreated the actor’s voice using AI technology. In this voice he spoke about his illness and its consequences, and said that despite the loss he remains the same creative person, constantly coming up with new things and full of ideas. “Now I can express myself again,” says Val Kilmer in the video. “I can show you my dreams and rediscover this part of me. The part that didn’t really go anywhere – it just hid.”
According to experts, voice cloning could help directors with film dubbing, for example, and save time for actors, who would no longer have to spend long hours in the studio. Cloning would also be useful when an actor has died during the making of a film or cannot complete the project, but permission to use the actor's voice is available.
Experts predict that a wide range of cloned-voice rental services will soon appear, through which famous people will be able to “rent out” their voices for use in content, giving celebrities another good source of income.
It will not be long before the average consumer can use this technology too. For example, an app could read a book to a child in the voice of the child's mother, father or grandmother. In video games, players will be able to give the characters their own voices.
At present, when cloning a voice for a specific purpose, specialists do not always use all the emotions and intonations collected for that voice. As Fati Yassa, founder and CEO of Speech Morphing, told NPR, “The choice depends on where the voice will be used. If in the field of banking, then this is one thing, but reading e-books is completely different, and all this is different from the voice with which a report is read or with which they communicate with the consumer.” When recreating a voice, according to Mr. Yassa, its tone can be made apologetic or cheerfully promotional, or it can be made to sound as if its owner were an actor on a theater stage. Admittedly, experts say, cloned voices have not yet learned to sing. But that is only for now.
Meanwhile, there are situations in which an overly human-sounding cloned voice is not wanted at all.
For example, if the voice is built into a voice assistant that helps an elderly person cope with loneliness, or reads an audiobook to a child, then the more natural it is, the better. But if a “smart” refrigerator suddenly speaks in such a human voice, the experience may be far from pleasant. “It’s better to use a more robotic voice,” says designer Amy Jimenez Marquez, who worked on Amazon’s Alexa voice assistant for four years. “For cases like this, you can just create a voice with some metallic sound, like a real robot. Still, such a voice is more suitable for the refrigerator.”
Given such a wide range of applications, the voice cloning market is growing quite rapidly. In 2018 it was estimated at $456 million; by 2020 it had doubled, and by 2028, according to various forecasts, it could reach almost $5 billion, with annual growth of 24–30%. Dozens of companies around the world already work on voice cloning, from large ones such as Google and IBM to small ones that specialize in this technology alone, such as Descript, Veritone and Respeecher.
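The growth figures above can be checked with a quick compound-growth calculation. The sketch below is only an illustration: the function name `project_market` is hypothetical, and it assumes the 24–30% annual growth applies to the 2020 figure (twice the 2018 estimate) over the eight years to 2028, which the forecasts themselves may not do.

```python
def project_market(start_musd, annual_growth, years):
    """Compound a starting market size (USD millions) at a fixed annual rate."""
    return start_musd * (1 + annual_growth) ** years


size_2018 = 456.0           # USD millions, 2018 estimate cited in the article
size_2020 = size_2018 * 2   # "doubled by 2020", per the article
years = 2028 - 2020         # assumed projection horizon

low = project_market(size_2020, 0.24, years)   # lower growth rate, 24%
high = project_market(size_2020, 0.30, years)  # upper growth rate, 30%

print(f"2028 at 24% per year: ${low / 1000:.1f} billion")
print(f"2028 at 30% per year: ${high / 1000:.1f} billion")
```

Under these assumptions, the 24% rate lands at roughly $5 billion, in line with the forecast quoted above, while the 30% rate gives a noticeably higher figure. This suggests the different forecasts start from different baselines or horizons.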
Because of its wide range of applications and continuous improvement, voice cloning technology has recently been used more and more by scammers.
The first known use of a cloned voice to commit a crime occurred in March 2019.
Having generated the voice of the director of a German energy company, the scammers called the director of the company's British division and asked him to transfer $243,000 to a supposed Hungarian supplier. The transfer went through, and the money was moved first to Hungary, then to Mexico, and then to several other destinations. The scammers were never identified.
In January 2020, scammers used a similar method to withdraw $35 million from several organizations in the UAE and transfer it in installments to banks in several countries, including the United States. According to Forbes, the UAE authorities turned to the United States for help during the investigation. The details of both investigations have not been disclosed, and neither the names of the people involved nor those of the affected companies and banks have been made public.