Microsoft VALL-E simulates a real person's voice from a voice sample of just three seconds
Microsoft researchers have announced VALL-E, a text-to-speech AI model that can simulate a real person's voice based on a voice sample of just three seconds. Preserving the speaker's characteristic tone, it can read out any text as if the person themselves were speaking. Its creators envision it as an advanced text-reading and editing application, potentially in combination with other generative AI models such as the text-generating GPT-3.
The Redmond company describes VALL-E as a neural codec language model, built on EnCodec, the neural audio compression network Meta announced last year. Unlike conventional text-to-speech systems that synthesize waveforms directly, VALL-E generates discrete audio codec codes from the input text and a short sample of the target voice.
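EnCodec itself is open source, so the tokenization step VALL-E builds on can be tried directly. The sketch below, assuming Meta's encodec Python package and a placeholder file name ("prompt.wav"), turns a short voice sample into the grid of discrete codes that a model like VALL-E operates on; how Microsoft wires the codec into VALL-E internally is not public, so this illustrates only the codec stage.

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec model; a 6 kbps target selects 8 codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "prompt.wav" is a placeholder for a roughly 3-second voice sample.
wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode the waveform into discrete acoustic tokens.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # [1, 8, T]: 8 codebooks, 75 frames per second

Three seconds of audio thus become roughly 8 x 225 integers, a representation small enough that predicting it can be treated as a language-modeling problem rather than waveform synthesis.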
VALL-E essentially analyzes a given person's speech and, using EnCodec, breaks it down into discrete components, "acoustic tokens", from which the final waveform is generated. Besides mimicking the speaker's timbre, it can also reproduce the "acoustic environment" of the sample: if the sample comes from a phone call, for example, the output will carry the acoustic and frequency characteristics of a phone call.
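Conceptually, generation then reduces to next-token prediction over these acoustic tokens, conditioned on the target text (as phoneme tokens) and the tokens of the three-second prompt. The sketch below is a schematic illustration under that assumption, not Microsoft's published architecture (which combines an autoregressive and a non-autoregressive stage); every name and size in it is invented.

import torch
import torch.nn as nn

class AcousticTokenLM(nn.Module):
    # Schematic decoder-only Transformer over acoustic tokens.
    # Positional encodings and the separate phoneme/audio embeddings
    # are omitted for brevity; this is an illustration, not VALL-E.
    def __init__(self, vocab_size=1024, dim=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        return self.head(self.backbone(x, mask=causal))

@torch.no_grad()
def continue_prompt(model, conditioning, n_steps=225):
    # Autoregressively extend the conditioning prefix (phoneme tokens
    # followed by the prompt's acoustic tokens); 225 steps is about
    # 3 seconds of audio at EnCodec's 75 Hz frame rate.
    seq = conditioning
    for _ in range(n_steps):
        logits = model(seq)[:, -1]
        next_token = torch.multinomial(logits.softmax(dim=-1), 1)
        seq = torch.cat([seq, next_token], dim=1)
    return seq[:, conditioning.size(1):]

The newly generated token grid is then passed back through EnCodec's decoder to produce the final waveform, which is why the output inherits not just the voice but also the recording conditions of the prompt.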
The Redmond researchers worked with LibriLight, an audio library provided by Meta that contains more than 60,000 hours of English speech from more than 7,000 speakers. For VALL-E to produce high-quality, realistic output, the target voice must closely resemble one of the voices in the training data, so the researchers plan to expand the dataset with additional recordings in the future.
Because of the potential for abuse, Microsoft is not making VALL-E's code or a public demo available to third parties at this time. According to its announcement, the company will continue to follow its own guidelines for AI-related development, and a separate model is being prepared to determine whether a given sound clip was created with the help of VALL-E. On the project's GitHub page you can already listen to samples produced by the model: they are not yet perfect, and some clips sound robotic, but there are also eerily realistic results.
by: Dömös Zsuzsanna