Devin Coldewey reports via TechCrunch: Amazon researchers have trained the largest text-to-speech model ever, which they claim exhibits “emergent” qualities that improve its ability to speak even complex sentences naturally. The breakthrough could be just what the technology needs to escape the uncanny valley. These models were always going to grow and improve, but the researchers specifically hoped to see the kind of jump in ability that was observed once language models grew past a certain size. For reasons that are not well understood, once LLMs grow past a certain point they become far more robust and versatile, able to perform tasks they were not trained to do. That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks follows a hockey-stick curve. The team at Amazon AGI (no secret what they are aiming at) thought the same might happen as text-to-speech models grow, and their research suggests that this is indeed the case.
The new model is called Big Adaptive Streamable TTS with Emergent abilities, which the team has contorted into the abbreviation BASE TTS. The largest version of the model was trained on 100,000 hours of public domain audio, 90% of which is in English and the rest in German, Dutch, and Spanish. At 980 million parameters, BASE-large appears to be the largest model in this category. For comparison, they also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio, respectively. The idea is that if one of these models exhibits emergent behaviors and another doesn't, that gives a range for where those behaviors begin to appear. As it turned out, the medium-sized model showed the jump in capability the team was looking for. The jump was not necessarily in ordinary speech quality (which reviewers rated higher, but only by a few points), but in a set of emergent abilities that the team observed and measured. Here are some examples of tricky text mentioned in the paper:
– Compound nouns: The Beckhams decided to rent a quaint country cottage with a charming stone structure.
– Emotions: “Oh my god! Are we really going to the Maldives? I can't believe it!” Jenny squealed, bouncing on her toes in unbridled glee.
– Foreign words: Henry, famous for his mise en place, has designed a seven-course meal, each dish a pièce de résistance.
– Paralinguistics (i.e., readable non-words): “Shh, Lucy, shh, don't wake up your baby brother,” Tom whispered as he tiptoed past the nursery.
– Punctuation: She received a strange email from her brother that said, “There's an emergency at home. Please call us as soon as possible! Mom and dad are worried… #familyissues.”
– Questions: But questions about Brexit remain. After all the trials and tribulations, will the ministers be able to find the answer in time?
– Syntactic complexity: De Moya, who recently won a Lifetime Achievement Award, starred in the 2022 film, which received mixed reviews but was a huge hit at the box office.
The researchers have published more examples of these difficult sentences being spoken naturally by the model.