“¿Cuáles son las creencias que tienes en la cabeza,” Diary of a CEO host Steven Bartlett asks Dr Joe Dispenza on a recent episode, “y que te da miedo compartir?” He’s asking his guest which beliefs he is afraid to share, but neither the host of the hit podcast nor his guest was speaking Spanish at the time. Rather, Spotify has translated the show using artificial intelligence to recreate Bartlett’s voice in Spanish.
It’s part of an ambitious new push by the streaming giant to reach non-English speaking audiences with some of its biggest podcasts.
The pilot is working with a handful of major podcasters, including Bartlett and The Ringer’s Bill Simmons, to translate shows into Spanish, with German and French versions to follow, using AI-generated voices that sound just like the real host. Spotify is using technology from OpenAI, the company behind ChatGPT, and working with the podcasts’ teams to train AI models that generate versions of the shows in a different tongue.
It’s easy to see why; Spanish is the second most popular global language in terms of number of native speakers, and Spanish-speaking countries are avid podcast consumers. According to a Statista survey, 40% of Mexicans and 37% of Spaniards are podcast listeners, representing significant markets for potential expansion. As AI continues to evolve rapidly, audio translations and voice clones could be the next frontier for podcasts seeking to break into these markets.
Charles Rossy, lead data scientist for Diary of a CEO, told PodPod that the technology could make some of the world’s biggest English-language podcasts available to even wider audiences. Rossy is French and one of his colleagues is Spanish; neither of their parents speaks English, but both would still like to share the show they work on with their families.
Prior to working with Spotify, Rossy had been translating DOAC episodes in-house for YouTube. He said that while AI translation and voice cloning are new to podcasting, the various steps involved in the process have been around for years; it’s now a matter of stitching them together in a cohesive way. This involves transcribing the show in English, translating the resulting text, and then using that output to generate audio in the new language. Some of the steps are similar to traditional dubbing, but the missing puzzle piece unlocked by recent advancements is cloning the voice of the host.
Here a model is trained on recorded samples of a speaker’s voice, picking up their tone and cadence, to generate a voice clone that mimics those subtle speech patterns and characteristics.
“You're going to feed a lot of videos of Steven with his own voice from the past so the algorithm understands the words, the tones, when he's emotional, when he's not. Then we train it, train it, train it and at this moment you'll be able to replicate the voice of that person,” Rossy says.
The podcast provides a wealth of training data for Bartlett’s voice, via its library of 400-plus episodes, but Rossy adds that the 90 minutes to two hours of audio in each episode is enough to replicate a guest’s voice too.
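Stitched together, the pipeline Rossy describes looks roughly like the sketch below. It assumes the OpenAI Python SDK for the transcription and translation stages, since Spotify says it is using OpenAI technology, while the final voice-cloning synthesis is a hypothetical placeholder; the actual toolchain behind the feature has not been published.

```python
# A rough sketch of the transcribe -> translate -> re-voice pipeline described
# above, assuming the OpenAI Python SDK (v1.x). The voice-cloning synthesis at
# the end is a hypothetical placeholder; that part of the toolchain is not public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def synthesize_with_cloned_voice(text: str, voice: str) -> bytes:
    """Hypothetical stand-in for a proprietary voice-cloning TTS step."""
    raise NotImplementedError("swap in your voice-cloning provider's synthesis call")


# 1. Transcribe the original English episode with Whisper.
#    (A full episode would need to be chunked to stay under the API's file size limit.)
with open("episode_001_en.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Translate the transcript into Spanish with a chat model.
translation = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Translate this podcast transcript into natural, spoken Spanish, "
                "preserving the speakers' tone and colloquialisms."
            ),
        },
        {"role": "user", "content": transcript.text},
    ],
)
spanish_text = translation.choices[0].message.content

# 3. Generate Spanish audio in the host's cloned voice (hypothetical step).
audio_bytes = synthesize_with_cloned_voice(spanish_text, voice="host_voice_model")

with open("episode_001_es.mp3", "wb") as out:
    out.write(audio_bytes)
```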
Humans in the loop
However, while the technology is impressive, it’s not yet foolproof, according to John Tinsley, vice-president of AI solutions at professional translation provider Translated, who has been working in the machine learning and language translation space for years. He says that these AI models are “super state of the art but they make mistakes” and stresses that the technology still needs human moderation.
“If you had no human intervention in this and you just literally put the audio in and let the automation generate the audio [in a different language], it would not be perfect,” he explains. “That's the challenge to scaling this: it’s quality.”
“To achieve equal quality in the output language, you'll need to have people involved, linguists or translators or someone who is reviewing the outputs at the different steps.”
Real-world speech is not simple, and podcasts present unique challenges for translation. There is often more than one person speaking; people talk over each other, don’t speak in grammatically perfect English with clear punctuation, pause and repeat themselves, and use slang or colloquialisms that make translation less than straightforward.
That is a challenge for the speech recognition side of the house, but all the steps are interconnected. A poor transcription of the original English will create a clunky translation into the new language and ultimately a muddled slew of words spoken by the host’s voice clone.
Tinsley says that in the case of a podcast there isn’t much room for error, due to the risk of misrepresenting what someone said, which could feed the spread of misinformation as well as pose legal risks of defamation.
Another start-up that is pushing hard in this space is ElevenLabs. The US company’s profile shot up recently when it began working with New York City Mayor Eric Adams to generate Spanish and Mandarin versions of his voice to make robocalls to his constituents. It too works on podcasts, creating voice clones for clients that can speak in various languages, but said it is wary of errors.
“We’re confident that our model translates content to a high degree of accuracy, but there is always room for improvement and we’re continually tweaking and working to improve the accuracy of our multilingual and translation tools,” an ElevenLabs spokesperson said.
“We encourage all users to check their recordings before publication and to label all AI-generated content as such, so listeners can be aware.”
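For a sense of what that workflow looks like in practice, the sketch below uses the instant voice cloning and multilingual generation helpers from an earlier (0.2.x) release of the elevenlabs Python SDK; the function names have shifted across SDK versions, so treat it as indicative rather than a definitive recipe.

```python
# Illustrative only: instant voice cloning plus multilingual speech generation,
# using module-level helpers from an earlier (0.2.x) release of the elevenlabs
# Python SDK. Newer SDK versions expose the same workflow under different names.
from elevenlabs import clone, generate, save, set_api_key

set_api_key("YOUR_ELEVENLABS_API_KEY")  # placeholder credential

# Build an "instant" voice clone from a few minutes of the host's recordings.
host_voice = clone(
    name="podcast-host-clone",
    description="Voice clone of the podcast host for translated episodes",
    files=["host_sample_1.mp3", "host_sample_2.mp3"],  # hypothetical sample files
)

# Generate Spanish audio in the cloned voice with the multilingual model.
audio = generate(
    text="Bienvenidos a un nuevo episodio del podcast.",
    voice=host_voice,
    model="eleven_multilingual_v2",
)

save(audio, "intro_es.mp3")
```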
Concerns have also been raised over the ethical and legal implications of voice cloning technology. UK performers’ union Equity has warned of potential risks to its members’ livelihoods from companies recording a performance once and then translating it into multiple languages without adequate compensation for the actor, while a voice synthesis company drew criticism earlier this year when it launched a podcast using voice clones of Steve Jobs and Joe Rogan as a PR stunt to promote its technology.
Spotify did not respond to a request for comment on how it moderates translations.
Patchy performance
While the technology is scalable for larger audiences, according to industry experts, it still needs refining, and AI models still need to train on a lot of data to mimic a voice to an adequate degree. Rossy says some researchers are working on techniques that can mimic voices based on a 60-second audio sample, but the results are patchy.
“I notice that the accent is not correct. Sometimes [the speaker] is speaking in French with an American accent, and then after it's going to be a Canadian accent, and then it's going to be a Spanish accent - so it's not very accurate yet.”
Rossy says the tech also needs to improve at conveying the emotion carried in the tone of the original English. Tinsley adds that because AI language translations are commercially driven, podcast platforms will likely focus on languages like Spanish that have large numbers of speakers.
“Languages absolutely are not equal when it comes to these technologies,” he says. “You're unlikely to hear a podcast in Swahili or Georgian or Armenian or something like that. You don't have a lot of listeners in those countries or you’re not going to generate ad revenue from the podcast in those countries.”
“You’re not going to bother because the investment that would be required to get the quality up to scratch and the amount of humans you'd have to put in the loop is not commercially viable.”