DeepL launches voice-to-voice translation suite with Zoom and Teams add-ons

This article was generated by AI and cites original sources.

DeepL, known for text translation, has released a voice-to-voice translation suite for real-time use cases such as meetings and customer service. The company is also releasing an API for outside developers and businesses to build customized experiences on top of its voice translation technology, along with platform add-ons for Zoom and Microsoft Teams. The move extends DeepL’s translation capabilities from written content into spoken conversations.

From text to voice translation

DeepL CEO Jarek Kutylowski told TechCrunch that the company saw a gap in real-time voice translation after years of work on text and documents. “After spending so many years in text translation, voice was a natural step for us,” Kutylowski said. “We have come a long way when it comes to text translation and document translation. But we thought there wasn’t a great product for real-time voice translation.”

The company’s voice suite targets several interaction contexts: meetings, mobile and web conversations, and group conversations for frontline workers via custom apps. The API is aimed at customized deployments, with call centers cited as one example.

Kutylowski highlighted a central engineering challenge for real-time systems: reducing latency (the delay between a user speaking and translated audio playing back) while maintaining accurate results. This constraint is critical for voice translation because user experience depends on how quickly translated speech becomes audible.
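One way to reason about this delay is a latency budget. The sketch below assumes a cascaded pipeline of speech recognition, text translation, and speech synthesis. That is a widely used design, but this article does not confirm it as DeepL's architecture, and the stage names and millisecond figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageTiming:
    """Time spent in one pipeline stage, in whole milliseconds."""
    name: str
    milliseconds: int


def total_latency_ms(stages):
    """Sum the per-stage delays to get the speaker-to-listener delay."""
    return sum(stage.milliseconds for stage in stages)


def within_budget(stages, budget_ms):
    """Return True if the summed delay fits inside the latency budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical figures for illustration only, not measurements of any product.
stages = [
    StageTiming("speech recognition", 300),
    StageTiming("text translation", 150),
    StageTiming("speech synthesis", 250),
]

print(total_latency_ms(stages))     # 700
print(within_budget(stages, 1000))  # True
```

Because the stage delays add up, a team working against a fixed budget can either speed up individual stages or remove a stage entirely.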

Zoom and Teams integration with early access

DeepL’s integration strategy includes add-ons for Zoom and Microsoft Teams. In these tools, listeners can either hear a real-time translation while others speak in their native languages, or follow real-time translated text on screen. The add-ons are currently in early access, and DeepL is inviting organizations to join a waitlist.

The add-ons reflect two different consumption models for translation: one that converts spoken input into translated output audio, and another that displays translation as text. Beyond video conferencing, DeepL also offers a product for mobile and web-based conversations that can take place in person or remotely.

The company also supports translated group conversations in settings such as training sessions or workshops, where participants can join by scanning a QR code.

Custom vocabulary and voice-to-voice technology

DeepL said its voice-to-voice technology can learn and adapt to custom vocabulary, including industry-specific terms and company and personal names. This capability matters for translation systems because specialized terms and proper nouns often drive errors in both text and speech translation, particularly in domains like customer service, training, and industry operations.
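A team deploying any translation layer could check whether custom vocabulary is being honored with a simple glossary test. The sketch below is a quality check, not a description of how DeepL's technology learns terms; the function name, glossary entries, and sample sentences are all hypothetical.

```python
def missing_glossary_terms(source, translation, glossary):
    """Return (source_term, target_term) pairs where the source term occurs
    in the source sentence but the required target term is absent from the
    translation. Matching is a case-insensitive substring check."""
    src = source.casefold()
    out = translation.casefold()
    return [
        (source_term, target_term)
        for source_term, target_term in glossary.items()
        if source_term.casefold() in src and target_term.casefold() not in out
    ]


# Hypothetical glossary: an industry term and a product name kept as-is.
glossary = {
    "work order": "Arbeitsauftrag",
    "Acme Cloud": "Acme Cloud",
}
source = "Please close the work order in Acme Cloud."
translation = "Bitte schließen Sie den Auftrag in Acme Cloud."

print(missing_glossary_terms(source, translation, glossary))
# [('work order', 'Arbeitsauftrag')]
```

Substring matching keeps the check short but ignores inflection and word boundaries, so a production check would need language-aware matching.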

Kutylowski noted that AI is reshaping customer service, and said that a translation layer helps companies provide support in languages for which qualified staff are scarce and expensive to hire.

Competitive landscape

DeepL is entering a competitive market for voice and translation products. Sanas, which raised $65 million from Quadrille Capital and Teleperformance, uses AI to modify a speaker’s accent in real time, primarily for call center agents. Camb.AI, a Dubai-based company, focuses on speech synthesis and translation for media and entertainment companies and works with Amazon Web Services to dub and localize video content. Palabra, backed by Seven Seven Six (the firm founded by Reddit co-founder Alexis Ohanian), is building a real-time speech translation engine designed to preserve both meaning and the speaker’s original voice.

What this means for voice translation

DeepL’s launch highlights several themes likely to influence how voice translation products develop. The company explicitly frames the core challenge as balancing latency and accuracy, a constraint that affects both model design and system architecture. Its add-ons for Zoom and Microsoft Teams point to a distribution strategy tied to where meetings already happen, which could accelerate adoption if translation performs reliably in live sessions.

DeepL’s focus on a configurable translation layer, through custom vocabulary learning and an API, suggests a push toward systems that can be deployed in specific environments, including call centers and training settings. The company’s stated aim of developing an end-to-end voice translation model that skips the intermediate text step points to a possible industry direction: fewer pipeline stages and, potentially, lower latency in real-time use.

Source: TechCrunch