Dear friends,
Voice-based AI is improving rapidly, yet most people still don’t appreciate how pervasive voice UIs (user interfaces) will become. Today, we use a keyboard and mouse to control most desktop and web applications. In the future, I hope we will also be able to talk to many of these applications to steer them. I’m particularly excited about the work of Vocal Bridge (an AI Fund portfolio company), where CEO Ashwyn Sharma is leading the way to provide developer tools that enable this.
I’ve written about the tradeoff between latency and intelligence. The core problem is that while voice-in-voice-out models have low latency (which is important for verbal communication), they are hard to control and suffer from low reliability/intelligence. In comparison, a speech-to-text → LLM/agentic AI → text-to-speech pipeline gives high reliability but introduces excessive latency. Vocal Bridge implemented a custom architecture that uses a foreground agent to converse with the user in real time — thus ensuring low latency — and a background agent to manage a complex agentic workflow, reason, apply guardrails, call tools, and whatever else is needed to produce high-quality answers and actions — thus ensuring high intelligence.
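The foreground/background split can be sketched in a few lines. This is a toy illustration, not Vocal Bridge’s actual API: the agent names, the filler phrase, and the simulated delay are all invented for the example. The point is that the slow, reasoned answer is computed concurrently while the foreground keeps the conversation moving.

```python
import asyncio

# Toy sketch (not Vocal Bridge's API): a foreground path acknowledges the
# user immediately while a background task runs the slow agentic workflow.

async def background_agent(question: str) -> str:
    # Stand-in for the slow path: reasoning, guardrails, tool calls.
    await asyncio.sleep(0.05)  # simulates a multi-second workflow
    return f"Here is a carefully checked answer to: {question}"

async def handle_turn(question: str, speak) -> None:
    # Start the slow background work without blocking the conversation.
    answer_task = asyncio.create_task(background_agent(question))
    await speak("Good question, give me a moment.")  # low-latency filler
    await speak(await answer_task)                   # high-quality answer

async def main():
    transcript = []
    async def speak(text):  # stand-in for text-to-speech output
        transcript.append(text)
    await handle_turn("What's 17% of 3,200?", speak)
    return transcript
```

In a real system the foreground agent would itself be a small, fast model that chats naturally rather than emitting a canned filler line, but the concurrency pattern is the same.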
I don't expect voice UIs to completely replace older interfaces. Instead, they will complement them, just as the mouse complements the keyboard. In some contexts, such as when working in close proximity to others, users will prefer to type rather than speak. But the potential for voice UIs goes well beyond the currently dominant use cases of automating call centers and providing an alternative to typing. In my math-quiz app, the application can speak and also update the questions and animations shown on the screen in response to spoken (or typed) inputs. This multimodal visual+voice interaction creates a much richer user experience than the voice-only interactions that many voice AI companies have focused on. One key to making it work is a background-agent loop that both receives input from the UI and calls tools to update it.
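To make the UI loop concrete, here is a minimal sketch of an agent that drives a screen through tool calls. Everything here is hypothetical — the `QuizUI` class, the tool names `show_question` and `play_animation`, and the hard-coded logic are invented for illustration; a real background agent would be an LLM choosing among these tools.

```python
# Hypothetical sketch of a bidirectional UI loop: the agent receives
# events/input from the screen and calls tools that update it.

class QuizUI:
    def __init__(self):
        self.screen = {}   # what is currently displayed

    # Tools the agent can call to drive the display.
    def show_question(self, text):
        self.screen["question"] = text

    def play_animation(self, name):
        self.screen["animation"] = name

def agent_step(ui, user_input):
    """One turn of a background-agent loop: interpret the (spoken or
    typed) input, call UI tools, and return a spoken reply."""
    if user_input == "next question":
        ui.show_question("What is 6 x 7?")
        ui.play_animation("card_flip")
        return "Here's your next one. What is six times seven?"
    return "Say 'next question' when you're ready."
```

The spoken reply and the screen update come from the same turn of the loop, which is what makes the visual and voice channels feel coordinated.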
Voice UIs will be an important building block for AI applications. Only a minuscule fraction of the world's developers have ever created a voice app, so this is fertile ground for building. If you’d like to try adding voice to an application, try out Vocal Bridge for free here.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
We just released the AI Dev 26 agenda! Hear from teams at Google DeepMind, Oracle, AMD, and more across two days of talks, workshops, and demos, hosted by Andrew Ng. See what’s planned and start mapping your schedule.
News
Inside Claude Code
The inner workings of the popular coding agent Claude Code are available for all to see.
What’s new: A recent version of Claude Code’s Node.js package accidentally included a key that revealed the code behind its command-line interface. Chaofan Shou, an intern at the blockchain startup Solayer Labs, unlocked the code and published it. Engineers rapidly deciphered its secrets.
What happened: Typically, when a software company publishes closed-source code, a bundler tool scrambles the source files. But when Anthropic published version 2.1.88 to Claude Code’s npm registry on March 30, it included a source map file that serves as a translation key to decode the files.
How Claude Code works: Engineers who studied the source code say Claude Code is built less like a chatbot wrapper and more like a small, dedicated operating system.
Future capabilities?: The source map also reveals some of Anthropic’s possible plans for Claude. For instance, several undisclosed features sit behind flags that compile to “false” in the published build, a sign that they are currently in progress and may be included in a future release.
Why it matters: The leak offers a peek under the hood of one of the most advanced and popular agentic systems available. We can see how Claude Code works and how it may work in the near future, revise our own systems to match, or differentiate our products by making different choices.
We’re thinking: The AI community is rightly concerned that software agents can inadvertently delete codebases or publish private files. Humans can, too!
OpenAI Exits Video Generation
OpenAI plans to shut down its video generator Sora in a sudden retreat from the video market.
What’s new: OpenAI will discontinue Sora, a high-profile follow-up to ChatGPT that the company had hoped would become another mass-market sensation, to reallocate resources to more profitable investments, The Wall Street Journal reported. Access to the model via web and app will end on April 26, and the API will close on September 24. The Sora team will be redirected to longer-term projects such as world models and robotics. In addition, OpenAI will consolidate its browser, the coding tool Codex, and the ChatGPT app into a single desktop application, The Wall Street Journal wrote in a separate report.
How it works: Sora produces high-definition videos up to 25 seconds long that earned acclaim for their realism and visual quality. However, generating each clip takes minutes and requires far more processing power than producing text or images. OpenAI previewed the model in February 2024. It updated the model and made it available via an iOS app in September 2025.
Behind the news: In late 2025, OpenAI took advantage of Sora to form a high-profile partnership with Disney. OpenAI would license Disney characters and train its models on Disney footage, and Disney would invest up to $1 billion in OpenAI. Disney planned to show Sora videos on its streaming service Disney+ and use Sora to help create pre-production visualizations, marketing campaigns, and special effects. With Sora’s impending demise, the partnership is effectively over.
Why it matters: OpenAI has surrendered leadership in video generation, clearing the way for other companies — among several strong contenders — to vie for dominance. When it launched Sora two years ago, OpenAI envisioned another ChatGPT moment. It wanted its generated videos to thrill the mass market and achieve maximum cultural impact. But the arithmetic didn’t make sense. Video generation didn’t attract as many paid subscribers as applications for business and coding, and the costs of training and running video models proved too great to bear.
We’re thinking: The era in which an AI demo — however impressive — is sufficient to establish leadership may be drawing to a close. The field is maturing rapidly, and creating sustainable value is becoming a top priority.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. Last week, we covered Perplexity’s expansion into autonomous AI agents across desktop and enterprise tools and Nvidia’s secure infrastructure for deploying AI agents in production. Subscribe today!
Gemini’s Music Generator
Google added a music generator to Gemini and YouTube, putting a model that produces synthetic songs in front of hundreds of millions of users.
What’s new: Lyria 3 takes text descriptions or images and generates 30-second audio clips that can include instruments, singing voices, and song lyrics in several languages. Google took measures to ensure that the model’s output doesn’t violate copyrights: licensing its training data, filtering outputs for similarity to copyrighted works, and avoiding reproduction of an artist’s sonic likeness.
How it works: Google disclosed only a high-level overview of Lyria 3’s architecture and training. Like latent diffusion image generators, which produce images by removing noise from embeddings of pure noise, Lyria 3 removes noise from representations of audio during a given slice of time. The Batch previously described an audio diffusion process developed by Stability AI as well as Google’s earlier MusicLM music generation method.
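The latent-diffusion idea described above can be sketched schematically. This is a toy illustration, not Lyria 3’s architecture: the `denoise` function stands in for a trained model that, in practice, predicts the noise to remove conditioned on the text prompt, and the latent here is just a small array.

```python
import numpy as np

# Schematic sketch of latent diffusion (not Lyria 3's actual model):
# generation starts from a pure-noise latent, and a learned denoiser
# removes noise over a series of steps.

def denoise(latent, step):
    # Stand-in for the trained denoiser; a real model predicts the
    # noise component to subtract, conditioned on the prompt.
    return latent * 0.5  # toy version: shrink toward a clean signal

def generate_audio_latent(shape=(8,), steps=10, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)  # start from pure noise
    for step in range(steps):
        latent = denoise(latent, step)   # iteratively remove noise
    return latent  # a real system would decode this to a waveform
```

For audio, each latent represents a slice of time, so the model denoises a sequence of such slices rather than a single image-like grid.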
Behind the news: Lyria 3 arrives as the music industry aggressively sues developers of AI music generators over alleged copyright violations. The leading music generators, Suno and Udio, no longer generate music from scratch, leaving Google among a dwindling number of developers that do.
Why it matters: Music generation is finding its place in an entertainment industry dominated by large, powerful incumbents. Lyria 3 puts it in front of more than 750 million Gemini users, dwarfing the current user bases of Suno (around two million paid subscribers) and Udio (around 3.3 million monthly users). It continues to produce original music — the direction that put Suno and Udio in the crosshairs of the world’s biggest recording companies — but adds safeguards, such as training on licensed music, to avoid aggravating copyright holders.
We’re thinking: Music generators produce impressive, versatile, surprisingly human-like output, yet we’re still waiting for generated music to have its ChatGPT moment. It may happen quietly as, say, producers of YouTube clips increasingly use Lyria 3 rather than pre-recorded sources.
Learning Long Context at Inference
Large language models typically become less accurate and slower when they process longer contexts, but researchers enabled an LLM to keep accuracy stable and inference time constant as its context grew.
What’s new: Arnuv Tandon, Karan Dalal, and colleagues at the nonprofit Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced Test-Time Training, End-to-End (TTT-E2E), a method that compresses context into a transformer’s weights by training it during inference.
Key insight: LLMs built on the transformer architecture attend to the entire context (all tokens input and output so far) to generate the next output token. Thus, generating each new output token takes more processing than the last, potentially making inference expensive and slow. Instead of attending to the entire context, a transformer can restrict attention to a smaller window of fixed size — which keeps the time required to generate each output token constant — and learn from the context by updating its weights.
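The cost argument in the key insight can be made concrete with a toy calculation. This sketch is illustrative, not the authors’ code; it just counts how many positions a token attends to under full versus sliding-window attention.

```python
# Toy sketch of the key insight: with a sliding window of size W, each
# new token attends to at most W positions, so per-token cost stays
# constant as the context grows; full attention grows without bound.

def attended_positions(t, window):
    """Positions token t attends to under sliding-window attention."""
    return list(range(max(0, t - window + 1), t + 1))

def per_token_cost(t, window=None):
    """Number of positions attended to when generating token t."""
    if window is None:       # full attention: grows with context length
        return t + 1
    return len(attended_positions(t, window))
```

With a window of 8,000 (as in the paper’s model), token 10,000 costs the same as token 100,000, whereas full attention’s cost keeps climbing.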
How it works: The authors built a 3 billion-parameter transformer that implemented sliding-window attention, which restricted attention to a fixed window of 8,000 tokens. They pretrained the model on sequences of 8,000 tokens — 164 billion tokens total — drawn from a filtered dataset of text scraped from the web. To enable it to track longer contexts, they fine-tuned it on sequences of up to 128,000 tokens drawn from the Books subset of The Pile. The authors used a form of meta-learning, or learning how to learn; in this case, the model learns how to learn from input provided at inference time.
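The "learning from the context by updating weights" step can be illustrated with a deliberately simple stand-in. This is not TTT-E2E’s actual procedure — the loss, learning rate, and vector-valued "weights" are toys — but it shows the mechanism: each chunk of context triggers a small gradient step, so information lands in a fixed-size set of weights rather than an ever-growing attention cache.

```python
import numpy as np

# Toy illustration of test-time training (not the paper's code): the
# model "reads" each context chunk by taking a gradient step on a
# simple loss, storing what it read in its weights.

def loss_grad(weights, chunk):
    # Toy loss 0.5 * ||weights - chunk||^2; its gradient pulls the
    # weights toward the chunk. A real model would use something like
    # next-token prediction loss on the chunk.
    return weights - chunk

def read_context(weights, chunks, lr=0.5):
    for chunk in chunks:
        weights = weights - lr * loss_grad(weights, chunk)
    return weights  # fixed-size summary of arbitrarily long context
```

However long the context, the "memory" stays the same size, which is why inference time can remain constant; the price, as the paper notes, is paid at training time, where the model must learn how to learn this way.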
Results: The authors compared TTT-E2E to a transformer with conventional attention as well as highly efficient architectures such as Mamba 2 (a recurrent neural network-style model) and Gated DeltaNet (which uses a custom form of linear attention). Its accuracy slightly exceeded that of the transformer over long contexts — except on Needle-in-a-Haystack, which involves recovering a short target string from a long context — and it generated output tokens as rapidly as the more-efficient architectures as context grew. Its exceptional inference speed came at the cost of slower and more complex training.
Why it matters: Learning at inference offers an approach to processing long contexts that’s simpler than designing custom attention mechanisms or recurrent architectures. This work reframes the problem as a trade-off between training and inference: Processing at inference is less expensive and more consistent per token, but training is slower.
We’re thinking: This model took it to heart when we said: Keep learning!
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.