Dear friends,
The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future letters.
However, this process introduces latency, and users of voice applications are very sensitive to latency. When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniel) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was. Even after a year of tuning our system — starting with iterating on multiple, long, mega-prompts and eventually developing complex agentic workflows — it remains a work in progress. You can play with it here.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AIAI coding agents do more than autocomplete. They help you debug, refactor, and design applications. Learn how coding agents work under the hood, so you can streamline your projects and build applications such as a Wikipedia data-analysis app! Enroll Now.
News
Reading Minds, No Brain Implant RequiredTo date, efforts to decode what people are thinking from their brain waves often relied on electrodes implanted in the cortex. New work used devices outside the head to pick up brain signals that enabled an AI system, as a subject typed, to accurately guess what they were typing. What’s new: Researchers presented Brain2Qwerty, a non-invasive method to translate brain waves into text. In addition, their work shed light on how the brain processes language. The team included people at Meta, Paris Sciences et Lettres University, Hospital Foundation Adolphe de Rothschild, Basque Center on Cognition, Brain and Language, Basque Foundation for Science, Aix-Marseille University, and Paris Cité University. Gathering brainwave data: The authors recorded the brain activity of 35 healthy participants who typed Spanish-language sentences. The participants were connected to either an electroencephalogram (EEG), which records the brain’s electrical activity via electrodes on the scalp, or a magnetoencephalogram (MEG), which records magnetic activity through a device that surrounds the head but isn’t attached. 15 participants used each device and five used both.
Thoughts into text: Brain2Qwerty used a system made up of a convolutional neural network, transformer, and a 9-gram character-level language model pretrained on Spanish Wikipedia. The system classified the text a user typed from their brain activity. The authors trained separate systems on MEG and EEG data.
Results. The authors’ MEG model achieved 32 percent character error rate (CER), much higher accuracy than the EEG competitors. Their EEG system outperformed EEGNet, a model designed to process EEG data that had been trained on the authors’ EEG data. It achieved 67 percent CER, while EEGNet achieved 78 percent CER. Behind the news: For decades, researchers have used learning algorithms to interpret various aspects of brain activity with varying degrees of success. In recent years, they’ve used neural networks to generate text and speech from implanted electrodes, generate images of what people see while in an fMRI, and enable people to control robots using EEG signals. Why it matters: In research into interpreting brain signals, subjects who are outfitted with surgical implants typically have supplied the highest-quality brain signals. fMRI scans, while similarly noninvasive, are less precise temporally, which makes them less useful for monitoring or predicting language production. Effective systems based on MEG, which can tap brain signals precisely without requiring participants to undergo surgery, open the door to collecting far more data, training far more robust models, and conducting a wider variety of experiments. We’re thinking: The privacy implications of such research may be troubling, but keep in mind that Brain2Qwerty’s MEG system, which was the most effective approach tested, required patients to spend extended periods of time sitting still in a shielded room. We aren’t going to read minds in the wild anytime soon.
Big AI Spending Continues to RiseTop AI companies announced plans to dramatically ramp up their spending on AI infrastructure. What’s new: Alphabet, Amazon, Meta, Microsoft, and others will boost their capital spending dramatically in 2025, pouring hundreds of billions of dollars into data centers where they process AI training, the companies said in their most recent quarterly reports. The surge suggests that more-efficient approaches to training models won’t dampen the need for greater and greater processing power. How it works: Capital expenditures include long-term purchases like land, buildings, and computing hardware rather than recurring costs like salaries or electricity. The AI leaders signaled that most of this spending will support their AI efforts.
Behind the news: DeepSeek initially surprised many members of the AI community by claiming to have trained a high-performance large language model at a fraction of the usual cost.
Why it matters: DeepSeek-R1’s purported training cost fueled fears that demand for AI infrastructure would cool, but the top AI companies’ plans show that it’s not happening yet. A possible explanation lies in the Jevons Paradox, a 19th-century economic theory named after the English economist William Stanley Jevons. As a valuable product becomes more affordable, demand doesn’t fall, it rises. According to this theory, even if training costs tumble, the world will demand ever greater processing power for inference.
Learn More About AI With Data Points!AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered Microsoft building a generative world-action model and Anthropic releasing Claude 3.5 Sonnet as a hybrid reasoning model. Subscribe today!
Deepfake Developers Appropriate Celebrity LikenessesA viral deepfake video showed media superstars who appeared to support a cause — but it was made without their participation or permission. What’s new: The video shows AI-generated likenesses of 20 Jewish celebrities including from Scarlett Johansson to Simon & Garfunkel. They appear wearing T-shirts that feature a middle finger inscribed with the Star of David above the word “KANYE.” The clip, which ends with the words “Enough is enough” followed by “Join the fight against antisemitism,” responds to rapper Kanye West, who sold T-shirts emblazoned with swastikas on Shopify before the ecommerce platform shut down his store. Who created it: Israeli developers Guy Bar and Ori Bejerano generated the video to spark a conversation about antisemitism, Bar told The Jerusalem Post. The team didn’t reveal the AI models, editing tools, or techniques used to produce the video. Johansson reacts: Scarlett Johansson denounced the clip and urged the U.S. to regulate deepfakes. In 2024, she objected to one of the voices of OpenAI’s voice assistant, which she claimed resembled her own voice, leading the company to remove that voice from its service. The prior year, her attorneys ordered a company to stop using an unauthorized AI-generated version of her image in an advertisement. Likenesses up for grabs: Existing U.S. laws protect some uses of a celebrity’s likeness in the form of a photo, drawing, or human lookalike, but they don’t explicitly protect against reproduction by AI systems. This leaves celebrities and public figures with limited recourse against unauthorized deepfakes.
Why it matters: Non-consensual deepfake pornography is widely condemned, but AI enables many other non-consensual uses of someone’s likeness, and their limits are not yet consistently coded into law. If the creators of the video that appropriated the images of celebrities had responded to Johansson’s criticism with an AI-generated satire, would that be a legitimate exercise of free speech or another misuse of AI? Previously, an ambiguous legal framework may have been acceptable because such images, and thus lawsuits arising from them, were uncommon. Now, as synthetic likenesses of specific people become easier to generate, clear legal boundaries are needed to keep misuses in check. We’re thinking: Creating unauthorized lookalikes of existing people is not a good way to advance any cause, however worthy. Developers should work with businesses policymakers to establish standards that differentiate legitimate uses from unfair or misleading exploitation.
Reasoning in Vectors, Not TextAlthough large language models can improve their performance by generating a chain of thought (CoT) — intermediate text tokens that break down the process of responding to a prompt into a series of steps — much of the CoT text is aimed at maintaining fluency (such as “a”, “of”, “we know that”) rather than reasoning (“a² + b² = c²”). Researchers addressed this inefficiency. What’s new: Shibo Hao, Sainbayar Sukhbaatar, and colleagues at Meta and University of California San Diego introduced Coconut (Chain of Continuous Thought), a method that trains large language models (LLMs) to process chains of thought as vectors rather than words. Key insight: A large language model (LLM) can be broken into an embedding layer, transformer, and classification layer. To generate the next text token from input text, the embedding layer embeds the text; given the text, the transformer outputs a hidden vector; and the classification layer maps the vector to text-token probabilities. Based on these probabilities, a decoding algorithm selects the next token to generate, which feeds back into the input text sequence to generate the next vector, and so on. When a model generates a CoT, committing to a specific word at each step limits the information available to the meanings of the words generated so far, while a vector could represent multiple possible words. Using vectors instead of text enables the CoT to encode richer information. How it works: The authors built three LLMs by fine-tuning a pre-trained GPT-2 on three datasets of prompts, CoTs, and final outputs: GSM8k (grade-school math word problems); ProntoQA (questions and answers about fictional concepts expressed in made-up words, including synthetic CoTs in natural language); and (3) ProsQA, a more challenging question-answering dataset introduced by the authors, inspired by ProntoQA but with longer reasoning steps.
Results: The authors compared their method to a pretrained GPT-2 that was fine-tuned on the same datasets to predict the next word, including reasoning.
Yes, but: On GSM8k, Coconut achieved 34.1 percent accuracy, while the baseline LLM achieved 42.9 percent. However, it generated significantly fewer vectors and tokens than the CoT generated tokens. Coconut generated 8.2 vectors on average compared to the baseline LLM’s 25 text tokens. Why it matters: A traditional CoT commits to a single word at each step and thus encodes one reasoning path in a single CoT. Vectors are less interpretable to humans than language, but the model’s output layer can still decode the thought vectors into probabilities over tokens. Further, inspecting the distribution of words stored along all continuous CoT vectors offers a way to understand multiple potential thought paths stored in one continuous CoT. We’re thinking: LLMs typically learn to reason over text, mainly because text data is widely available to train on. In contrast, neuroscience shows that the part of the human brain responsible for language largely goes quiet during reasoning tasks, which suggests that explicit language is not a key mechanism for reasoning. Coconut takes an intriguing step to enable LLMs to explore representations that don’t encode the limitations of language.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.
|