Dear friends,
The ways we prompt AI in 2026 are very different than they were in 2022, when ChatGPT came out. Some people still use LLMs primarily by asking them short questions. But the models can do much more, like think for minutes, ingest many documents as context, and use web search and other tools.
I also cover intuitions about how these models work under the hood, so learners know when to trust their output and when not to. Along the way, you'll see flying squirrels, a creativity test, some of my old family photos, and fireworks.
Keep prompting!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Learn how to get more accurate answers, better writing, and more useful outputs from AI tools like ChatGPT, Claude, and Gemini. Taught by Andrew Ng, this course covers finding information, brainstorming, and building simple apps. Enroll today
News
GPT-5.5 Outperforms, Hallucinates
The latest update of OpenAI’s flagship model sets a new state of the art on important benchmarks but has difficulty distinguishing between what it does and doesn't know.
What’s new: GPT-5.5 is a closed vision-language model built for agentic coding, computer use, and knowledge work. GPT-5.5 Pro is the same model but processes reasoning tokens in parallel during inference. OpenAI set the API prices at roughly double the per-token rates of GPT-5.4.
How it works: OpenAI disclosed few details about how it built GPT-5.5. As is typical of high-performance models, the training data was a mix of publicly available data scraped from the web, licensed from partners, and collected from users and human trainers. The model was trained via reinforcement learning to reason before responding.
Performance: GPT-5.5 generally delivers top performance on objective benchmarks, especially in tests of knowledge, agentic tasks, and abstract visual reasoning. However, it falls behind competitors on subjective evaluations, and it’s more likely than its peers to confidently deliver incorrect output.
Yes, but: GPT-5.5 knows more than its peers, but it answers incorrectly more often and acknowledges ignorance less often. The AA-Omniscience benchmark poses 6,000 expert-level questions across business, law, health, humanities, science/engineering, and software engineering. It includes a “hallucination rate,” the ratio of wrong answers to the sum of wrong answers, partially wrong answers, and abstentions. By this measure, GPT-5.5 set to high reasoning hit 85.53 percent, notably worse than Claude Opus 4.7 set to max reasoning (36.18 percent) and Gemini 3.1 Pro Preview (49.87 percent). Apollo Research separately found that GPT-5.5 lied about completing an impossible programming task in 29 percent of samples, a significant jump from GPT-5.4’s 7 percent. OpenAI’s internal monitoring of coding-agent traffic showed a similar pattern.
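To make the metric concrete, here is a minimal sketch in Python. The function and variable names are ours, not AA-Omniscience’s:

```python
def hallucination_rate(wrong: int, partially_wrong: int, abstained: int) -> float:
    """Share of non-correct responses that are outright wrong rather than
    hedged or withheld, per the definition above. Illustrative only; the
    benchmark's actual scoring code may differ in details."""
    return wrong / (wrong + partially_wrong + abstained)

# A model that answers wrongly 85 times, partially wrongly 5 times, and
# abstains 10 times scores 0.85: high confidence, little acknowledged ignorance.
print(hallucination_rate(85, 5, 10))  # 0.85
```

Note that abstaining lowers the rate, so this definition rewards a model that says “I don’t know” over one that guesses.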
Security implications: OpenAI released results of VulnLMP, an internal evaluation that tests whether a model can develop exploits against widely deployed software. GPT-5.5 undertook multi-day research campaigns and identified potential memory-related vulnerabilities in a variety of targets, but it did not produce an exploit that was confirmed by OpenAI’s evaluation harness. Under OpenAI’s Preparedness Framework, this evidence places GPT-5.5 in the “high” tier of cybersecurity threats, short of the “critical” tier, which describes models that independently produce working exploits against real targets.
Why it matters: Evaluations of objective performance and human preferences are telling different stories about GPT-5.5. OpenAI regained the lead on the Artificial Analysis Intelligence Index, but the picture flips when it comes to subjective, head-to-head comparisons. Claude Opus models occupy the top spots in LMArena’s Text, Vision, Document, Search, and Code rankings, while GPT-5.5 doesn’t crack the top five in most of them. Benchmarks measure what models can accomplish; human preferences measure what they’re like to work with. Production decisions usually weigh both, and, according to the measures available so far, the two are diverging.
We’re thinking: Top AI companies continue to push the frontier at a dizzying pace. GPT-5.5 is the fourth flagship launch since February, following Anthropic’s Claude Opus 4.7, OpenAI’s GPT-5.4, and Google’s Gemini 3.1 Pro Preview. Each one reshuffled the top of the Artificial Analysis Intelligence Index, which can be viewed as a proxy for general capability in real-world tasks. Developers should design their software stacks to swap models as easily as bumping a dependency, as sketched below.
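One minimal way to build in that flexibility is to route all model access through a thin wrapper whose model name and endpoint come from configuration. A sketch, assuming an OpenAI-compatible chat-completions endpoint (which many providers expose); the environment-variable names are our own:

```python
import os
from openai import OpenAI

# The model and endpoint live in configuration, so upgrading to a new
# flagship is a config change plus an evaluation run, not a code change.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ["LLM_API_KEY"],
)

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "gpt-5.4"),  # swap via config
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```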
Big AI’s Plans Strain CO2 Pledges
Commitments by large AI companies to limit emissions of greenhouse gases are at risk as those companies pursue a massive build-out of data centers, many of which will be powered by fossil fuels in the near term and possibly beyond.
What’s new: Alphabet, Amazon, Meta, and Microsoft have begun to acknowledge that keeping up with projected demand for AI is interfering with earlier plans to stop raising the concentration of greenhouse gases in the atmosphere, Associated Press reported. (Disclaimer: Andrew Ng is a member of Amazon’s board of directors.)
How it works: Electricity consumed by top tech companies has increased significantly over the last few years, and with it their emissions of greenhouse gases that contribute to climate change, despite their ongoing reduction efforts. While they have emphasized clean sources of energy including wind, solar, geothermal, and nuclear, lately they have begun to develop natural-gas power plants to meet rapidly rising demand for AI.
Behind the news: In the years following the 2015 Paris Climate Agreement, which commits governments to limiting global warming to well below 2 degrees Celsius above pre-industrial levels, many companies signed corporate pledges to meet goals intended to slow climate change. For instance, over 600 companies signed The Climate Pledge, co-founded by Amazon and Global Optimism in 2019, which commits signatories to reaching net-zero emissions of greenhouse gases by 2040. The Science-Based Targets initiative, launched in 2015, is another corporate agreement that requires companies to set climate targets that align with the Paris Agreement. The top AI companies have embraced these principles and publish annual reports that document efforts to meet their commitments.
Why it matters: In 2024, data centers accounted for roughly 1.5 percent of electricity consumption globally and 4.4 percent in the U.S. The U.S. figure is projected to rise to as much as 12 percent within the next few years. While big AI companies thought they would have sufficient energy from clean sources, the recent sharp rise in demand is pushing them toward further reliance on fossil fuels that produce climate-changing greenhouse gases.
We’re thinking: Top AI companies have invested meaningfully in renewable energy like wind and solar and in next-generation sources like nuclear and geothermal power. However, these sources still face scaling problems, which is why companies have turned to natural-gas plants to meet growing energy demands. That’s a worrisome trend. Still, for the amount of work they do, well-run data centers are the most efficient option available, and we hope that further efficiency gains in AI will balance rising emissions.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered insurers limiting coverage for AI-related risks and OpenAI ending its exclusive agreement with Microsoft. Subscribe today!
Kimi K2.6 Challenges Open-Weights Champs
Moonshot AI’s updated Kimi model handles longer autonomous coding sessions and scales up its multi-agent orchestration relative to its predecessor.
What’s new: Kimi K2.6 is a 1 trillion-parameter vision-language model that performs neck and neck with Qwen3.6 Max Preview and the newly released DeepSeek V4 and falls just behind top closed models. It’s designed to generate code in a plan-write-test-debug loop that can last for days, and it can instantiate hundreds of agents that collaborate on a single task. It also produces fewer hallucinations than its predecessor.
How it works: Kimi K2.6 reuses the architecture introduced with Kimi K2 and refined in Kimi K2.5, including multi-head latent attention (an attention variant that reduces memory requirements by compressing keys and values) and the MoonViT vision encoder (400 million parameters). Moonshot has not disclosed how Kimi K2.6 differs from its predecessors with respect to training data and methods.
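For intuition about what that compression buys, here is a stripped-down sketch of the idea in PyTorch. This is our illustration, not Kimi’s implementation; dimensions are arbitrary, and details such as rotary embeddings and causal masking are omitted:

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Core idea of multi-head latent attention: cache one small latent
    vector per token instead of full per-head keys and values, and expand
    the latent to K and V at attention time."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compress; cache this
        self.k_up = nn.Linear(d_latent, d_model)     # expand at use time
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # One 128-dim latent per token vs. 2,048 dims of full K and V here
        latent = self.kv_down(x)  # (b, t, d_latent)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_up(latent)), split(self.v_up(latent))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))
```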
Performance: Kimi K2.6 leads open-weights models on some benchmarks of intelligence and agentic capability and ranks highly relative to its peers in subjective tests of human preferences. However, it trails leading closed models on benchmarks that evaluate reasoning and coding in large projects, as well as in human preferences.
Behind the news: The ability to stay on task across hours of autonomous execution emerged as a competitive frontier in late 2025. Anthropic’s Claude Code, OpenAI’s Codex, and Alibaba’s Qwen3-Coder all targeted this capability in their most recent releases. Kimi K2, released in July 2025, was an early open-weights entrant in agentic tool use, and the family has been updated every few months since with growing emphasis on long-horizon execution.
Why it matters: Moonshot has steadily extended the duration over which Kimi K2 family models can usefully execute tasks autonomously: first short reasoning traces, then multi-step tool use, multi-hour coding sessions, and now multi-day projects. Each extension widens the interval between human check-ins required to keep agents on track.
We’re thinking: Sustained autonomy and low hallucination rates are related, but decreasingly so: if an agent makes a mistake, it can check its work, find the error, and fix it.
Strategic Thinking in LLMs vs. Humans
While large language models can behave in human-like ways, the similarities are superficial. A simple strategy game revealed clear differences in their strategic approaches.
What’s new: Caroline Wang and colleagues at the University of Texas at Austin and Google interpreted patterns of decision-making by humans and LLMs as they played the classic game rock-paper-scissors. They found that LLMs sometimes model their opponents with greater sophistication than people do.
Key insight: Given recorded gameplay, an LLM can iteratively improve code that predicts a player’s next move. If the code predicts the player’s actions with significant accuracy, we can assume that its decision-making algorithms are functionally similar to those the player used. Computer code is interpretable, making it possible to discern such algorithms and compare those used by humans and LLMs.
How it works: In games of rock-paper-scissors, the authors pitted individual LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5.1, and GPT-OSS 120B) against each of 15 preprogrammed bots of varying complexity. They recorded each player's moves in 20 games of 300 sequential rounds each. Previous work provided similar records of games between humans and the same bots. The authors tracked the round-by-round choices made by each player (AI and human) and whether they won, lost, or tied. Then they used AlphaEvolve, an agentic method that iteratively optimizes code through an evolutionary process, to improve Python programs that predicted the next move for each LLM individually and for humans as a group.
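For intuition, here is a toy example of the kind of small, interpretable predictor such a pipeline might evolve. This is our illustration, not code from the paper:

```python
import random
from collections import Counter

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def predict_next(history: list[str], order: int = 2) -> str:
    """Predict a player's next move by looking up what they most often
    played after the same recent pattern of moves."""
    if len(history) <= order:
        return random.choice(MOVES)
    pattern = tuple(history[-order:])
    followers = Counter(
        history[i + order]
        for i in range(len(history) - order)
        if tuple(history[i:i + order]) == pattern
    )
    return followers.most_common(1)[0][0] if followers else random.choice(MOVES)

def counter_move(history: list[str]) -> str:
    """Play whatever beats the predicted next move."""
    return BEATS[predict_next(history)]
```

Because such a program is ordinary code, researchers can read off the strategy it encodes (here, a fixed-order pattern matcher) and compare the strategies that best fit each LLM and the human players.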
Results: Using game data that AlphaEvolve didn’t process, the authors compared how well each program predicted its player’s moves. Then they examined the programs to determine what strategies each player used.
Why it matters: While researchers have found ways to understand some aspects of neural network behavior, large language models remain black boxes in many ways. Synthesizing code directly from LLM behavior offers a powerful tool to interpret their decision-making.
We’re thinking: It’s tempting to assume that LLMs learn to mimic human behavior as represented by their training data. Finding that they can encode a gaming strategy more systematically than the average human demonstrates a different sort of learning.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.