Dear friends,
In last week’s letter, I explained why effective agentic AI development needs a disciplined evals and error analysis process, and described an approach to performing evals. This week, I’d like to summarize the core ideas behind error analysis and describe some best practices. Given the rapid pace of improvement in LLMs, when error analysis points to a problem, you have more options for addressing it than ever before. Let me explain.
Specifically, it is fine to start by reading one or just a handful of traces informally to get a sense of what might be going wrong. For example, if you see that the web search query terms in your Deep Researcher — step (i) above — frequently make no sense, that points you to an initial area to focus your improvement efforts on. As the system matures, you can move incrementally toward more rigorous error analysis. For example, you might eventually end up with a regularly refreshed dataset of thousands of examples where performance is poor, and carry out rigorous evaluations that show exactly what percentage of the time each of steps (i) through (iv) contributed to problems with the final output, as well as the specific ways those steps fell short.
This type of analysis is extremely useful for deciding where to focus your efforts to improve the overall agentic workflow’s performance!
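As a concrete illustration (not from the letter itself), here is a minimal sketch of what such a tally might look like in code, assuming you have hand-labeled failing traces with the pipeline step judged responsible; the step names, field names, and data layout are hypothetical.

```python
from collections import Counter

# Hypothetical labels from manual error analysis of a Deep Researcher agent.
# Each trace that produced a poor final output is annotated with the step
# (i)-(iv) judged responsible; step names here are illustrative only.
annotated_traces = [
    {"trace_id": "t001", "failed_step": "i_web_search_query"},
    {"trace_id": "t002", "failed_step": "iii_synthesis"},
    {"trace_id": "t003", "failed_step": "i_web_search_query"},
    # ... thousands more in a mature evals process
]

def error_breakdown(traces):
    """Return the fraction of failing traces attributed to each step."""
    counts = Counter(t["failed_step"] for t in traces)
    total = len(traces)
    return {step: n / total for step, n in counts.most_common()}

if __name__ == "__main__":
    for step, frac in error_breakdown(annotated_traces).items():
        print(f"{step}: {frac:.0%} of failures")
```

A breakdown like this makes it obvious which step deserves attention first, and refreshing the labeled dataset regularly keeps the picture current as the system changes.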
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
“Governing AI Agents,” a short course built in collaboration with Databricks, shows you how to integrate data governance into agentic workflows to handle data safely, securely, and accurately. Enroll for free
News
Reasoning Without “Thinking”
Reasoning models typically learn to undertake a separate process of “thinking” through their output before they produce a final response. Ant Group built a top non-reasoning model that can take similar steps as part of its immediate response.
What’s new: Ant Group, an affiliate of Alibaba and owner of the online payments provider Alipay, released Ling-1T, a huge, open, non-reasoning model that outperforms both open and closed counterparts.
How it works: The team emphasized chain-of-thought reasoning in both the pretraining and fine-tuning phases of development, but it didn't train the model to undertake a separate reasoning, or thinking, process before producing its final output. This means the model can reason selectively depending on the input.
Results: In Ant Group’s tests, Ling-1T generally outperformed three top non-reasoning models: DeepSeek-V3.1-Terminus (thinking mode disabled), Moonshot Kimi-K2-Instruct, and OpenAI GPT-5 (thinking mode disabled), as well as Google Gemini 2.5 Pro set to minimum thinking (128 tokens).
Yes, but: The team published results on only one agentic benchmark and acknowledges limited performance in this area. It says it will improve agentic performance in future releases.
Behind the news: Concurrently with Ling-1T, Ant Group released a finished version of its 1 trillion-parameter reasoning model, Ring-1T, which was available previously as a preview. While Ling-1T’s performance exceeded that of top non-reasoning models, Ring-1T achieved second-place performance relative to reasoning models on almost every benchmark tested.
Why it matters: Ling-1T generally outperforms the mighty Kimi K2 and closes the gap between open and closed non-reasoning models. An enormous parameter count and pretraining that emphasized chains of thought appear to have been key factors in this accomplishment. That training primes Ling-1T to generate a chain of thought before it concludes a response, though not in a separate reasoning stage, blurring the line between reasoning and non-reasoning models.
We’re thinking: Two years ago, weights for Ling-family models were closed, but in the past year Ant Group has released open weights for several. With consistent effort and investment, Ling has gone from a family that few had heard of to challenging the top dogs.
MCP Poses Security Risks
The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows.
What’s new: Golan Yosef at Pynt, an API security firm, analyzed security risks of Model Context Protocol (MCP) servers. The work shows that when systems use multiple MCP servers, vulnerabilities rise rapidly.
How it works: MCP’s flexible, modular, dynamic design is a double-edged sword. It supports open-ended agentic interactions, but those very qualities make MCP servers vulnerable to exploitation. The study assessed security risks across more than 280 popular servers.
Results: The study identified widespread patterns of vulnerability that compound as systems add MCP servers.
Behind the news: Anthropic launched MCP in November 2024, and OpenAI and Microsoft adopted it by spring 2025. Despite its lax security, the protocol now connects to over 6,000 servers. Authentication remained optional until March, when the OAuth 2.1 authorization framework was added. The change prevents unauthorized access to MCP servers, but it doesn’t prevent malicious or malformed data from flowing between servers and triggering unintended actions.
Why it matters: Securing individual MCP servers is important but not sufficient, because vulnerabilities can emerge from interactions among servers. Adding more servers can make a system more agentic, but it also compounds vulnerabilities. The study suggests that developers mitigate this “compositional risk” by using only the servers they need, constraining what each one is allowed to do, and testing transfers of data among them.
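To make the mitigation advice concrete, here is a minimal sketch (not from the study, and not part of any MCP SDK) of one way to constrain what each server is allowed to do: route every tool call through an explicit allowlist. The server names, tool names, and guard function are hypothetical; real MCP clients expose their own configuration mechanisms.

```python
# Hypothetical guard layer for agent tool calls. It only illustrates the
# "use only the servers you need and constrain what each one may do" advice.
ALLOWED_TOOLS = {
    "filesystem": {"read_file"},   # no write or delete
    "web_search": {"search"},      # search only, no arbitrary page fetch
}

class ToolCallBlocked(Exception):
    """Raised when an agent tries a server or tool outside the allowlist."""

def guard_tool_call(server: str, tool: str, arguments: dict) -> dict:
    """Reject calls to servers or tools that are not explicitly allowlisted."""
    if server not in ALLOWED_TOOLS or tool not in ALLOWED_TOOLS[server]:
        raise ToolCallBlocked(f"Blocked call: {server}.{tool}")
    # In a real system, forward the call to the MCP client here.
    return {"server": server, "tool": tool, "arguments": arguments}

# Example: this call passes the guard...
guard_tool_call("web_search", "search", {"query": "MCP security"})
# ...while this one would raise ToolCallBlocked:
# guard_tool_call("filesystem", "delete_file", {"path": "/tmp/x"})
```

A guard like this addresses only one layer; testing how data flows between servers, as the study recommends, still requires system-level review.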
We’re thinking: Securing individual components is a tough task in its own right, but systems of MCP components must be secured at the system level.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered Anthropic’s launch of Skills to customize Claude and the update of its small model Haiku to version 4.5. Subscribe today!
California Builds AI Regulatory Regime
In the absence of national laws that specifically regulate AI in the United States, California moved to regulate the technology within its own borders, passing four bills in less than a month.
What’s new: Governor Gavin Newsom signed into law SB 53, which requires large AI developers to disclose their safety protocols. In addition, SB 243 regulates chatbots, AB 316 makes developers liable for the actions of autonomous systems they build, and AB 853 requires AI-generated media to be labeled clearly.
How it works: Together, the bills don’t ban any particular applications outright or restrict AI development, but they require extensive disclosures, either to the state or directly to users. Some took effect immediately while others, such as SB 243, will phase in by January 2027.
What they’re saying: Reaction among AI developers has been mixed. SB 53 drew the loudest and most widely varied commentary.
Behind the news: SB 53 modifies parts of SB 1047, which Governor Newsom vetoed in 2024 after opposition from the tech community. That bill would have required third-party audits and made companies liable for the uses of their models. Recently, Newsom also vetoed SB 7, which would have required employers to notify employees and applicants if AI systems were used to make employment decisions like hiring and firing.
Why it matters: California is the largest U.S. state by population and economy, as well as home to many of the world’s most prominent tech companies and startups, including Google, OpenAI, and Anthropic. These laws will affect users of California-based tech worldwide, along with companies that do business in the state.
We’re thinking: While these laws are better for users, innovators, and businesses than the vetoed SB 1047, some of them perpetuate a major mistake of that legislation by placing regulatory burdens on models rather than applications. A model’s potential applications are unknown until someone implements them, and it makes no sense to limit — or burden with disclosure requirements — the good it might do. Applications, on the other hand, bring verifiable benefits and harms, and society would do well to limit the harms.
Better Agentic Prompts Automatically
Honing an agent’s prompt can yield better results than fine-tuning the underlying large language model via reinforcement learning.
Results: The authors pitted custom and open-source agents whose prompts were optimized with GEPA against versions for which the underlying Qwen3-8B model was fine-tuned on a given benchmark via Group Relative Policy Optimization (GRPO). They measured both the agents’ performance and the number of agent executions required.
Yes, but: The authors compared GEPA to fine-tuning via reinforcement learning using a single, relatively small model. Questions remain about whether the results would hold for larger or different models, and how GEPA would compare to supervised fine-tuning.
Why it matters: Methodically revising prompts can help agents perform better than fine-tuning via reinforcement learning, and it requires far fewer examples and executions.
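As a rough illustration of the general idea of methodical prompt revision (not GEPA’s actual algorithm), the sketch below scores a prompt on a small dev set, asks an LLM to propose a revision based on the failures, and keeps whichever prompt scores higher. The call_llm and run_agent helpers are hypothetical stand-ins for your own model and agent harness.

```python
# Minimal sketch of iterative prompt revision. Illustrative only; it does not
# reproduce GEPA. call_llm and run_agent are hypothetical stand-in stubs.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM API you use; returns generated text."""
    raise NotImplementedError

def run_agent(system_prompt: str, task: dict) -> bool:
    """Stand-in that runs the agent on one task and returns pass/fail."""
    raise NotImplementedError

def score(system_prompt: str, dev_set: list[dict]) -> float:
    """Fraction of dev-set tasks the agent passes with this prompt."""
    return sum(run_agent(system_prompt, t) for t in dev_set) / len(dev_set)

def revise_prompt(prompt: str, dev_set: list[dict], rounds: int = 5) -> str:
    """Repeatedly ask an LLM to rewrite the prompt, keeping the best scorer."""
    best_prompt, best_score = prompt, score(prompt, dev_set)
    for _ in range(rounds):
        failures = [t for t in dev_set if not run_agent(best_prompt, t)]
        candidate = call_llm(
            "Here is an agent's system prompt and tasks it failed.\n"
            f"Prompt:\n{best_prompt}\n\nFailed tasks:\n{failures}\n\n"
            "Rewrite the prompt to fix these failures. Return only the prompt."
        )
        candidate_score = score(candidate, dev_set)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt
```

Each round here costs only a handful of agent executions on the dev set, which is why approaches in this spirit can need far fewer rollouts than reinforcement learning.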
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.