Dear friends,
In last week’s letter, I explained why effective agentic AI development needs a disciplined evals and error analysis process, and described an approach to performing evals. This week, I’d like to summarize the core ideas behind error analysis and describe some best practices. Given the rapid pace of improvement in LLMs, when error analysis points to a problem, you have more options for addressing it than ever before. Let me explain.
Specifically, it is fine to start by reading one or just a handful of traces informally to get a sense of what might be going wrong. For example, if you see that the web search query terms in your Deep Researcher — step (i) above — frequently make no sense, that points you to an initial area to focus your improvement efforts on. As the system matures, you can move incrementally toward more rigorous error analysis. For example, you might eventually end up with a regularly refreshed dataset of thousands of examples where performance is poor, and carry out rigorous evaluations that show exactly what percentage of the time each of steps (i) through (iv) contributed to problems with the final output, as well as the specific ways those steps fell short.
This type of analysis is extremely useful for deciding where to focus your efforts to improve the overall agentic workflow’s performance!
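As a concrete illustration (not from the letter itself), here is a minimal sketch of what such a tally might look like in code, assuming you have hand-labeled failing traces with the pipeline step judged responsible; the step names, field names, and data layout are hypothetical.

```python
from collections import Counter

# Hypothetical labels from manual error analysis of a Deep Researcher agent.
# Each trace that produced a poor final output is annotated with the step
# (i)-(iv) judged responsible; step names here are illustrative only.
annotated_traces = [
    {"trace_id": "t001", "failed_step": "i_web_search_query"},
    {"trace_id": "t002", "failed_step": "iii_synthesis"},
    {"trace_id": "t003", "failed_step": "i_web_search_query"},
    # ... thousands more in a mature evals process
]

def error_breakdown(traces):
    """Return the fraction of failing traces attributed to each step."""
    counts = Counter(t["failed_step"] for t in traces)
    total = len(traces)
    return {step: n / total for step, n in counts.most_common()}

if __name__ == "__main__":
    for step, frac in error_breakdown(annotated_traces).items():
        print(f"{step}: {frac:.0%} of failures")
```

A breakdown like this makes it obvious which step deserves attention first, and refreshing the labeled dataset regularly keeps the picture current as the system changes.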
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
“Governing AI Agents,” a short course built in collaboration with Databricks, shows you how to integrate data governance into agentic workflows to handle data safely, securely, and accurately. Enroll for free
News
Reasoning Without “Thinking”
Reasoning models typically learn to undertake a separate process of “thinking” through their output before they produce a final response. Ant Group built a top non-reasoning model that can take similar steps as part of its immediate response.
What’s new: Ant Group, an affiliate of Alibaba and owner of the online payments provider Alipay, released Ling-1T, a huge, open, non-reasoning model that outperforms both open and closed counterparts.
How it works: The team emphasized chain-of-thought reasoning in both the pretraining and fine-tuning phases of development, but it didn't train the model to undertake a separate reasoning, or thinking, process before producing its final output. This means the model can reason selectively depending on the input.
Results: In Ant Group’s tests, Ling-1T generally outperformed three top non-reasoning models: DeepSeek-V3.1-Terminus (thinking mode disabled), Moonshot Kimi-K2-Instruct, and OpenAI GPT-5 (thinking mode disabled), as well as Google Gemini 2.5 Pro set to minimum thinking (128 tokens).
Yes, but: The team published results on only one agentic benchmark and acknowledges limited performance in this area. It says it will improve agentic performance in future releases.
Behind the news: Concurrently with Ling-1T, Ant Group released a finished version of its 1 trillion-parameter reasoning model, Ring-1T, which was available previously as a preview. While Ling-1T’s performance exceeded that of top non-reasoning models, Ring-1T achieved second-place performance relative to reasoning models on almost every benchmark tested.
Why it matters: Ling-1T generally outperforms the mighty Kimi K2 and closes the gap between open and closed non-reasoning models. An enormous parameter count and pretraining that emphasized chains of thought appear to have been key factors in this accomplishment. That training primes Ling-1T to generate a chain of thought before it concludes a response, though not in a separate reasoning stage, blurring the line between reasoning and non-reasoning models.
We’re thinking: Two years ago, weights for Ling-family models were closed, but in the past year Ant Group has released open weights for several. With consistent effort and investment, Ling has gone from a family that few had heard of to challenging the top dogs.
MCP Poses Security Risks
The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows.
What’s new: Golan Yosef at Pynt, an API security firm, analyzed security risks of Model Context Protocol (MCP) servers. The work shows that when systems use multiple MCP servers, vulnerabilities rise rapidly.
How it works: MCP’s flexible, modular, dynamic design is a double-edged sword. It supports open-ended agentic interactions, but those very qualities make MCP servers vulnerable to exploitation. The study assessed security risks across more than 280 popular servers.
Results: The study identified widespread patterns of vulnerability that compound as systems add MCP servers.
Behind the news: Anthropic launched MCP in November 2024, and OpenAI and Microsoft adopted it by spring 2025. Despite its lax security, the protocol now connects to over 6,000 servers. Authentication remained optional until March, when the OAuth 2.1 authorization framework was added. The change prevents unauthorized access to MCP servers, but it doesn’t prevent malicious or malformed data from flowing between servers and triggering unintended actions.
Why it matters: Securing individual MCP servers is important but not sufficient, because vulnerabilities can emerge from interactions among servers. Adding more servers can make a system more agentic, but it also compounds vulnerabilities. The study suggests that developers mitigate this “compositional risk” by using only the servers they need, constraining what each one is allowed to do, and testing transfers of data among them.
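To make the mitigation advice concrete, here is a minimal sketch (not from the study, and not part of any MCP SDK) of one way to constrain what each server is allowed to do: route every tool call through an explicit allowlist. The server names, tool names, and guard function are hypothetical; real MCP clients expose their own configuration mechanisms.

```python
# Hypothetical guard layer for agent tool calls. It only illustrates the
# "use only the servers you need and constrain what each one may do" advice.
ALLOWED_TOOLS = {
    "filesystem": {"read_file"},   # no write or delete
    "web_search": {"search"},      # search only, no arbitrary page fetch
}

class ToolCallBlocked(Exception):
    """Raised when an agent tries a server or tool outside the allowlist."""

def guard_tool_call(server: str, tool: str, arguments: dict) -> dict:
    """Reject calls to servers or tools that are not explicitly allowlisted."""
    if server not in ALLOWED_TOOLS or tool not in ALLOWED_TOOLS[server]:
        raise ToolCallBlocked(f"Blocked call: {server}.{tool}")
    # In a real system, forward the call to the MCP client here.
    return {"server": server, "tool": tool, "arguments": arguments}

# Example: this call passes the guard...
guard_tool_call("web_search", "search", {"query": "MCP security"})
# ...while this one would raise ToolCallBlocked:
# guard_tool_call("filesystem", "delete_file", {"path": "/tmp/x"})
```

A guard like this addresses only one layer; testing how data flows between servers, as the study recommends, still requires system-level review.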
We’re thinking: Securing individual components is a tough task in its own right, but systems of MCP components must be secured at the system level.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered Anthropic’s launch of Skills to customize Claude and the update of its small model Haiku to version 4.5. Subscribe today!
California Builds AI Regulatory Regime
In the absence of national laws that specifically regulate AI in the United States, California moved to regulate the technology within its own borders, passing four bills in less than a month.
What’s new: Governor Gavin Newsom signed into law SB 53, which requires large AI developers to disclose their safety protocols. In addition, SB 243 regulates chatbots, AB 316 makes developers liable for the actions of autonomous systems they build, and AB 853 requires AI-generated media to be labeled clearly.
How it works: Together, the bills don’t ban any particular applications outright or restrict AI development, but they require extensive disclosures, either to the state or directly to users. Some took effect immediately while others, such as SB 243, will phase in by January 2027.
What they’re saying: Reaction among AI developers has been mixed. SB 53 drew the loudest and most widely varied commentary.
Behind the news: SB 53 modifies parts of SB 1047, which Governor Newsom vetoed in 2024 after opposition from the tech community. That bill would have required third-party audits and made companies liable for the uses of their models. Recently, Newsom also vetoed SB 7, which would have required employers to notify employees and applicants if AI systems were used to make employment decisions like hiring and firing.
Why it matters: California is the largest U.S. state by population and economy, as well as home to many of the world’s most prominent tech companies and startups, including Google, OpenAI, and Anthropic. These laws will affect users of California-based tech worldwide, along with companies that do business in the state.
We’re thinking: While these laws are better for users, innovators, and businesses than the vetoed SB 1047, some of them perpetuate a major mistake of that legislation by placing regulatory burdens on models rather than applications. A model’s potential applications are unknown until someone implements them, and it makes no sense to limit — or burden with disclosure requirements — the good it might do. Applications, on the other hand, bring verifiable benefits and harms, and society would do well to limit the harms.
Better Agentic Prompts Automatically
Honing an agent’s prompt can yield better results than fine-tuning the underlying large language model via reinforcement learning.
Results: The authors pitted custom and open-source agents whose prompts were optimized with GEPA against versions for which the underlying Qwen3-8B model was fine-tuned on a given benchmark via Group Relative Policy Optimization (GRPO). They measured both the agents’ performance and the number of agent executions required.
Yes, but: The authors compared GEPA to fine-tuning via reinforcement learning using a single, relatively small model. Questions remain about whether the results would hold for larger or different models, and how GEPA would compare to supervised fine-tuning.
Why it matters: Methodically revising prompts can help agents perform better than fine-tuning via reinforcement learning, and it requires far fewer examples and executions.
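As a rough illustration of the general idea of methodical prompt revision (not GEPA’s actual algorithm), the sketch below scores a prompt on a small dev set, asks an LLM to propose a revision based on the failures, and keeps whichever prompt scores higher. The call_llm and run_agent helpers are hypothetical stand-ins for your own model and agent harness.

```python
# Minimal sketch of iterative prompt revision. Illustrative only; it does not
# reproduce GEPA. call_llm and run_agent are hypothetical stand-in stubs.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM API you use; returns generated text."""
    raise NotImplementedError

def run_agent(system_prompt: str, task: dict) -> bool:
    """Stand-in that runs the agent on one task and returns pass/fail."""
    raise NotImplementedError

def score(system_prompt: str, dev_set: list[dict]) -> float:
    """Fraction of dev-set tasks the agent passes with this prompt."""
    return sum(run_agent(system_prompt, t) for t in dev_set) / len(dev_set)

def revise_prompt(prompt: str, dev_set: list[dict], rounds: int = 5) -> str:
    """Repeatedly ask an LLM to rewrite the prompt, keeping the best scorer."""
    best_prompt, best_score = prompt, score(prompt, dev_set)
    for _ in range(rounds):
        failures = [t for t in dev_set if not run_agent(best_prompt, t)]
        candidate = call_llm(
            "Here is an agent's system prompt and tasks it failed.\n"
            f"Prompt:\n{best_prompt}\n\nFailed tasks:\n{failures}\n\n"
            "Rewrite the prompt to fix these failures. Return only the prompt."
        )
        candidate_score = score(candidate, dev_set)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt
```

Each round here costs only a handful of agent executions on the dev set, which is why approaches in this spirit can need far fewer rollouts than reinforcement learning.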
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.