Dear friends,
The AI world has become incredibly noisy. Social media, traditional media, and an army of marketers produce a cacophony of hype and content that is often secretly a sales pitch for their products. Good ideas are buried in the noise, and it is hard to figure out what’s worth your time to learn. I want to explain plainly how we decide what to teach at DeepLearning.AI.
AI Engineering has important fundamentals, such as (i) using coding agents well, (ii) key building blocks like evals, error analysis, agentic workflows, and guardrails, and (iii) adjacent skills such as making basic product decisions or iterating quickly to build 0-to-1 products. Mastering these — sometimes best illustrated through a specific vendor’s offerings — matters more than mastering any one vendor’s tools. (At the same time, mastering multiple vendors’ tools lets you use them efficiently and has value, too.)
Many companies reach out and sometimes offer payment to teach with them, and we do consider suggestions of course topics and partners. But we prioritize what courses to teach and who to work with based only on what we think is best for learners. The engineers who built a tool’s advanced capabilities are often the people most qualified to share how it works, so I am deeply grateful to the many partners who have joined us to serve learners together.
Keep building!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Learn to add voice to your AI agents and applications using three practical patterns: embed voice in an app, layer it onto an existing agent, or give your agent a tool to place outbound phone calls. Enroll for free
News
GPT-5.6 Lands in Limbo
OpenAI announced a preview of its GPT-5.6 family, including a top-tier model comparable to Claude Mythos 5, but so far it’s available only to users selected by the U.S. government.
What’s new: OpenAI launched three closed vision-language models that descend in price and performance from GPT-5.6 Sol, the most capable model, to the mid-tier GPT-5.6 Terra, and the fast and less-expensive GPT-5.6 Luna. They include safeguards to deny access to potentially dangerous biological, chemical, and cybersecurity information. All three models, as well as versions in which the safeguards are relaxed, are available to a limited number of organizations the U.S. government has approved. The company promises a wider release in the next few weeks.
How it works: Following its usual practice, OpenAI revealed little about how it built the GPT-5.6 models but detailed the guardrails around them.
Performance: In OpenAI’s tests, the GPT-5.6 models achieved strong performance in coding, cybersecurity, and biology, but independent confirmation is scarce. Model Evaluation and Threat Research (METR), a nonprofit test lab, published an inconclusive evaluation of GPT-5.6 Sol’s ability to act autonomously. Tests by the nonprofit biosecurity outfit SecureBio are the only independent results available and indicate an unusually high degree of biology knowledge.
Behind the news: The U.S. government lately has begun to control the launch of top-performing AI models. OpenAI said it previewed the three models and their capabilities to the U.S. government before it launched the models. At the government’s request, OpenAI limited the launch to around 20 government-approved organizations. OpenAI did not disclose their names or what kinds of organizations they were. In its blog post, the company said this step was temporary and that it does not want government-controlled access to become usual. The same day that OpenAI announced GPT-5.6, the U.S. government granted Anthropic permission to offer its Claude Mythos 5 model to roughly 100 companies and federal agencies, two weeks after it forced Anthropic to suspend Claude Mythos 5 and Claude Fable 5 for all customers. Days later, Anthropic was able to restore Claude Fable 5.
Why it matters: The GPT-5.6 family brings concerns about AI’s security implications to the lower-priced tier. Cheaper, faster models used to draw less scrutiny because they were less capable and speed was at a premium. By extending safeguards once reserved for the most advanced models even to GPT-5.6 Luna, OpenAI is making life more difficult for developers of high-volume services. Engineers who work on legitimate applications, such as verifying codebase vulnerabilities or chemistry lab results, may now encounter refusals, added latency from paused output, or even account-level reviews.
We’re thinking: OpenAI said it is working with the White House on “a repeatable process for future model releases.” We hope this process grows more transparent and more predictable, and that it allows wider access than the launches of Claude Mythos 5 and GPT-5.6.
Fugu Blends Models Task by Task
Models that orchestrate the activities of other models and agents achieved state-of-the-art performance on a variety of benchmarks, outperforming the best individual models working alone.
What’s new: The Tokyo-based research lab Sakana AI released two models that delegate tasks to other models and agents under a single API. Fugu is designed for discrete tasks like basic coding and chat, while Fugu-Ultra is designed for long-running tasks like extensive coding and research. The two systems deliver performance comparable to Claude Mythos 5 and GPT-5.6 Sol without being dependent on a particular model.
How it works: A technical paper describes Fugu and Fugu-Ultra as members of a model family trained to orchestrate agentic workers. Fugu emphasizes speed and Fugu-Ultra performance. The two models can call a wide range of LLMs: undisclosed open models; closed models from Anthropic, Google, and OpenAI, including Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5; and instances of Fugu or Fugu-Ultra themselves.
Performance: Fugu-Ultra achieved state-of-the-art performance in multiple tests of coding, reasoning, and scientific knowledge performed by Sakana. In other categories, it fell just behind models that aren’t currently commercially available, including Claude Fable 5. Fugu and Fugu-Ultra also showed strong performance at long-context reasoning and recall.
Behind the news: Sakana previously pursued other routes to combine AI models and benefit from their collective intelligence, including model merging. However, agentic orchestration is its most successful approach yet. OpenRouter recently presented Fusion, showing that a combination of leading models can outperform models working alone and that a judicious blend of lower-priced models can attain near-SOTA performance, offering more bang for the buck.
Why it matters: The U.S. government’s recent moves to restrict access to Claude Fable 5 and GPT-5.6 have raised interest in models that orchestrate other models and agents, as developers and organizations seek to reduce their dependence on any one provider or nation. Even before Anthropic withdrew widespread access to Claude Fable 5, many organizations were displeased by the company’s new data retention requirements. If developers have full control of the providers in their orchestrator’s model pool, they can turn models on and off to match the sensitivity of the information being handled, to reduce costs, or to serve any other operational function.
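To make the idea of a controllable model pool concrete, here is a minimal Python sketch of how an operator might filter the pool per request. It is not Sakana’s API or implementation. Every name in it (`PoolModel`, `eligible_models`, the model labels, prices, and the `retains_data` flag) is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolModel:
    """One entry in a hypothetical orchestrator model pool."""
    name: str                       # placeholder label, not a real model ID
    provider: str                   # placeholder provider name
    cost_per_million_tokens: float  # illustrative price in dollars
    retains_data: bool              # assumed flag: provider stores prompts
    enabled: bool = True            # operators can switch a model off


def eligible_models(pool, sensitive, max_cost=None):
    """Return the enabled models allowed for a request, cheapest first."""
    chosen = []
    for model in pool:
        if not model.enabled:
            continue  # switched off by the operator
        if sensitive and model.retains_data:
            continue  # keep sensitive data away from retaining providers
        if max_cost is not None and model.cost_per_million_tokens > max_cost:
            continue  # over the per-request budget
        chosen.append(model)
    return sorted(chosen, key=lambda m: m.cost_per_million_tokens)


# Illustrative pool; all entries are made up.
POOL = [
    PoolModel("frontier-a", "provider-a", 15.0, retains_data=True),
    PoolModel("frontier-b", "provider-b", 10.0, retains_data=False),
    PoolModel("open-small", "self-hosted", 0.5, retains_data=False),
    PoolModel("legacy-c", "provider-c", 3.0, retains_data=False, enabled=False),
]

if __name__ == "__main__":
    for model in eligible_models(POOL, sensitive=True, max_cost=12.0):
        print(model.name)
```

In this sketch, a sensitive request with a budget cap skips both the data-retaining model and the disabled one, leaving the orchestrator to choose among the rest.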
We’re thinking: Model orchestration is in many ways the natural outgrowth of agentic engineering; it’s just operating at a higher level of abstraction, switching between multiple models in addition to tools and subagents. This permits a new form of competition; instead of using a single AI company’s API to build applications, developers can put other companies’ models to work and become the API provider themselves.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered GPT-5.6’s limited rollout to approved partners and restored access to Anthropic’s Claude Fable 5. Subscribe today!
Microsoft Strikes Out on Its Own
Microsoft, once OpenAI’s exclusive partner and still a major reseller of other companies’ AI models, built its own reasoning model from scratch.
What’s new: Microsoft introduced MAI-Thinking-1, its first reasoning language model that was not distilled or fine-tuned from a model built by a different developer. Microsoft describes MAI-Thinking-1 as a medium-sized model comparable to Claude Sonnet 4.6. It leads a family of seven MAI models unveiled at Microsoft’s Build conference, including MAI-Code-1-Flash, a small coding model available in GitHub Copilot and Visual Studio Code.
How it works: Microsoft built MAI-Thinking-1 by pretraining a base model, fine-tuning separate copies into specialized models to use as teachers, distilling them into a student model, and teaching the student to reason via reinforcement learning. The pretraining and midtraining data comprised 30 trillion and 3.55 trillion tokens respectively. The data was primarily human-generated and included over 50 percent code. Post-training data included more than 5 million STEM questions and more than 160,000 coding questions.
Results: According to Microsoft’s tests, MAI-Thinking-1 is strongest on mathematics and trails other models (including those from Anthropic, DeepSeek, and OpenAI) on graduate-level science and agentic coding. On AIME 2025, which tests the ability to solve competition math problems, MAI-Thinking-1 (97.0 percent) topped Claude Sonnet 4.6 (95.6 percent) and DeepSeek V3.2 (93.1 percent) but trailed Claude Opus 4.6 (99.8 percent). No independent evaluations or comparisons to more recent models have been published yet.
Behind the news: Microsoft has long relied on OpenAI’s models to power products such as Copilot and built earlier models of its own by drawing on those of its rivals. Its Phi family distilled OpenAI’s GPT-4 and GPT-5 models, and MAI-DS-R1 was a fine-tuned version of DeepSeek-R1. That changed in April 2026, when Microsoft and OpenAI amended their partnership, making Microsoft’s license to OpenAI’s models non-exclusive and freeing OpenAI to serve its products with any cloud provider.
Why it matters: Teams on the Microsoft stack can reach a capable reasoning model without adding a vendor or moving data out of the tools they already use. That may attract users among Microsoft’s large base of Azure and Copilot customers. Microsoft says it is planning more models based on the data pipeline created for MAI-Thinking-1 and its siblings.
We’re thinking: The company set out to train a reasoning model on fully attributable data but still drew heavily on the web. Of course everyone else does, too. Much of the success of large language models to date has been built on web data.
Better Reward Models for Robots
When you’re training a robot via reinforcement learning, a handcrafted reward function is labor-intensive to build but often dispenses rewards more effectively than a general-purpose reward model based on a vision-language model. Researchers built reward models that narrowed the gap.
What’s new: Tony Lee, Andrew Wagenmaker, Karl Pertsch, and colleagues at Stanford University and UC Berkeley built RoboReward, a family of vision-language reward models in 4-billion-parameter and 8-billion-parameter sizes. These models assign rewards for a wide range of tasks performed by many types of robots. The authors also provide a dataset and benchmark for training and evaluating vision-language reward models.
Key insight: Popular text-video datasets of robot actions mainly include examples of successful actions, which makes it hard for models to learn the difference between success and failure. But it’s possible to produce negative examples by relabeling positive examples (for example, given an example that includes a video of a robot putting a spoon in a pot, replace the command “put the spoon in the pot” with “put the spoon by the pot”). It’s also possible to produce incomplete attempts by trimming videos of successful actions.
How it works: The authors built a diverse robot-action dataset in which each example included a command, a video of a robot responding to the command, and a progress score from 1 (failed) to 5 (completed). They gathered videos from two datasets that depict single-arm, dual-arm, and humanoid robots. They standardized the task descriptions, augmented the data with negative examples, and assigned progress scores.
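The two augmentations described above, relabeling and trimming, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ code. The `Example` class and the helper names are hypothetical, and the rule that maps a trimmed clip to a progress score is an assumption, since the text does not say how the authors assigned scores.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Example:
    command: str       # natural-language instruction
    frames: List[str]  # stand-in for video frames (here, file names)
    progress: int      # 1 (failed) to 5 (completed)


def relabel_negative(example, wrong_command):
    """Pair a successful video with a command it does not satisfy."""
    return replace(example, command=wrong_command, progress=1)


def trim_partial(example, keep_fraction):
    """Keep the start of a successful video as an incomplete attempt."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must be between 0 and 1, exclusive")
    n_keep = max(1, int(len(example.frames) * keep_fraction))
    # Assumed scoring rule (not from the paper): scale the score with the
    # kept fraction and cap it at 4, because the task was not finished.
    progress = min(4, 1 + int(keep_fraction * 4))
    return replace(example, frames=example.frames[:n_keep], progress=progress)


success = Example(
    command="put the spoon in the pot",
    frames=[f"frame_{i}.png" for i in range(10)],
    progress=5,
)
negative = relabel_negative(success, "put the spoon by the pot")
partial = trim_partial(success, 0.5)

if __name__ == "__main__":
    print(negative.command, negative.progress)
    print(len(partial.frames), partial.progress)
```

A real pipeline would also need to check that each generated negative truly fails the new command, a validation step this sketch leaves out.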
Results: The RoboReward models estimated rewards for examples in RoboRewardBench more accurately than the robotics model Gemini Robotics-ER 1.5 and generalist models including GPT-5. In a real-world robot demonstration, training based on rewards from RoboReward models resulted in better performance than training via previous reward models, though not better than training via human-assigned rewards.
Why it matters: Vision-language reward models have been promising in training robots, and this approach makes them much more effective. By augmenting successful demonstrations with validated failures, the authors trained a general-purpose reward model that works across various types of robots and tasks, alleviating the need for task-specific engineering.
We’re thinking: Releasing a benchmark and pretrained reward models invites the community to improve reward functions directly rather than hoping they emerge as a side effect of better generalist models.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.