Dear friends,
At the Artificial Intelligence Action Summit in Paris this week, U.S. Vice President J.D. Vance said, “I’m not here to talk about AI safety. ... I’m here to talk about AI opportunity.” I’m thrilled to see the U.S. government focus on opportunities in AI. Further, while it is important to use AI responsibly and try to stamp out harmful applications, I feel “AI safety” is not the right terminology for addressing this important problem. Language shapes thought, so using the right words is important. I’d rather talk about “responsible AI” than “AI safety.” Let me explain.
“AI safety” presupposes that AI, the underlying technology, can be unsafe. I find it more useful to think about how applications of AI can be unsafe.
Now, safety isn’t always a function only of how something is used. An unsafe airplane is one that, even in the hands of an attentive and skilled pilot, has a large chance of mishap. The risk factors are associated with the construction of the aircraft rather than merely its application, so we definitely should strive to build safe airplanes (and make sure they are operated responsibly)! Similarly, we want safe automobiles, blenders, dialysis machines, food, buildings, power plants, and much more.
If we shift the terminology for AI risks from “AI safety” to “responsible AI,” we can have more thoughtful conversations about what to do and what not to do.
I believe the 2023 Bletchley AI Safety Summit slowed down European AI development — without making anyone safer — by wasting time considering science-fiction AI fears rather than focusing on opportunities. Last month, at Davos, business and policy leaders also had strong concerns about whether Europe can dig itself out of the current regulatory morass and focus on building with AI. I am hopeful that the Paris meeting, unlike the one at Bletchley, will result in acceleration rather than deceleration.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
Understand and implement the attention mechanism, a key element in transformer-based LLMs, using PyTorch. In this course, StatQuest’s Josh Starmer explains the core ideas behind attention mechanisms, the algorithm itself, and a step-by-step breakdown of how to implement them in PyTorch. Enroll now
News
Agents Go Deep
OpenAI introduced a state-of-the-art agent that produces research reports by scouring the web and reasoning over what it finds.
What’s new: OpenAI’s deep research responds to users’ requests by generating a detailed report based on hundreds of online sources. The system generates text output, with images and other media expected soon. Currently the agent is available only to subscribers to ChatGPT Pro, but the company plans to roll it out to users of ChatGPT Plus, Team, and Enterprise.
How it works: Deep research is an agent that uses OpenAI’s o3 model, which is not yet publicly available. The model was trained via reinforcement learning to use a browser and Python tools, much as o1 learned to reason via reinforcement learning. OpenAI has not yet released detailed information about how it built the system.
Results: On a benchmark of 3,000 multiple-choice and short-answer questions that cover subjects from ecology to rocket science, OpenAI’s deep research achieved 26.6 percent accuracy. In comparison, DeepSeek-R1 (without web browsing or other tool use) achieved 9.4 percent accuracy, and o1 (also without tool use) achieved 9.1 percent. On GAIA, a benchmark of questions designed to be difficult for large language models that lack access to additional tools, deep research achieved 67.36 percent accuracy, exceeding the previous state of the art of 63.64 percent.
Behind the news: OpenAI’s deep research follows a similar offering of the same name that Google launched in December. A number of open source teams have built research agents that work in similar ways. Notable releases include a Hugging Face project that attempted to replicate OpenAI’s work (not including training) in 24 hours and achieved 55.15 percent accuracy on GAIA, and gpt-researcher, which implemented agentic web search in 2023, long before Google and OpenAI launched their agentic research systems.
Why it matters: Reasoning models like o1 and o3 made a splash not just because they delivered superior results but also because of the impressive reasoning steps they took to produce those results. Combining that ability with web search and tool use enables large language models to formulate better answers to difficult questions, including those whose answers aren’t in the training data or change over time.
We’re thinking: Taking as much as 30 minutes of processing to render a response, OpenAI’s deep research clearly illustrates why we need more compute for inference.
Google Joins AI Peers In Military Work
Google revised its AI principles, reversing previous commitments to avoid work on weapons, surveillance, and other military applications beyond non-lethal uses like communications, logistics, and medicine.
What’s new: Along with releasing its latest Responsible AI Progress Report and an updated AI safety framework, Google removed key restrictions from its AI principles. The new version omits a section of the previous document titled “Applications we will not pursue.” The deleted text pledged to avoid “technologies that cause or are likely to cause overall harm” and, where the technology risks doing harm, to “proceed only where we believe that the benefits substantially outweigh the risks” with “appropriate safety constraints.”
How it works: Google’s AI principles no longer prohibit specific applications but promote developing the technology to improve scientific inquiry, national security, and the economy.
Behind the news: Google’s new stance reverses a commitment it made in 2018 after employees protested its involvement in Project Maven, a Pentagon AI program for drone surveillance, from which Google ultimately withdrew. At the time, Google pledged not to develop AI applications for weapons or surveillance, which set it apart from Amazon and Microsoft. Since then, the company has expanded its work in defense, building on a $1.3 billion contract with Israel. In 2024, Anthropic, Meta, and OpenAI removed their restrictions on military and defense applications, and Anthropic and OpenAI strengthened their ties with defense contractors such as Anduril and Palantir.
Why it matters: Google’s shift in policy comes as AI is playing an increasing role in conflicts in Israel, Ukraine, and elsewhere, and while global geopolitical tensions are on the rise. While Google’s previous position kept it out of military AI development, defense contractors like Anduril, Northrop Grumman, and Palantir — not to mention AI-giant peers — stepped in. The new principles recognize the need for democratic countries to take the lead in developing technology and standards for its use as well as the massive business opportunity in military AI as governments worldwide seek new defense capabilities. Still, no widely accepted global framework governs uses of AI in combat.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered a new technique for building simple yet powerful reasoning models and how o3-mini topped the AIME 2025 math leaderboard. Subscribe today!
Alibaba’s Answer to DeepSeek
While Hangzhou’s DeepSeek flexed its muscles, Chinese tech giant Alibaba vied for the spotlight with new open vision-language models.
What’s new: Alibaba announced Qwen2.5-VL, a family of vision-language models (images and text in, text out) in sizes of 3 billion, 7 billion, and 72 billion parameters. The weights of all three models are available for download on Hugging Face, each under a different license: Qwen2.5-VL-3B is free for non-commercial uses, Qwen2.5-VL-7B is free for commercial and non-commercial uses under the Apache 2.0 license, and Qwen2.5-VL-72B is free to developers that have fewer than 100 million monthly active users. You can try them out for free for a limited time in Alibaba Model Studio, and Qwen2.5-VL-72B is available via the model selector in Qwen Chat.
How it works: Qwen2.5-VL models accept up to 129,024 tokens of input according to the developer reference (other sources provide conflicting numbers) and generate up to 8,192 tokens of output. Alibaba has not released details about how it trained them.
Results: Alibaba reports Qwen2.5-VL-72B’s performance on measures that span image and text problems, parsing documents, understanding videos, and interacting with computer programs. Across 21 benchmarks, it beat Google Gemini 2.0 Flash, OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, and open competitors on 13 of them (where comparisons are relevant and available).
More models: Alibaba also introduced competition for DeepSeek and a family of small models.
Why it matters: Vision-language models are getting more powerful and versatile. Not long ago, it was an impressive feat simply to answer questions about a chart or diagram that mixed graphics with text. Now such models are paired with an agent to control computers and smartphones. Broadly speaking, the Qwen2.5-VL models outperform open and closed competitors and they’re open to varying degrees (though the data is not available), giving developers a range of highly capable choices. We’re thinking: We’re happy Alibaba released a vision-language model that is broadly permissive with respect to commercial use (although we’d prefer that all sizes were available under a standard open weights license). We hope to see technical reports that illuminate Alibaba’s training and fine-tuning recipes.
Tree Search for Web Agents
Browsing the web to achieve a specific goal can be challenging for agents based on large language models, and even for vision-language models that can process onscreen images of a browser. While some approaches address this difficulty by training the underlying model, the agent architecture can also make a difference.
What’s new: Jing Yu Koh and colleagues at Carnegie Mellon University introduced tree search for language model agents, a method that lets agents treat web interactions as a tree search. In this way, agents can explore possible chains of actions and avoid repeating mistakes.
Key insight: Some web tasks, for instance finding the price of a particular item, require a chain of intermediate actions: navigating to the right page, scrolling to find the item, matching an image of the item to the image on the page, and so on. If an agent clicks the wrong link during this process, it might lose its way. The ability to evaluate possible actions and remember previous states of web pages can help an agent correct its mistakes and choose a chain of actions that achieves its goal.
How it works: An agent based on GPT-4o attempted 200 tasks using website mockups that mimicked an online retail store, a Reddit-like forum, and a directory of classified ads. The tasks included ordering an item to be delivered to a given address, finding specific images on the forum, and posting an ad. The authors annotated each web page using the method called Set of Mark, which labels every interactive visual element with a bounding box and a numerical ID.
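The point of Set of Mark is to let a language model refer to page elements by short numeric IDs ("click [1]") rather than raw pixel coordinates. A minimal sketch of that labeling step, with invented element data standing in for real browser output:

```python
# Hypothetical sketch of Set-of-Mark-style annotation: assign each
# interactive element a numeric ID alongside its bounding box so a
# language model can act on "[id]" references. Page data is invented.
def annotate(elements):
    """Return id/label/box marks for the interactive elements only."""
    marks = []
    for i, el in enumerate(e for e in elements if e["interactive"]):
        marks.append({"id": i, "label": el["label"], "box": el["box"]})
    return marks

page = [
    {"label": "logo",       "box": (0, 0, 100, 40),     "interactive": False},
    {"label": "search box", "box": (120, 10, 400, 35),  "interactive": True},
    {"label": "buy button", "box": (500, 300, 580, 330), "interactive": True},
]
print(annotate(page))  # two marks, ids 0 and 1; the logo is skipped
```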
Results: The authors compared two agents, one that followed their search method and another that started at the same page and received the same instruction but took one action per state and never backtracked. The agents attempted 100 shopping tasks, 50 forum tasks, and 50 classified-ads tasks. The one equipped to search successfully completed 26.4 percent of the tasks, while the other agent completed 18.9 percent of the tasks.
Why it matters: Search joins reflection, planning, tool use, and multi-agent collaboration as an emerging agentic design pattern. Following many branching paths of actions enables an agent to determine the most effective set of actions to accomplish a task.
We’re thinking: Agentic design patterns are progressing quickly! In combination with computer use, this sort of search method may enable agents to execute a wide variety of desktop tasks.
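The core mechanics described above (score candidate actions, remember visited states, backtrack from dead ends) can be sketched as a best-first search. This is an illustrative toy, not the paper's implementation: a hand-built page graph and a crude scoring function stand in for real web pages and the vision-language value model.

```python
# Toy best-first tree search over web-page states. In the real agent a
# model scores states; here a string-overlap heuristic stands in for it.
import heapq
import itertools

# Hypothetical site: each page maps to the pages its links lead to.
PAGES = {
    "home": ["electronics", "clothing"],
    "electronics": ["laptop", "phone"],
    "clothing": ["shirt"],
    "laptop": [], "phone": [], "shirt": [],
}

def score(page, goal):
    """Stand-in for the model's value estimate: 1.0 at the goal,
    otherwise a crude character-overlap guess."""
    if page == goal:
        return 1.0
    return sum(a == b for a, b in zip(page, goal)) / max(len(goal), 1)

def tree_search(start, goal, budget=10):
    """Expand the most promising page first. Keeping visited states and
    whole paths on the frontier lets the agent backtrack out of a wrong
    branch instead of looping on it."""
    counter = itertools.count()  # tie-breaker so the heap never compares paths
    frontier = [(-score(start, goal), next(counter), start, [start])]
    visited = set()
    while frontier and budget > 0:
        _, _, page, path = heapq.heappop(frontier)
        if page == goal:
            return path  # the chain of actions that reaches the goal
        if page in visited:
            continue
        visited.add(page)
        budget -= 1
        for nxt in PAGES[page]:
            heapq.heappush(
                frontier,
                (-score(nxt, goal), next(counter), nxt, path + [nxt]))
    return None  # budget exhausted without reaching the goal

print(tree_search("home", "laptop"))  # ['home', 'electronics', 'laptop']
```

Note how the search can wander into the wrong branch (the heuristic briefly prefers "clothing") yet still recover, because unexplored alternatives stay on the frontier. That recovery is exactly what the baseline agent, which takes one action per state and never backtracks, cannot do.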
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.