Dear friends,
I just got back from AI Dev x NYC, the AI developer conference where our community gathers for a day of coding, learning, and connecting. The room was buzzing! It was at the last AI Dev in San Francisco that I met up with Kirsty Tan and started collaborating with her on what became our AI advisory firm AI Aspire. In-person meetings can spark new opportunities, and I hope the months to come will bring more stories about things that started at AI Dev x NYC!
One special moment for me was when Nick Thompson, moderating a panel with Miriam Vogel and me, asked about governance. I replied that the United States’ recent hostile rhetoric toward immigrants is one of the worst moves it is making, and many in the audience clapped. Nick spoke about this moment in a video.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
In Semantic Caching for AI Agents, created with Redis, you’ll learn to make AI agents that become faster and more cost-effective over time by recognizing when different queries have the same meaning. Optimize accuracy and integrate caching into your agents. Enroll now!
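The core idea of semantic caching can be shown in a few lines: store responses keyed by a query embedding, and serve a stored response when a new query is close enough in meaning to an old one. The sketch below is a toy, assuming a bag-of-words embedding and a fixed similarity threshold; production systems such as Redis use learned sentence embeddings and a vector index.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; real caches use learned sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        # Return the cached response for the most similar past query,
        # or None if nothing is similar enough.
        q = embed(query)
        scored = [(cosine(q, e), r) for e, r in self.entries]
        best = max(scored, key=lambda t: t[0], default=(0.0, None))
        return best[1] if best[0] >= self.threshold else None

cache = SemanticCache(threshold=0.8)
cache.put("what is the capital of France", "Paris")
hit = cache.get("what is the capital of France?")  # near-duplicate hits the cache
```

A cache hit skips the model call entirely, which is where the speed and cost savings come from; tuning the threshold trades hit rate against the risk of serving a wrong answer.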
News
Self-Driving Cars on U.S. Freeways
Waymo became the first company to offer fully autonomous, driverless taxi service on freeways in the United States.
What’s new: Waymo’s fleet is serving paying customers on high-speed roads in San Francisco and Los Angeles, California, and Phoenix, Arizona. The service is available to customers who have selected the Waymo app’s “freeway” preference, if the app determines that using freeways will result in a substantially faster trip.
How it works: Waymo, which operates thousands of vehicles in the San Francisco Bay Area, provided the most information about freeway service in that region. Its vehicles ply freeways across a roughly 260-square-mile area between San Francisco and San Jose, cutting ride times by as much as 50 percent.
Behind the news: Waymo has its roots in vehicles built by the Stanford Racing Team to compete in the DARPA Grand Challenge and DARPA Urban Challenge autonomous vehicle contests in the mid-2000s. Google adopted the project in 2009 and spun out Waymo as an independent company in late 2016.
Why it matters: Operating on freeways is critical for self-driving cars to function fully as alternatives to human-driven vehicles. Fully autonomous freeway driving is a significant technical step forward for Waymo, since its cars must shift smoothly from city driving to freeway driving, where conditions are less tightly controlled, and systems must plan farther ahead and react more quickly to adjust to changes at higher speed. In addition, obtaining government approval to put Waymo cars on freeways is a huge accomplishment from regulatory and social perspectives. The company managed to persuade regulators that the benefits of putting self-driving cars on freeways outweigh the potential costs, including threats to safety and public trust. Waymo’s aggressive plans for expansion suggest that this is the first of more milestones to come.
We’re thinking: Andrew still has his t-shirt from the DARPA Urban Challenge. He remembers the optimism of those days, and how much longer than early forecasts it has taken to develop roadworthy self-driving vehicles. Between Waymo’s robotaxis and Tesla’s Full Self-Driving (Supervised) capability, the question is not whether this technology will become commonplace but when.
Top Agentic Results, Open Weights
The latest open-weights large language model from Moonshot AI challenges top proprietary LLMs at agentic tasks by executing hundreds of tool calls sequentially and pausing to think between each.
What’s new: Kimi K2 Thinking and the faster Kimi K2 Thinking Turbo are trillion-parameter reasoning versions of Moonshot’s earlier LLM Kimi K2. They were fine-tuned at 4-bit (INT4) precision, so they can run at lower cost and on lower-cost hardware than other LLMs of similar size.
How it works: Rather than completing all reasoning steps before acting, Kimi K2 Thinking executes cycles of reasoning and tool use. This enables it to adjust continually depending on interim reasoning steps or results of tool calls.
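The cycle described above — reason, optionally call a tool, fold the result back in, and repeat — can be sketched as a simple agent loop. The `model_step` interface, tool registry, and action tuples below are hypothetical stand-ins for an LLM API; Moonshot's actual interface differs.

```python
# Minimal sketch of an interleaved reason-then-act loop.
# `model_step` is assumed to return ("think" | "call" | "answer", payload).

def run_agent(model_step, tools, task, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        kind, payload = model_step(history)   # model reasons over full history
        history.append((kind, payload))
        if kind == "call":
            name, args = payload
            history.append(("result", tools[name](args)))  # feed tool output back
        elif kind == "answer":
            return payload                    # model decided it is done
    return None  # step budget exhausted

# Usage with a scripted stand-in for the model:
def scripted(history):
    if any(kind == "result" for kind, _ in history):
        return ("answer", history[-1][1])     # answer with the tool's result
    return ("call", ("add", (2, 3)))          # first, request a tool call

tools = {"add": lambda args: args[0] + args[1]}
answer = run_agent(scripted, tools, "add 2 and 3")  # -> 5
```

Because each tool result lands in the history before the next reasoning step, the model can change course mid-task — the behavior that distinguishes this design from completing all reasoning up front.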
Results: Kimi K2 Thinking leads open-weights LLMs on several benchmarks and achieves state-of-the-art results on some agentic tasks. However, it generates many more tokens than most competitors to achieve comparable performance.
Yes, but: Kimi K2 Thinking used 140 million tokens to complete Artificial Analysis’ Intelligence Index evaluations, more than any other LLM tested: roughly 2.3 times the 62 million used by DeepSeek-V3.2 Exp and nearly double the 77 million used by GPT-5 Codex set to high reasoning. Running the Intelligence Index tests cost $356 for Kimi K2 Thinking, about 60 percent less than the $913 for GPT-5 set to high reasoning but roughly 9 times the $41 for DeepSeek-V3.2 Exp.
Behind the news: In July, Moonshot released the weights for Kimi K2, a non-reasoning version optimized for agentic tasks like tool use and solving problems that require multiple steps.
Why it matters: Agentic applications benefit from the ability to reason across many tool calls without human intervention. Kimi K2 Thinking is designed specifically for multi-step tasks like research, coding, and web navigation. INT4 precision enables the model to run on less expensive, more widely available chips — a boon especially in China, where access to the most advanced hardware is restricted — or at very high speeds.
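INT4 means each weight is stored as one of 16 integer levels plus a shared scale factor. The sketch below shows symmetric 4-bit quantization in its simplest form; this is a simplification for illustration — real quantizers operate per block of a weight tensor and use more careful rounding schemes.

```python
# Hedged sketch of symmetric INT4 quantization: floats -> integers in [-8, 7].

def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7 or 1.0   # largest weight maps to 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]  # 16 levels
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; each is within about scale/2 of the original.
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
```

Packing two 4-bit values per byte cuts weight memory to a quarter of 16-bit storage, which is why a trillion-parameter model becomes feasible on less exotic hardware.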
We’re thinking: LLMs are getting smarter about when to think, when to grab a tool, and when to let either inform the other. According to early reports, Kimi K2 Thinking’s ability to plan and react helps in applications from science to web browsing and even creative writing — a task that reasoning models often don’t accomplish as well as their non-reasoning counterparts.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered World Labs’ debut of Marble, a 3D world generator, and OpenAI’s new method for training sparse neural networks. Subscribe today!
Anthropic Cyberattack Report Sparks Controversy
Independent cybersecurity researchers pushed back on a report by Anthropic that claimed hackers had used its Claude Code agentic coding system to perpetrate an unprecedented automated cyberattack.
What’s new: In a blog post, Anthropic described thwarting a September campaign by hackers sponsored by the government of China, calling it the “first documented case of a large-scale cyberattack without substantial human intervention.” However, some independent researchers said that current agents are not capable of performing such nefarious feats, Ars Technica reported. Moreover, the success rate — a few successful attacks among dozens of attempts — belies Anthropic’s claim that the agentic exploit revealed newly dangerous capabilities. The lack of detail in Anthropic’s publications makes it difficult to evaluate the company’s claims fully.
Claude exploited: The hackers circumvented Claude Code’s guardrails by role-playing as employees of a security company who were testing its networks, according to Anthropic’s report.
Reasons for skepticism: Independent security researchers interviewed by Ars Technica, The Guardian, and others found a variety of reasons to question the report.
Behind the news: Hackers routinely use AI to expedite or automate their work, for instance writing more effective phishing emails or generating malicious code. In August, Anthropic highlighted the rise of “vibe hacking,” in which bad actors who have limited technical skills use AI to pursue nefarious activities previously undertaken only by more highly skilled coders. In August, Anthropic reported that it had disrupted one such effort, which involved the theft of personal data and extortion. In October, White House AI Czar David Sacks accused Anthropic of running a “sophisticated regulatory capture strategy based on fear-mongering.”
Why it matters: It stands to reason that AI can make hacking faster and more effective, just as it does many everyday activities. But Anthropic’s description of the Claude-powered agentic cyberattack it discovered is at odds with the experience of security researchers outside the company. Independent researchers have found agents relatively ineffective for automating cyberattacks and conventional methods equally or more dangerous. Security researchers are right to explore agentic AI both to perpetrate and defend against security threats, but it has not yet been found to pose the dire threat that Anthropic warns of.
We’re thinking: AI companies want to promote the power of their products, and sometimes — paradoxically — that promotion emphasizes a product’s powerful contribution to a negative outcome. Positive or negative, hype is harmful. We hope that makers of state-of-the-art models and applications based on them will find ways to drum up interest in their accomplishments — many of which are genuinely impressive and exciting! — without misleading or confusing the public. With respect to cybersecurity, AI-driven detection of security flaws makes it easier to patch them. In this way, AI helps to shift the balance of power from attackers to defenders, making computers more secure, not less.
More-Efficient Agentic Search
Large language models may have learned knowledge that’s relevant to a given prompt, but they don’t always recall it consistently. Fine-tuning a model to search its parameters as though it were searching the web can help it find knowledge in its own weights.
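The self-search idea can be sketched as a rollout in which the model plays both the searcher and the "search engine," answering its own queries from its parameters instead of hitting the web. In the sketch below, `generate` is a hypothetical stand-in for an LLM call, and the tag format is illustrative rather than the paper's exact one.

```python
import re

def self_search_rollout(generate, question, max_searches=3):
    transcript = f"Question: {question}\n"
    for _ in range(max_searches):
        step = generate(transcript)  # model emits reasoning plus a tag
        transcript += step + "\n"
        query = re.search(r"<search>(.*?)</search>", step)
        if not query:
            break  # no further search requested
        # Instead of calling a real engine, ask the same model to "recall"
        # an answer to its own query from its own weights.
        recalled = generate(f"Recall facts about: {query.group(1)}\n")
        transcript += f"<information>{recalled}</information>\n"
    answer = re.search(r"<answer>(.*?)</answer>", transcript)
    return answer.group(1) if answer else None

# Usage with a scripted stand-in for the model:
def fake_llm(prompt):
    if prompt.startswith("Recall facts about:"):
        return "Paris is the capital of France."
    if "<information>" in prompt:
        return "<answer>Paris</answer>"
    return "I should look this up. <search>capital of France</search>"

result = self_search_rollout(fake_llm, "What is the capital of France?")
```

Because no network call is made, rollouts like this are cheap to run at reinforcement-learning scale, and the same tag format lets a trained model swap in a real search engine at inference time.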
Results: The team evaluated this approach, called SSRL (self-search reinforcement learning), on six question-answering benchmarks (Natural Questions, HotpotQA, and four others) and compared it to methods that use external search engines. Models trained via SSRL tended to outperform baselines that rely on search. The skills learned via SSRL also improved the model’s performance when it was equipped to call an external search engine.
Why it matters: The gap between training in a simulation and performance in the real world can be a challenge for AI agents based on LLMs. In this case, LLMs that were trained to simulate web searches were able to perform actual web searches more effectively. This result demonstrates that, for knowledge-based tasks, an LLM’s own parameters can serve as a cost-effective, high-fidelity simulator.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.