Dear friends,
The buzz over DeepSeek this week crystallized, for many people, a few important trends that have been happening in plain sight: (i) China is catching up to the U.S. in generative AI, with implications for the AI supply chain. (ii) Open weight models are commoditizing the foundation-model layer, which creates opportunities for application builders. (iii) Scaling up isn’t the only path to AI progress. Despite the massive focus on and hype around processing power, algorithmic innovations are rapidly pushing down training costs.
The business of training foundation models and selling API access is tough. Many companies in this area are still looking for a path to recouping the massive cost of model training. The article “AI’s $600B Question” lays out the challenge well (but, to be clear, I think the foundation model companies are doing great work, and I hope they succeed). In contrast, building applications on top of foundation models presents many great business opportunities. Now that others have spent billions training such models, you can access these models for mere dollars to build customer service chatbots, email summarizers, AI doctors, legal document assistants, and much more.
Scaling up isn’t the only path to AI progress. There’s been a lot of hype around scaling up models as a way to drive progress. To be fair, I was an early proponent of scaling up models. A number of companies raised billions of dollars by generating buzz around the narrative that, with more capital, they could (i) scale up and (ii) predictably drive improvements. Consequently, there has been a huge focus on scaling up, as opposed to a more nuanced view that gives due attention to the many different ways we can make progress. Driven in part by the U.S. AI chip embargo, the DeepSeek team had to devise many optimizations to run on less-capable H800 GPUs rather than H100s, ultimately leading to a model trained (omitting research costs) for under $6 million of compute.
Keep learning,
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Discover Computer Use, which enables agents based on large language models to use desktop apps. In this free course, you’ll learn to apply Anthropic models to analyze on-screen images and control computers via mouse clicks and keystrokes. Start today!
News
Reinforcement Learning Heats Up
Reinforcement learning is emerging as an avenue for building large language models with advanced reasoning capabilities.
What’s new: Two recent high-performance models, DeepSeek-R1 (and its variants including DeepSeek-R1-Zero) and Kimi k1.5, learned to improve their generated lines of reasoning via reinforcement learning. OpenAI’s o1 pioneered this approach last year.
Reinforcement learning (RL) basics: RL rewards or punishes a model for performing particular actions or achieving certain objectives. Unlike supervised and unsupervised learning, which compare the model's output to a known ground truth, RL doesn’t explicitly tell a model what it should output. Instead, the model starts out behaving randomly and discovers desired behaviors by earning rewards for its actions. This makes RL especially popular for training machine learning models that play games or control robots.
How it works: To improve the chain of thought (CoT) generated by a large language model (LLM), reinforcement learning encourages the model to generate correct solutions to math, coding, science, and other problems that have known solutions. Unlike typical LLM training, in which the model simply generates the next token of its output and receives feedback token by token, this method rewards the model for generating a sequence of reasoning steps that leads to an accurate conclusion, even if doing so requires generating many intermediate tokens between the prompt and the response (to plan an outline, check the conclusion, or reflect on the approach), without explicit training on which reasoning steps to take.
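In this style of training, the reward signal is often just a check on the final answer. Here is a minimal sketch in Python of an outcome-based reward for chain-of-thought RL; the "Answer:" convention and the policy-update note are illustrative assumptions, not the actual training code of DeepSeek-R1, Kimi k1.5, or o1.

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Score a full chain-of-thought completion by its final answer only.

    The model may emit any intermediate reasoning tokens; only the conclusion
    is compared to the known solution. (Assumes the hypothetical convention
    that the completion ends with a line like "Answer: <value>".)
    """
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0  # no parsable final answer, no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# During RL training, many completions are sampled per problem, each is scored
# with a reward like this, and the LLM's weights are updated (for example, with
# a policy-gradient method) to make higher-reward completions more likely.
```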
Behind the news: While RL has been a staple technique for training models to play games and control robots, its role in developing LLMs had largely been confined to aligning them with human preferences. Reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF, used in Anthropic’s Constitutional AI) were the primary methods for encouraging LLMs to align with human preferences prior to the development of direct preference optimization.
Why it matters: Reinforcement learning has surprising utility in training large language models to reason. As researchers press models into service on more complex tasks (math, coding, animated graphics, and beyond), reinforcement learning is emerging as an important path to progress.
We’re thinking: Less than three years ago, reinforcement learning looked too finicky to be worth the trouble. Now it’s a key direction in language modeling. Machine learning continues to be full of surprising twists!
Computer Use Gains Momentum
OpenAI introduced an AI agent that performs simple web tasks on a user’s behalf.
What’s new: Operator automates online actions like buying goods, booking tickets, and completing forms by navigating websites in a browser-like environment within ChatGPT. It’s available on desktops as a research preview for subscribers to ChatGPT Pro ($200 per month). OpenAI promises broader availability to come, as well as API access to the underlying model and improved ability to coordinate multi-step tasks like scheduling meetings across calendars from different vendors.
How it works: Operator uses a new model called Computer-Using Agent (CUA) that accepts text input and responds with web actions.
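OpenAI hasn’t released the underlying model’s API yet, so the following is only a rough, hypothetical sketch of the perceive-act loop a computer-use agent of this kind runs; the Action class and the browser/model methods are placeholders, not OpenAI’s interface.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                               # e.g. "click", "type", "scroll", "ask_user", "done"
    payload: dict = field(default_factory=dict)

def run_computer_use_agent(task: str, browser, model, max_steps: int = 20) -> None:
    """Loop: observe the screen, let the model choose an action, execute it in the browser."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()                       # observe the current page as pixels
        action = model.next_action(task, screenshot, history)   # model decides the next web action
        if action.kind == "done":
            break
        if action.kind == "ask_user":
            # Hand control back to the person for sensitive steps such as logins or payments.
            action.payload["user_input"] = input(action.payload.get("question", "Input needed: "))
        browser.execute(action)                                 # perform the click or keystrokes
        history.append(action)
```

The loop reflects the general pattern of computer-use agents described elsewhere in this issue: analyze on-screen images, issue mouse and keyboard actions, and defer to the user when confirmation is needed.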
Behind the news: Operator rides a wave of agents designed to automate everyday tasks. Last week, OpenAI introduced ChatGPT Tasks, which lets users schedule reminders and alerts but doesn’t support web interaction. (Early users complained that Tasks was buggy and required overly precise instructions.) Anthropic’s Computer Use focuses on basic desktop automation, while DeepMind’s Project Mariner is a web-browsing assistant built on Gemini 2.0. Perplexity Assistant automates tasks in mobile apps, such as booking Uber rides, on Android phones.
Why it matters: In early reports, users said Operator sometimes was less efficient than a human performing the same tasks. Nevertheless, agentic AI is entering the consumer market, and Operator is poised to give many people their first taste. It’s geared to provide AI assistance for an endless variety of personal and business uses, and, like ChatGPT was for other developers of LLMs, it’s bound to serve as a template for next-generation products.
We’re thinking: Computer use is maturing, and the momentum behind it is palpable. It’s a capability AI developers should have in their toolbox.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered an update and price cut for ByteDance’s Doubao, as well as Atlas, a model that detects cancer and other diseases. Subscribe today!
White House Orders Muscular AI Policy
Under a new president, the United States reversed its approach to AI regulation, seeking global dominance by reducing restrictions.
What’s new: President Trump, who took office last week, signed an executive order that set a 180-day deadline to draft an AI Action Plan. The order aims to boost national security, economic competitiveness, and U.S. leadership in artificial intelligence.
How it works: The executive order assigns responsibility for crafting the AI Action Plan to three key figures in the administration: Michael Kratsios, assistant to the president for science and technology (and former managing director of Scale AI); venture capitalist David Sacks, the new special advisor for AI and cryptocurrency; and national security advisor Michael Waltz.
AI infrastructure build-out: Along with the executive order, President Trump announced Stargate, a joint venture that involves OpenAI, Oracle, and SoftBank. The three companies outlined a plan to invest an initial $100 billion in computing infrastructure for AI, such as next-generation data centers, and as much as $500 billion over four years. In addition, the administration declared a national emergency with respect to U.S. energy supplies and issued an order to ramp up domestic energy production. These measures aim to support energy-intensive AI initiatives like Stargate by removing regulatory barriers to building oil, gas, and renewable energy projects on federal lands.
Why it matters: The Trump administration says that Biden’s 2023 regulations were “onerous and unnecessary,” stifled innovation, and jeopardized U.S. leadership in AI. The new order reduces bureaucratic oversight of AI development, creating a more permissive regulatory environment (except when it comes to ideological bias).
We’re thinking: The Biden administration’s 2023 executive order aimed to guard against hypothetical, rather than actual, AI risks. It introduced thresholds of processing power used to train models as a measure of their risk, a poorly thought-out proxy. To be fair, the AI Safety Institute under the U.S. National Institute of Standards and Technology didn’t hamper AI progress as much as some had feared, but overall the order was not helpful to AI innovation or safety. We’re pleased that the new administration is focusing on AI progress rather than hypothetical risks.
Fine-Tuning Fine Points
The practice of fine-tuning models on synthetic data is becoming well established. But synthetic training data, even if it represents the training task well, may include characteristics like toxicity that impart unwelcome properties to the trained model’s output, and it may inconsistently represent desired traits such as the target output length. Researchers developed a method that reduces unwanted characteristics of generated data while retaining desired ones.
Results: Fine-tuning on the best response for each characteristic improved performance with respect to that characteristic more than fine-tuning on the initial outputs or on randomly selected outputs.
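The selection step lends itself to a simple best-of-n recipe. Below is a minimal sketch, assuming you already have a generator model and a scoring function for the characteristic you care about (say, a toxicity classifier or a length-target check); the function names and selection rule are illustrative, not the authors’ exact procedure.

```python
def build_finetuning_set(prompts, generate, score, n_candidates=8):
    """For each prompt, keep the candidate response that scores best on the target characteristic."""
    dataset = []
    for prompt in prompts:
        # Sample several synthetic responses, then keep only the highest-scoring one.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        best = max(candidates, key=lambda response: score(prompt, response))
        dataset.append({"prompt": prompt, "response": best})
    return dataset

# Example scorers (assumptions, not the paper's): score() might return
# 1 - toxicity_probability(response) to suppress toxicity, or
# -abs(len(response.split()) - target_length) to favor a desired output length.
```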
Why it matters: Training on synthetic data is becoming increasingly common. While it shows great promise, best practices for data generation are still being formulated. The authors’ method helps by automatically steering models toward generating more desirable responses, reducing negative traits and reinforcing positive traits.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.