Dear friends,
A “10x engineer” (a widely accepted concept in tech) purportedly has 10 times the impact of the average engineer. But we don’t seem to have 10x marketers, 10x recruiters, or 10x financial analysts. As more jobs become AI-enabled, I think this will change, and there will be a lot more “10x professionals.”
10x engineers don’t write code 10 times faster. Instead, they make technical architecture decisions that yield dramatically better downstream impact, they spot problems and prioritize tasks more effectively, and rather than rewriting 10,000 lines of code (or labeling 10,000 training examples), they might figure out how to get the job done with just 100 lines (or 100 examples).
Similarly, 10x recruiters won’t just use generative AI to help write emails to candidates or summarize interviews. (This level of use of prompting-based AI will soon become table stakes for many knowledge roles.) They might coordinate a suite of AI tools to efficiently identify and carry out research on a large set of candidates, enabling them to have dramatically greater impact than the average recruiter. And 10x analysts won’t just use generative AI to edit their reports. They might write code to orchestrate a suite of AI agents to do deep research into the products, markets, and companies, and thereby derive far more valuable conclusions than someone who does research the traditional way.
Keep learning!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Learn in detail how transformer-based large language models work in this new course by the authors of Hands-On Large Language Models. Explore the architecture introduced in the paper “Attention Is All You Need,” and learn through intuitive explanations and code examples. Join in for free
News
Reasoning in High Gear
OpenAI introduced a successor to its o1 models that’s faster, less expensive, and especially strong in coding, math, and science.
What’s new: o3-mini is a large language model that offers selectable low, medium, and high levels of reasoning “effort.” These levels consume progressively more reasoning tokens (specific numbers and methods are undisclosed), and thus more time and cost, to generate a chain of thought. It’s available to subscribers to ChatGPT Plus, Team, and Pro, as well as to higher-volume users of the API (tiers 3 through 5). Registered users can try it via the free ChatGPT service by selecting “reason” in the message composer or selecting o3-mini before regenerating a response.
How it works: o3-mini’s training set emphasized structured problem-solving in science and technology fields, and fine-tuning used reinforcement learning on chain-of-thought (CoT) data. Like the o1 family, it charges for tokens that are processed during reasoning operations and hides them from the user. (Competing reasoning models DeepSeek-R1, Gemini 2.0 Flash Thinking, and QwQ-32B-Preview make these tokens available to users.) o3-mini has a maximum input of 200,000 tokens and a maximum output of 100,000 tokens. Its knowledge cutoff is October 2023.
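For API users, the effort setting is exposed as a request parameter. Here’s a minimal sketch using OpenAI’s Python SDK, assuming an API key in the environment and an account tier with access to the model:

```python
# Minimal sketch, assuming the OpenAI Python SDK (v1+) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[{"role": "user", "content": "How many primes lie between 100 and 150?"}],
)
print(response.choices[0].message.content)
```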
What they’re saying: Users praised o3-mini for its speed, reasoning, and coding abilities. They noted that it responds best to “chunkier” prompts with lots of context. However, due to its smaller size, it lacks extensive real-world knowledge and struggles to recall facts.
Behind the news: Days after releasing o3-mini, OpenAI launched deep research, a ChatGPT research agent based on o3. OpenAI had announced the o3 model family in December, positioning it as an evolution of its chain-of-thought approach. The release came shortly after that of DeepSeek-R1, an open weights model that captivated the AI community with its high performance and low training cost, but OpenAI maintained that the debut took place on its original schedule.
Why it matters: o3-mini continues OpenAI’s leadership in language models and further refines the reasoning capabilities introduced with the o1 family. By focusing on coding, math, and science tasks, it plays to the strengths of reasoning models and raises the bar for other model builders. In practical terms, it pushes AI toward applications in which it’s a reliable professional partner rather than a smart intern.
We’re thinking: We’re glad that o3-mini is available to users of ChatGPT’s free tier as well as paid subscribers and API users. The more users become familiar with how to prompt reasoning models, the more value they’ll deliver.
Training for Computer Use
As Anthropic, Google, OpenAI, and others roll out agents that are capable of computer use, new work shows how underlying models can be trained to do this.
What’s new: Yujia Qin and colleagues at ByteDance and Tsinghua University introduced UI-TARS, a fine-tuned version of the vision-language model Qwen2-VL that uses lines of reasoning to decide which mouse clicks, keyboard presses, and other actions to take in desktop and mobile apps. The model’s weights are licensed freely for commercial and noncommercial uses via Apache 2.0. You can download them here.
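Because UI-TARS is a fine-tune of Qwen2-VL, its weights should load like any other Qwen2-VL checkpoint in Hugging Face transformers. A minimal sketch; the repository id below is a placeholder, not the confirmed name:

```python
# Minimal sketch, assuming the released weights follow the standard Qwen2-VL format.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

repo_id = "your-org/UI-TARS"  # placeholder; substitute the actual repository id
processor = AutoProcessor.from_pretrained(repo_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
# At inference time, the model takes a screenshot plus an instruction and returns
# its reasoning along with the next GUI action (click, type, scroll, and so on).
```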
Behind the news: Adept touted computer use in early 2022, and OmniParser and Aguvis soon followed with practical implementations. In October 2024, Anthropic set off the current wave of model/app interaction with its announcement of computer use for Claude 3.5 Sonnet. OpenAI recently responded with Operator, its own foray into using vision and language models to control computers.
Results: UI-TARS matched or outperformed Claude 3.5 Sonnet with computer use, GPT-4o with various computer use frameworks, and the Aguvis framework with its native model on 11 benchmarks. On OSWorld, which asks models to perform tasks using a variety of real-world applications and operating systems, UI-TARS successfully completed 22.7 percent of the tasks in 15 steps, whereas Claude 3.5 Sonnet with computer use completed 14.9 percent, GPT-4o with Aguvis completed 17 percent, and Aguvis with its native model completed 10.3 percent.
Why it matters: Training a model to take good actions enables it to perform well. Training it to correct its mistakes after making them enables it to recover from unexpected issues that may occur in the real world.
We’re thinking: Since computer use can be simulated in a virtual machine, it’s possible to generate massive amounts of training data automatically. This is bound to spur rapid progress in computer use by large language models.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered Open-R1’s efforts to build a better training pipeline for reasoning models and how OpenAI’s Deep Research is bringing PhD-level analysis to ChatGPT. Subscribe today!
Gemini Thinks Faster
Google updated the December-vintage reasoning model Gemini 2.0 Flash Thinking and other Flash models, gaining ground on OpenAI o1 and DeepSeek-R1.
What’s new: Gemini 2.0 Flash Thinking Experimental 01-21 is a vision-language model (images and text in, text out) that’s trained to generate a structured reasoning process or chain of thought. The new version improves on its predecessor’s reasoning capability and extends its context window. It’s free to access via API while it remains designated “experimental,” and it’s available to paid users of the Gemini app along with Gemini 2.0 Flash (fresh out of experimental mode) and the newly released Gemini 2.0 Pro Experimental. The company also launched a preview of Gemini 2.0 Flash Lite, a vision-language model (images and text in, text out) that outperforms Gemini 1.5 Flash at the same price.
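Here’s a minimal sketch of calling the model through Google’s Python SDK. The model id string is assumed from the release naming, so check Google’s current model list before relying on it:

```python
# Minimal sketch, assuming the google-generativeai package and a Gemini API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Model id assumed from the release naming ("Experimental 01-21"); verify against
# Google's published model list.
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp-01-21")
response = model.generate_content(
    "A bat and a ball cost $1.10, and the bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)
print(response.text)
```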
Speed bumps: Large language models that are trained to generate a chain of thought (CoT) are boosting accuracy even as the additional processing increases inference costs and latency. Reliable measures of Gemini 2.0 Flash Thinking Experimental 01-21’s speed are not yet available, but its base model runs faster (168.8 tokens per second with 0.46 seconds of latency to the first token, according to Artificial Analysis) than all models in its class except o1-mini (which outputs 200 tokens per second with 10.59 seconds of latency to the first token).
Okay, But Please Don’t Stop Talking
Even cutting-edge, end-to-end, speech-to-speech systems like ChatGPT’s Advanced Voice Mode tend to get interrupted by interjections like “I see” and “uh-huh” that keep human conversations going. Researchers built an open alternative that’s designed to go with the flow of overlapping speech.
What’s new: Alexandre Défossez, Laurent Mazaré, and colleagues at Kyutai, a nonprofit research lab in Paris, released Moshi, an end-to-end, speech-to-speech system that’s always listening and always responding. The weights and code are free for noncommercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses. You can try a web demo here.
Key insight: Up to 20 percent of spoken conversation consists of overlapping speech, including interjections like “okay” and “I see.”
How it works: The authors combined an encoder-decoder called Mimi and an RQ-Transformer, which is made up of the Helium transformer-based large language model (LLM) plus another transformer.
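To make the division of labor concrete, here’s a schematic sketch in PyTorch of how the pieces described above fit together. It illustrates the idea only; the module names and single-step interface are assumptions, not Kyutai’s implementation.

```python
# Schematic sketch only, not Kyutai's code. Mimi converts audio to discrete tokens and
# back; the RQ-Transformer (Helium over time steps plus a smaller transformer over the
# codebooks within a step) predicts the system's next audio tokens while continuously
# ingesting the user's stream.
import torch.nn as nn

class MoshiSketch(nn.Module):
    def __init__(self, mimi_encoder, mimi_decoder, helium_lm, depth_transformer):
        super().__init__()
        self.mimi_encoder = mimi_encoder            # waveform frame -> discrete audio tokens
        self.mimi_decoder = mimi_decoder            # discrete audio tokens -> waveform frame
        self.helium_lm = helium_lm                  # temporal transformer over the token stream
        self.depth_transformer = depth_transformer  # predicts codebooks within one time step

    def step(self, user_audio_frame, state):
        """Consume one incoming audio frame and emit one outgoing frame."""
        user_tokens = self.mimi_encoder(user_audio_frame)
        hidden, state = self.helium_lm(user_tokens, state)
        reply_tokens = self.depth_transformer(hidden)
        return self.mimi_decoder(reply_tokens), state
```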
Results: In tests, Moshi proved fast and relatively accurate.
Why it matters: While a turn-based approach may suffice for text input, voice-to-voice interactions benefit from a system that processes both input and output quickly and continuously. Previous systems process input and output separately, making users wait. Moshi delivers seamless interactivity.
We’re thinking: Generating silence is golden!
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.