Dear friends,
As amazing as LLMs are, improving their knowledge today involves a more piecemeal process than is widely appreciated. I’ve written about how AI is amazing . . . but not that amazing. Well, it is also true that LLMs are general . . . but not that general. We shouldn’t buy into the inaccurate hype that LLMs are a path to AGI in just a few years, but we also shouldn’t buy into the opposite, also inaccurate hype that they are only demoware. Instead, I find it helpful to have a more precise understanding of the current path to building more intelligent models.
For example, to get a model to perform certain tasks, such as using a web browser, developers might go through an even more laborious process of creating many RL gyms (simulated environments) to let an algorithm repeatedly practice a narrow set of tasks.
A typical human, despite having seen vastly less text or practiced far less in computer-use training environments than today’s frontier models, nonetheless can generalize to a far wider range of tasks than a frontier model. Humans might do this by taking advantage of continuous learning from feedback, by having superior representations of non-text input (the way LLMs tokenize images still seems like a hack to me), and by many other mechanisms that we do not yet understand.
Keep building!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Many agent failures trace back to invisible issues: unclear tool calls, silent reasoning errors, and changes that regress behavior. Our new course shows how to use Nvidia’s NeMo Agent Toolkit to add tracing, run repeatable evals, and deploy workflows with authentication and rate limiting, so your agents behave reliably in real environments. Enroll today
News
Coherent, Interactive Worlds
Runway’s GWM-1 family of video-generation models responds to user input in real time while producing scenes that remain consistent regardless of the camera’s position.
What’s new: Runway introduced GWM-1, a trio of “general world models” that were trained to understand how scenes behave, not just how scenes appear. GWM Worlds generates scenes, GWM Robotics produces synthetic data for training and testing robots, and GWM Avatars generates conversational characters with facial expressions and lip-synced speech. (In addition, the company added audio generation, audio editing, and multi-shot video editing capabilities to Gen-4.5, its flagship video generator.)
How it works: Unlike typical diffusion models that generate an entire video simultaneously by removing noise progressively over a number of steps, GWM-1 generates one frame at a time based on past frames and control inputs. This autoregressive approach enables the model to respond to control input in real time. Runway built each GWM-1 model by post-training Gen-4.5 on domain-specific data. The models take still images and text as input.
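In code, the frame-by-frame pattern looks roughly like the sketch below. The model interface is hypothetical, since Runway has not published GWM-1’s internals; the point is that each frame is conditioned on recent frames plus whatever control input arrives at that moment, which a whole-clip diffusion process cannot accommodate.

```python
# Hypothetical sketch of autoregressive, interactive frame generation.
# `model` and `get_user_input` are stand-ins, not Runway's actual API.
from collections import deque

def generate_interactive_video(model, first_frame, get_user_input,
                               num_frames=300, context_len=16):
    """Generate one frame at a time, conditioning on recent frames and live control input."""
    context = deque([first_frame], maxlen=context_len)  # rolling window of past frames
    frames = [first_frame]
    for _ in range(num_frames):
        control = get_user_input()                 # e.g., a camera move or text prompt, read in real time
        next_frame = model.predict_next_frame(     # hypothetical single-frame call
            past_frames=list(context),
            control=control,
        )
        frames.append(next_frame)
        context.append(next_frame)                 # the new frame becomes context for the next step
    return frames
```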
Behind the news: Until recently, world models, or models that predict the future state of an environment given actions taken within it, reflected fairly limited worlds. Upon its launch in early 2024, OpenAI’s Sora 1 generated video output that was impressive enough to inspire arguments over whether it qualified as a world model of the real world. Those arguments were premature, since Sora 1’s output, however photorealistic, was not consistent with real-world physics, for instance. But they presaged models like Google Genie 2, which produces 3D video-game worlds that respond to keyboard inputs in real time, and World Labs’ Marble, which generates persistent, editable, reusable 3D spaces from text, images, and other inputs.
Why it matters: Runway is among several AI companies that are racing to build models that simulate coherent worlds including objects, materials, lighting, fluid dynamics, and so on. Such models have huge potential value in entertainment and augmented reality but also in industrial and scientific fields, where they can help to design new products and plan for future scenarios. GWM Robotics (aimed at robotics developers) and GWM Avatars (which may be useful in applications like tutoring or customer service) show that Runway’s ambitions extend beyond entertainment.
We’re thinking: The world-model landscape is dividing between models that produce videos with real-time control (Runway GWM Worlds, Google Genie 3, World Labs RTFM) and those that make exportable 3D spaces (World Labs Marble). These approaches target different applications: Real-time interactivity enables training loops in which agents could learn from immediate feedback, while exportable 3D assets feed activities like game development, in which developers may refine and reuse assets across projects.
Disney Teams Up With OpenAI
Disney, the entertainment conglomerate that owns Marvel, Pixar, Lucasfilm and its own animated classics from 101 Dalmatians to Zootopia, licensed OpenAI to use its characters in generated videos.
What’s new: Disney and OpenAI signed a 3-year exclusive agreement that lets OpenAI train its Sora social video-gen app to produce 30-second clips that depict characters like Mickey Mouse, Cinderella, Black Panther, and Darth Vader. OpenAI will compensate Disney for uses of its characters at an undisclosed rate, and Disney will stream a selection of user-generated videos on its Disney+ streaming service. In addition, Disney bought a $1 billion stake in OpenAI.
How it works: Starting in early 2026, users of the Sora app — not to be confused with the underlying Sora model — will be able to generate clips that show more than 200 fictional Disney characters. The deal is not yet final and remains subject to negotiation and board approval.
Behind the news: Disney is one of the world’s largest media companies by revenue and OpenAI is a clear leader in AI, which makes their alliance especially significant. The deal serves as the carrot in a carrot-and-stick strategy: Disney and other top entertainment companies have been suing AI companies for alleged violations of intellectual property. Top music labels took a similar approach to gain a measure of control over AI startups that focus on music generation.
Why it matters: Video generation is a powerful creative tool, and one that Hollywood would like to have at its disposal. At the same time, generated videos are engaging ever-larger audiences, raising the question of whether they will draw attention and revenue away from Hollywood productions. Disney is embracing a future of custom, user-created media featuring its intellectual property as both a revenue stream in its own right and a hedge against a diminishing audience for theatrical releases and home video. Its investment in OpenAI also lets it share in AI’s upside. Cooperation between movie makers and AI companies gives both parties greater latitude to create compelling products and expand the audiences for both entertainment and AI-powered services.
We’re thinking: Filmmakers and videographers increasingly understand: AI and the arts may seem antithetical at first glance, but they’re a natural fit.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered the launch of a new open-source foundation for agentic AI and Google’s release of the Gemini Deep Research agent API for developers. Subscribe today!
OpenAI’s Answer to Gemini 3
OpenAI launched GPT-5.2 only weeks after its CEO Sam Altman reportedly issued a “code red” alarm in response to Google's Gemini 3.
What’s new: OpenAI added a suite of GPT-5.2 models to ChatGPT and its API: GPT-5.2 Pro for high accuracy (name in the API: gpt-5.2-pro), GPT-5.2 Thinking for multi-step tasks like coding and planning (gpt-5.2), and GPT-5.2 Instant for less-involved tasks (gpt-5.2-chat-latest). The company touts the new models as time savers in professional tasks like producing spreadsheets, presentations, or code.
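For developers, choosing among the three comes down to the model name passed to the API. Here is a minimal sketch using the OpenAI Python SDK; the model names come from the announcement, and the prompt is purely illustrative.

```python
# Minimal sketch of calling GPT-5.2 via the OpenAI Python SDK.
# Model names are taken from the announcement; availability depends on your account.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",  # GPT-5.2 Thinking; swap in "gpt-5.2-pro" or "gpt-5.2-chat-latest"
    messages=[{"role": "user", "content": "Outline a quarterly budget spreadsheet for a five-person team."}],
)
print(response.choices[0].message.content)
```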
How it works: OpenAI revealed few details about GPT-5.2’s architecture and training but said it made “improvements across the board, including in pretraining.”
Performance: According to the ARC leaderboards, GPT-5.2 Pro set new states of the art on ARC-AGI-1 and ARC-AGI-2 (abstract visual puzzles). It remains neck-and-neck with competitors on other independent tests.
Behind the news: GPT-5.2 arrived as OpenAI faces heightened competitive pressure. CEO Sam Altman had declared a “code red” emergency — a level of alarm that, in a hospital, typically signals a fire — on December 1, soon after Google launched Gemini 3. He instructed employees to delay plans to add advertisements to ChatGPT and instead focus on improving the models. OpenAI executives deny that GPT-5.2 was rushed.
Why it matters: GPT-5.2’s gains in computational efficiency are stark. One year ago, achieving 88 percent on ARC-AGI-1 cost roughly $4,500 per task. GPT-5.2 Pro achieves 90.5 percent at around $12 per task, roughly 375 times less ($4,500 / $12 ≈ 375). Extended reasoning is becoming dramatically more accessible.
We’re thinking: Technical approaches that aren’t economically feasible today, say running hundreds of reasoning attempts per problem or deploying thousands of reasoning-heavy agents, are on track to become surprisingly affordable within a few years.
Adapting LLMs to Any Sort of Data
Enabling a pretrained large language model to process a data type other than text (say, images), possibly in a specialized domain (say, radiology), typically requires thousands to millions of examples that pair the other data (perhaps x-rays) with text. Researchers devised an approach that requires a small number of examples.
What’s new: Sample-Efficient Modality Integration (SEMI) enables an LLM to process any input data type in any specialized domain based on as few as 32 examples. Given a suitable, pre-existing encoder, a single projector plus a dynamically generated set of LoRA adapters translates input embeddings into the LLM’s embedding space. Osman Batur İnce developed the method with colleagues at the University of Edinburgh, Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, and Unbabel, a machine translation company.
Key insight: Typically, adapting a large language model (LLM) to accept multimodal inputs requires training a separate projector for each data type and/or domain. But the ability to adapt to unfamiliar input data types/domains can be considered a general, learnable skill. A projector can learn this skill by training on data types/domains for which examples are plentiful. Then LoRA adapters can adjust it for new data types/domains for which few examples are available. Better yet, a separate network can generate LoRA adapters that adjust the projector to new data types/domains as needed.
How it works: The authors aimed to connect pre-existing, pretrained domain-specific encoders (CLIP for images, CLAP for audio, VideoCLIP-XL for video, and others) to a pretrained large language model (Llama 3.1 8B). To that end, they trained a projector (a vanilla neural network) plus a LoRA generator (a network made up of a single attention layer).
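The sketch below shows one way to structure such a projector and LoRA generator in PyTorch. Dimensions, layer choices, and the pooling scheme are illustrative assumptions, not the authors’ exact configuration; the point is that a small network turns a handful of example embeddings into LoRA weights that specialize a shared projector.

```python
# Illustrative SEMI-style setup (not the authors' exact architecture): a shared projector
# maps encoder embeddings into the LLM's embedding space, and a small generator produces
# LoRA weights that specialize the projector for a new data type from a few examples.
import torch
import torch.nn as nn

class LoRAProjector(nn.Module):
    def __init__(self, enc_dim=768, llm_dim=4096):
        super().__init__()
        self.base = nn.Linear(enc_dim, llm_dim)  # shared projector, trained on data-rich domains

    def forward(self, enc_emb, lora_A, lora_B):
        # lora_A: (enc_dim, rank), lora_B: (rank, llm_dim), supplied by the generator
        return self.base(enc_emb) + (enc_emb @ lora_A) @ lora_B

class LoRAGenerator(nn.Module):
    """Generates LoRA factors for the projector from example embeddings of a new domain."""
    def __init__(self, enc_dim=768, llm_dim=4096, rank=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
        self.to_A = nn.Linear(enc_dim, enc_dim * rank)
        self.to_B = nn.Linear(enc_dim, rank * llm_dim)
        self.dims = (enc_dim, llm_dim, rank)

    def forward(self, support_embs):  # (1, n_examples, enc_dim): one batch per new domain
        enc_dim, llm_dim, rank = self.dims
        attended, _ = self.attn(support_embs, support_embs, support_embs)
        pooled = attended.mean(dim=1)                   # summarize the few-shot examples
        lora_A = self.to_A(pooled).view(enc_dim, rank)
        lora_B = self.to_B(pooled).view(rank, llm_dim)
        return lora_A, lora_B

# Usage: embed ~32 paired examples of a new modality with its pre-existing encoder,
# generate LoRA factors, then project fresh inputs into the LLM's token-embedding space.
generator, projector = LoRAGenerator(), LoRAProjector()
support = torch.randn(1, 32, 768)                       # stand-in for encoder outputs of 32 examples
A, B = generator(support)
llm_tokens = projector(torch.randn(1, 10, 768), A, B)   # shape (1, 10, 4096), fed to the LLM
```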
Results: The authors compared SEMI to three baselines: training a projector from scratch, fine-tuning their projector, and fine-tuning their projector with a bespoke LoRA adapter. They tested on astronomical images from their own dataset, satellite images, IMU sensor data, and graphs of molecular structures, each paired with an appropriate pre-existing encoder. They measured performance using metrics that include CIDEr (higher is better), which gauges how well a generated caption matches a set of human-written references.
Why it matters: Large language models are of limited use in many technical fields because little text-paired data is available and building large text-paired datasets is expensive. This work could accelerate adoption of AI in such fields by taking advantage of knowledge in data-rich domains to bootstrap AI training in data-poor ones.
We’re thinking: For AI models to generalize to novel data types, they usually need to be trained on diverse, high-quality data. To that end, it’s helpful to squeeze more learning out of less data.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.