Dear friends,
Separate reports by the public relations firm Edelman and Pew Research Center show that Americans, and more broadly large parts of Europe and the Western world, do not trust AI and are not excited about it. Despite the AI community's optimism about the tremendous benefits AI will bring, we should take this seriously and not dismiss it. The public's concerns about AI can be a significant drag on progress, and we can do a lot to address them.
To be clear, all of us working in AI should look carefully at both the benefits and harmful effects of AI (such as deepfakes polluting social media and biased or inaccurate AI outputs misleading users), speak truthfully about both benefits and harms, and work to ameliorate problems even as we work to grow the benefits. But hype about AI’s danger has done real damage to trust in our field. Much of this hype has come from leading AI companies that aim to make their technology seem extraordinarily powerful by, say, comparing it to nuclear weapons. Unfortunately, a significant fraction of the public has taken this seriously and thinks AI could bring about the end of the world. The AI community has to stop self-inflicting these wounds and work to win back society’s trust.
First, to win people’s trust, we have a lot of work ahead to make sure AI broadly benefits everyone. “Higher productivity” is often viewed by general audiences as a codeword for “my boss will make more money,” or worse, layoffs. As amazing as ChatGPT is, we still have a lot of work to do to build applications that make an even bigger positive impact on people’s lives. I believe providing training to people will be a key piece of the puzzle. DeepLearning.AI will continue to lead the charge on AI training, but we will need more than this.
Second, we have to be genuinely worthy of trust. This means every one of us has to avoid hyping things up or fear mongering, despite the occasional temptation to do so for publicity or to lobby governments to pass laws that stymie competing products (such as open source).
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
In "Building Coding Agents with Tool Execution," you'll learn how coding agents reason, execute code in isolated sandboxes, manage files, and handle feedback loops, then apply it to hands-on examples like a data-analysis agent and a sandboxed Next.js web app. Enroll today
News
Open 3D Generation Pipeline
Meta's Segment Anything Model (SAM) for image segmentation has evolved into an open-weights suite for generating 3D objects. SAM 3 segments images, SAM 3D turns the segments into 3D objects, and SAM 3D Body produces 3D figures of any people among the segments. You can experiment with all three; a rough sketch of how the pieces chain together follows the model descriptions below.
SAM 3: The new version segments images and videos based on input text. It retains its predecessor's ability to segment objects based on input geometry (bounding boxes or points labeled to include or exclude the objects at those locations).
SAM 3D: This model generates 3D objects from images based on segmentation masks. By individually predicting each object in an image, it can represent the entire scene. It can also take in point clouds to improve its output.
SAM 3D Body: Meta released an additional model that produces 3D human figures from images. Input bounding boxes or masks can also determine which figures to produce, and an optional transformer decoder can refine the positions and shapes of human hands.
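The three models can be chained into a single image-to-3D flow. The sketch below is a conceptual outline only, assuming hypothetical stand-in functions rather than Meta's released interfaces; only the shape of the pipeline is meant to be accurate.

```python
# Conceptual sketch of the SAM 3 -> SAM 3D -> SAM 3D Body flow described above.
# The functions are hypothetical stand-ins, not Meta's released APIs.
import numpy as np

def segment_with_text(image, prompt):
    """Stand-in for SAM 3: one boolean mask per object that matches the text prompt."""
    return [np.zeros(image.shape[:2], dtype=bool)]          # placeholder masks

def mask_to_object(image, mask):
    """Stand-in for SAM 3D: lift one masked object to a 3D mesh."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), dtype=int)}

def people_to_bodies(image):
    """Stand-in for SAM 3D Body: a posed 3D human figure per detected person."""
    return []

image = np.zeros((480, 640, 3), dtype=np.uint8)              # placeholder photo
masks = segment_with_text(image, "sofa")                     # 1) text-prompted segmentation
objects = [mask_to_object(image, m) for m in masks]          # 2) each segment -> 3D object
bodies = people_to_bodies(image)                             # 3) any people -> 3D figures
scene = objects + bodies                                     # assemble the full 3D scene
```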
Why it matters: This SAM series offers a unified pipeline for making 3D models from images. Each model advances the state of the art, enabling more accurate text-based image segmentation, 3D objects that human judges preferred, and 3D human figures that judges likewise favored. These models are already driving innovations in Meta's user experience. For instance, SAM 3 and SAM 3D enable users of Facebook Marketplace to see what furniture or other home decor looks like in a particular space.
We’re thinking: At the highest level, all three models learned from a similar data pipeline: Find examples the model currently performs poorly on, use humans to annotate them, and train on the annotations. According to Meta’s publications, this process greatly reduced the time and money required to annotate quality datasets.
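For readers who want a feel for this kind of model-in-the-loop data engine, here is a minimal, self-contained sketch. Uncertainty sampling stands in for finding examples the model handles poorly, and revealing held-out labels stands in for human annotation; Meta's actual pipeline is far larger and works on segmentation and 3D data rather than a toy classifier.

```python
# Minimal sketch of a model-in-the-loop data engine: train, find weak spots,
# "annotate" them, and retrain. Toy classifier and data; illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:100] = True                              # small seed set of "annotated" data

model = LogisticRegression(max_iter=1000)
for round_num in range(5):
    model.fit(X[labeled], y[labeled])             # train on current annotations
    probs = model.predict_proba(X[~labeled])      # score the unlabeled pool
    uncertainty = 1.0 - probs.max(axis=1)         # low confidence ~ "performs poorly"
    pool_idx = np.flatnonzero(~labeled)
    hardest = pool_idx[np.argsort(uncertainty)[-200:]]
    labeled[hardest] = True                       # "annotate" the hardest examples
    print(f"round {round_num}: {labeled.sum()} labeled, accuracy {model.score(X, y):.3f}")
```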
Generated, Editable Virtual Spaces
Models that generate 3D spaces typically do so on the fly as users move through them, without producing a persistent world that can be explored later. A new model produces 3D worlds that can be exported and modified.
What's new: World Labs launched Marble, which generates persistent, editable, reusable 3D spaces from text, images, and other inputs. The company also debuted Chisel, an integrated editor that lets users modify Marble's output via text prompts and craft environments from scratch.
How it works: Marble accepts several media types and exports 3D spaces in a variety of formats.
Performance: Early users report generating game-like environments and photorealistic recreations of real-world locations.
Behind the news: Earlier generative models can produce 3D spaces on the fly, but typically such spaces can't be saved or revisited interactively. For instance, in October, World Labs introduced RTFM, which generates spaces in real time as users navigate through them. Competing systems from startups like Decart and Odyssey are available as demos, and Google's Genie 3 remains a research preview. Marble stands out by generating spaces that can be saved and edited.
Why it matters: World Labs founder and Stanford professor Fei-Fei Li argues that spatial intelligence — understanding how physical objects occupy and move through space — is a key aspect of intelligence that language models can’t fully address. With Marble, World Labs aspires to catalyze development in spatial AI just as ChatGPT and subsequent large language models ignited progress in text processing.
We’re thinking: Virtual spaces produced by Marble are geometrically consistent, which may prove valuable in gaming, robotics, and virtual reality. However, the objects within them are static. Virtual worlds that include motion will bring AI even closer to understanding physics.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered DeepSeek’s new V3.2 models with integrated reasoning and tools, and Claude Opus 4.5’s state-of-the-art coding and agent capabilities. Subscribe today!
Baidu’s Multimodal Bids
Baidu debuted two models: a lightweight, open-weights, vision-language model and a giant, proprietary, multimodal model built to take on U.S. competitors.
Ernie-4.5-VL-28B-A3B-Thinking: Baidu's new open-weights model is based on the earlier Ernie-4.5-21B-A3B-Thinking, a text-only mixture-of-experts (MoE) reasoning model, plus a 7 billion-parameter vision encoder that processes images. It outperforms comparable and larger models on visual reasoning tasks. It can extract on-screen text and analyze videos across time, and it can call tools to zoom in on image details and search for related images.
Ernie-5.0: Baidu describes Ernie-5.0’s approach as natively multimodal, meaning it was trained on text, images, audio, and video together rather than fusing different media encoders after training or routing inputs to specialized models. It performs comparably to the similarly multimodal Google Gemini 2.5 or OpenAI GPT-5, according to Baidu.
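To make "natively multimodal" concrete, here is a toy early-fusion sketch in PyTorch: each modality is projected into a shared embedding space, and one transformer processes the combined token sequence, instead of fusing the outputs of separately trained per-modality encoders afterward. It illustrates the general idea only, not Ernie-5.0's actual architecture, and all sizes are arbitrary.

```python
# Toy early-fusion sketch: project text, image, and audio inputs into one shared
# token space and run a single transformer over the combined sequence.
# Illustrates the general idea of native multimodality, not Ernie-5.0 itself.
import torch
import torch.nn as nn

d_model = 256
text_embed = nn.Embedding(32_000, d_model)            # token IDs -> embeddings
image_proj = nn.Linear(768, d_model)                  # image patch features -> shared space
audio_proj = nn.Linear(128, d_model)                  # audio frame features -> shared space
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
)

text_ids = torch.randint(0, 32_000, (1, 16))          # 16 text tokens
image_patches = torch.randn(1, 64, 768)               # 64 image patches
audio_frames = torch.randn(1, 32, 128)                # 32 audio frames

tokens = torch.cat(
    [text_embed(text_ids), image_proj(image_patches), audio_proj(audio_frames)], dim=1
)                                                     # one combined sequence
fused = encoder(tokens)                               # joint representation of all modalities
print(fused.shape)                                    # torch.Size([1, 112, 256])
```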
Yes, but: Shortly after Ernie-5.0's launch, a developer reported that the model repeatedly called tools even after being instructed not to. Baidu acknowledged the issue and said it was fixing it.
Why it matters: Ernie-4.5-VL-28B-A3B-Thinking offers top visual reasoning at a fraction of the cost of competing models, along with more flexibility for fine-tuning and other commercial customizations. However, the long-awaited Ernie-5.0 appears to fall short of expectations. It matches top models on some visual tasks but trails frontier models (including Qwen3-Max and Kimi-K2-Thinking) on leaderboards like LM Arena. Pretraining on text, images, video, and audio together is a relatively fresh approach that could simplify current systems that piece together different encoders and decoders for different media types.
We’re thinking: Ernie-5.0 may outperform Gemini 2.5 and GPT-5, but Google and OpenAI have already moved on to Gemini 3 and GPT-5.1!
Coordinating Robot Teams
In factories, where teams of robotic arms work in tight spaces, their motions are programmed by hand to keep them from interfering with one another. Researchers automated this programming using graph neural networks trained via reinforcement learning.
What’s new: Matthew Lai, Keegan Go, and colleagues at Google DeepMind, University College London, and robotics software shop Intrinsic developed RoboBallet, a graph neural network that coordinates robotic arms.
Key insight: Coordinating several robot arms is computationally infeasible for traditional search-based planners that figure out how to reach a target by searching through possible joint movements while checking for collisions. Each additional robot or obstacle multiplies the number of possible configurations. A graph neural network can overcome this limitation by learning to produce synchronized, collision-free motions in large numbers of simulated setups with different robot placements, obstacles, and target positions.
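A quick back-of-the-envelope calculation shows why exhaustive search doesn't scale; the discretization and joint count below are illustrative assumptions, not figures from the paper.

```python
# Illustrative only: how a discretized joint-configuration space grows with team size.
bins_per_joint = 10      # coarse discretization of one joint's motion range (assumed)
joints_per_arm = 7       # joints in a typical articulated arm (assumed)
for num_arms in (1, 2, 4, 8):
    configs = (bins_per_joint ** joints_per_arm) ** num_arms
    print(f"{num_arms} arm(s): about {configs:.0e} joint configurations")
```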
How it works: RoboBallet is a graph neural network that takes as input the positions and orientations of robots, obstacles, and targets and generates joint velocities that move each arm from its current position toward a target. The authors trained it entirely in simulation using TD3, an actor-critic reinforcement learning algorithm. They generated about 1 million simulated workspaces, each of which contained a team of 4 or 8 simulated 7-joint Franka Panda robotic arms attached to the sides of a table at random, 30 obstacle blocks placed at random, and 40 target positions/orientations per team. They rejected configurations that started in collision.
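As a rough sketch of this setup (not the authors' code), the snippet below builds a fully connected graph over robot, obstacle, and target nodes and runs one round of message passing to produce a joint-velocity command per robot. The node features, network sizes, and single message-passing step are simplifications; in the paper, the policy is trained with TD3 in simulation.

```python
# Rough sketch of a graph policy in the spirit of RoboBallet: nodes describe robots,
# obstacles, and targets; message passing mixes information across nodes; the network
# emits a joint-velocity command for each robot node. Sizes are illustrative.
import torch
import torch.nn as nn

NODE_DIM, HIDDEN, NUM_JOINTS = 16, 128, 7

class GraphPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * NODE_DIM, HIDDEN), nn.ReLU(),
                                     nn.Linear(HIDDEN, HIDDEN))
        self.update = nn.Sequential(nn.Linear(NODE_DIM + HIDDEN, HIDDEN), nn.ReLU())
        self.action_head = nn.Linear(HIDDEN, NUM_JOINTS)   # joint velocities per robot

    def forward(self, nodes, robot_mask):
        n = nodes.shape[0]
        # Fully connected graph: every node receives a message from every other node.
        senders = nodes.unsqueeze(0).expand(n, n, -1)
        receivers = nodes.unsqueeze(1).expand(n, n, -1)
        msgs = self.message(torch.cat([receivers, senders], dim=-1)).mean(dim=1)
        hidden = self.update(torch.cat([nodes, msgs], dim=-1))
        return torch.tanh(self.action_head(hidden[robot_mask]))  # commands in [-1, 1]

# Toy scene: 4 robot nodes, 30 obstacle nodes, 40 target nodes, each a feature vector
# standing in for position/orientation plus a type flag (details differ in the paper).
nodes = torch.randn(4 + 30 + 40, NODE_DIM)
robot_mask = torch.zeros(nodes.shape[0], dtype=torch.bool)
robot_mask[:4] = True
velocities = GraphPolicy()(nodes, robot_mask)
print(velocities.shape)   # torch.Size([4, 7]): one joint-velocity vector per arm
```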
Results: The authors tested the trained model in the real world, generating input graphs from the known geometry of a physical workspace, including robot placements and 3D meshes of obstacles.
Why it matters: RoboBallet shows that a learning algorithm can coordinate many robots working together in real-world setups, and it can do so after training exclusively in simulation. The model is also more robust than hard-coded routines, which can't adapt when a robot fails. Because the graph neural network continuously tracks how robots, tasks, and obstacles relate, it can adapt on the fly and revise its plan.
We're thinking: Representing the world as a graph builds structure into the data by explicitly encoding relative positions and relationships among objects. Other data structures don't provide those relationships, so a network learning from them would have to learn the relationships in addition to the task itself. A graph spares the network that extra work, making the task easier to learn.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.