Dear friends,
Coding agents are accelerating different types of software work to different degrees. When we architect teams, understanding these distinctions helps us to have realistic expectations. Listing functions from most accelerated to least, my order is: frontend development, backend, infrastructure, and research.
Frontend development — say, building a web page to serve descriptions of products for an ecommerce site — is dramatically sped up because coding agents are fluent in popular frontend languages like TypeScript and JavaScript and frameworks like React and Angular. Additionally, by examining what they have built by operating a web browser, coding agents are now very good at closing the loop and iterating on their own implementations. Granted, LLMs today are still weak at visual design, but given a design (or if a polished design isn’t important), the implementation is fast!
Backend development — say, building APIs to respond to queries requesting product data — is harder. It takes more work by human developers to steer modern models to think through corner cases that might lead to subtle bugs or security flaws. Further, a backend bug can lead to non-intuitive downstream effects like a corrupted database that occasionally returns incorrect results, which can be harder to debug than a typical frontend bug. Finally, although database migrations can be easier with coding agents, they’re still hard and need to be handled carefully to prevent data loss. Coding agents make backend development much faster, but they accelerate it less than frontend work, and skilled developers still design and implement far better backends than inexperienced ones who use coding agents.
Infrastructure. Agents are even less effective in tasks like scaling an ecommerce site to 10K active users while maintaining 99.99% reliability. LLMs' knowledge is still relatively limited with respect to infrastructure and the complex tradeoffs good engineers must make, so I rarely trust them for critical infra decisions. Building good infrastructure often requires a period of testing and experimentation, and coding agents can help with that, but ultimately that’s a significant bottleneck where fast AI coding does not help much. Lastly, finding infrastructure bugs — say, a subtle network misconfiguration — can be incredibly difficult and requires deep engineering expertise. Thus, I’ve found that coding agents accelerate critical infrastructure work even less than backend development.
Research. Coding agents accelerate research work even less. Research involves thinking through new ideas, formulating hypotheses, running experiments, interpreting them to potentially modify the hypotheses, and iterating until we reach conclusions. Coding agents can speed up the pace at which we can write research code. (I also use coding agents to help me orchestrate and keep track of experiments, which makes it easier for a single researcher to manage more experiments.) But there is a lot of work in research other than coding, and today’s agents help with research only marginally.
Categorizing software work into frontend, backend, infra, and research is an extreme simplification, but having a simple mental model for how much different tasks have sped up has been useful for how I organize software teams. For example, I now ask front-end teams to implement products dramatically faster than a year ago, but my expectations for research teams have not shifted nearly as much.
I am fascinated by how to organize software teams to use coding agents to achieve speed, and will keep sharing my findings in future letters.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
In “Building Multimodal Data Pipelines,” you’ll learn how to build pipelines that handle images, audio, and video from end to end. You’ll turn unstructured data into something you can query. Enroll for free
News
GLM-5.1 Aims for Long-Running Tasks
Z.ai updated its flagship open-weights large language model to work autonomously on single tasks for up to eight hours.
What’s new: GLM-5.1 is designed for coding and agentic tasks. Z.ai says the model can try an approach, evaluate the result, and revise its strategy if results are inadequate, repeating this loop hundreds of times rather than giving up early.
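The try-evaluate-revise loop described above can be sketched generically. Everything in this sketch — the function names, the scoring threshold, and the iteration cap — is an illustrative assumption, not GLM-5.1’s actual mechanism.

```python
# Illustrative sketch of a try/evaluate/revise agent loop.
# attempt(), evaluate(), and revise() are hypothetical stand-ins,
# not part of any real GLM-5.1 API.

def run_agent(task, max_iterations=300, threshold=0.9):
    """Repeatedly attempt a task, score each result, and revise the
    strategy until the score clears the threshold or iterations run out."""
    strategy = "initial"
    best_score, best_result = 0.0, None
    for _ in range(max_iterations):
        result = attempt(task, strategy)
        score = evaluate(result)
        if score > best_score:
            best_score, best_result = score, result
        if score >= threshold:
            break  # good enough; stop early
        strategy = revise(strategy, result, score)  # pivot instead of giving up
    return best_result, best_score

# Toy stand-ins so the sketch runs end to end.
def attempt(task, strategy):
    return f"{task} via {strategy}"

def evaluate(result):
    return 1.0 if "revised" in result else 0.5

def revise(strategy, result, score):
    return "revised-" + strategy
```

The key design point is that a low score feeds back into the next attempt rather than terminating the loop, which is what distinguishes sustained agentic work from a single-shot completion.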
How it works: Z.ai has not published a technical report specific to GLM-5.1, which appears to follow GLM-5’s basic architecture, attention mechanism, pretraining, and input/output size limits. The key improvement is sustained productivity in long-running tasks.
Performance: GLM-5.1 achieved strong coding results among open-weights models but trailed proprietary models in tests of reasoning and math.
Price increase: Z.ai priced GLM-5.1 significantly higher than its predecessor. Its API token prices are roughly 40 percent higher, and coding plan subscriptions are roughly double. Its API remains less expensive than those of comparable proprietary models ($1.40 per million input tokens versus $5 per million for Claude Opus 4.6), but the gap is narrowing.
Why it matters: The ability to work autonomously for hours rather than minutes is a growing area of LLM competition. The lengths of tasks completed autonomously by AI agents have doubled roughly every seven months, according to METR, an independent testing organization, and Anysphere’s Cursor integrated development environment ran a swarm of agents for a week. However, benchmarks that are designed to test sustained performance, such as SWE-EVO, show that even top models successfully complete only around 25 percent of long-running coding tasks.
We’re thinking: If GLM-5.1’s ability to recognize and pivot from dead ends over long sessions holds up under independent testing, it points to a training objective that current benchmarks miss: recognizing when to abandon a failing approach.
Humanoid Robots Work Factory Floors
A small number of humanoid robots have made their way into industrial settings, where they’re roughly matching the cost of human labor and propelling some workers into higher-level roles.
What’s new: Agility Robotics, based in Oregon, is supplying humanoid robots to Schaeffler, a German maker of automotive parts, in the first operational deployments of humanoid robots, The Wall Street Journal reported. Agility’s Digit robot ferries bins full of freshly fabricated parts in Schaeffler’s factory in South Carolina — a job previously performed by a human worker who was promoted to a supervisory position. Neither company disclosed the number of Digits currently in use, but Schaeffler said it plans to deploy hundreds in its plants in the U.S. and Europe by 2030.
How it works: At the Schaeffler factory, Digit carries 25-pound baskets from a stamping press to a conveyor belt, a traverse that takes about 1 minute to complete. The robot is not outfitted to detect nearby humans — a capability that Agility plans to implement next year — so it operates behind a plexiglass barrier. It works for two four-hour shifts with a break in between to recharge. The company has revealed few details about its technology, including its processing hardware, AI models, datasets, and training methods.
Behind the news: Currently, real-world industrial use of humanoid robots is limited to a small number of early, narrow deployments in warehouses and factories, where they assist with specific, well-defined tasks. Most other humanoid systems in industry remain in pilot or trial phases. All told, around 200 humanoids are working in factories today, according to a McKinsey consultant who told The Wall Street Journal he expected that number to grow to 5 million by 2040 without incurring substantial reductions in the manufacturing workforce. Generally, research suggests that robots displace humans in specific tasks, driving a restructuring of jobs and upgrading of the remaining roles. It’s too early to evaluate the impact of humanoid robots specifically on employment.
Why it matters: Humanoid robots have become widely available only in the past few years, thanks to improvements in batteries, motors, and AI. Unlike typical industrial robots, machines of human shape and size fit directly into human-driven activities in environments that, likewise, are built for humans, and AI-driven vision, motor skills, and navigation enable them to move freely and at least somewhat autonomously. Schaeffler’s use of Digits in South Carolina — a step beyond pilot programs such as tests of Agility robots at Amazon and GXO Logistics and BMW’s trial of Figure’s humanoids — indicates that they are capable of economically useful work and may well take on labor currently performed by humans.
We’re thinking: If robotics research is an indication, lots of headroom remains to make humanoid robots more autonomous, interactive, and generally capable.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered unauthorized access to Anthropic’s Claude Mythos model and growing skepticism around its cybersecurity claims. Subscribe today!
Anti-Data-Center Revolt Gains Traction
Resistance to new data centers is mounting across the United States.
What’s new: Opponents of data centers are registering their disapproval through legislative channels and, in two recent instances, through acts of violence. Objections to these facilities include their impact on electricity prices, consumption of electricity and water, noise pollution, proximity to residential neighborhoods, and sprawling size. Around $64 billion worth of data-center projects have been blocked or delayed amid local opposition between May 2024 and March 2025, one research group estimates.
How it works: Some of this resistance is being expressed through democratic channels.
Violent responses: Antipathy toward data centers has been implicated in violence in at least two cases.
Behind the news: Lack of transparency around some of the projects is a key gripe of opponents. In the Missouri development, for example, the operator of the data center has not been publicly identified. Critics also point to the environmental footprint of the facilities, particularly their noise levels, large square footage, energy demands, and water consumption. However, newer data centers have more environmentally friendly designs, such as more water-efficient closed-loop systems to cool their servers. Further, an increasing number of data centers supply their own power through privately owned, off-the-grid power plants.
Why it matters: The rapid growth of AI has led to surging demand for data centers, but electricity is emerging as a key constraint in some regions. Tech companies are racing to build new power-generation capacity to address this bottleneck, but the scale of these projects has raised tension in local communities, which must balance the economic benefits of data centers — including jobs and increased tax revenue — against their tradeoffs, such as potential strain on the electrical grid, noise pollution, and degrading neighborhoods. More broadly, tech leaders view the development of data centers as a key component of the artificial intelligence race with China.
We’re thinking: Some data center operators are more responsible than others. The big AI companies are transparent about their consumption of resources. Their use of electricity and water is often far less than the public believes, and the latest data centers are more environmentally friendly than older ones.
Assistants That Assist Consistently
Typically, large language models are trained to act as helpful, harmless, honest assistants. However, during long or emotionally charged conversations, traits can emerge that are less beneficial. Researchers devised a way to steady the assistant personas of LLMs.
What’s new: Christina Lu and colleagues at the ML Alignment & Theory Scholars Program (an independent academic fellowship that matches researchers with mentors), the University of Oxford, and Anthropic defined the assistant axis, a vector based on a model’s layer outputs that shows how closely it adheres to its trained-in assistant character. The team developed a method to correct deviations from this vector.
Key insight: Earlier work extracted persona vectors from LLM layer outputs that correspond to particular character traits: helpfulness, optimism, humor, sycophancy, evil, and so on. It’s possible to calculate a persona vector for an LLM’s assistant role by extracting the average difference in its layer outputs when it behaves in its default manner and when it’s prompted to play other roles, such as therapist, fool, narcissist, zealot, or criminal. The similarity between this difference vector, which the authors call the assistant axis, and the model’s layer outputs at any given moment reveals how far the LLM has drifted from its assistant role, drift that can lead some users into dangerous territory. When the model’s character strays, steering its activations to increase that similarity brings it back on track.
How it works: The team explored deviations from the default character of Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B. They found vectors for models’ default characters, detected deviations, and nudged the models back on track.
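The recipe above can be sketched with synthetic activations. The dimensions, the random data, and the capping rule below are illustrative assumptions, not the authors’ exact method or models.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden-state dimensionality (synthetic, for illustration)

# Hypothetical layer outputs: activations when the model acts as its
# default assistant vs. when prompted into other personas.
assistant_acts = rng.normal(0.5, 1.0, size=(100, d))
other_persona_acts = rng.normal(-0.5, 1.0, size=(100, d))

# Assistant axis: average activation difference between the two modes,
# normalized to unit length.
axis = assistant_acts.mean(axis=0) - other_persona_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def assistant_similarity(h):
    """Cosine similarity between a hidden state and the assistant axis;
    low values suggest drift away from the assistant persona."""
    return float(h @ axis) / float(np.linalg.norm(h))

def cap_activation(h, min_proj=0.0):
    """Simplified steering rule: if the projection onto the axis falls
    below min_proj, shift the hidden state back along the axis."""
    proj = float(h @ axis)
    if proj < min_proj:
        h = h + (min_proj - proj) * axis
    return h

# A "drifted" hidden state gets nudged back toward the assistant axis.
h_drifted = other_persona_acts[0]
h_fixed = cap_activation(h_drifted, min_proj=1.0)
```

Because the axis is a unit vector, the correction moves the hidden state the minimum distance needed to restore the target projection, leaving the components orthogonal to the axis untouched.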
Results: Activation capping, which constrains the model’s layer outputs to remain close to the assistant axis, effectively kept models in their assistant role, and it did so without degrading performance on a variety of benchmarks.
Why it matters: Alignment training teaches LLMs to behave like assistants, but it tethers them to that behavior only loosely. Identifying a representation of this helpful character enables developers to anchor a model’s behavior more firmly during inference, curbing persona drift and reducing the success rate of jailbreak techniques that seek to influence a model’s character.
We’re thinking: Beyond alignment training, system prompts act as behavioral guardrails, but motivated users can bypass them. Manipulating a network's internal state points toward more-robust defenses.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.