Dear friends,
As AI agents accelerate coding, what is the future of software engineering? Some trends are clear, such as the Product Management Bottleneck, the idea that we are constrained more by deciding what to build than by the actual building. But many implications, like AI’s impact on the job market, how software teams will be organized, and more, are still being sorted out.
The theme of our AI Developer Conference on April 28-29 in San Francisco is The Future of Software Engineering. I look forward to speaking about this topic there, hearing from other speakers on this theme, and chatting with attendees about it. We’re shaping the future, and I hope you will join me there!
In software engineering, I see a lot of exciting work ahead to adapt our workflows. It is already clear that: (i) As AI makes coding easier, a lot more people will be doing it. (ii) Writing code by hand and even reading (generated) code are not that important, because we can ask an LLM about the code and operate at a higher level than the raw syntax (although how high we can or should go is rapidly changing). (iii) There will be a lot more custom applications, because now it’s economical to write software for smaller and smaller audiences. (iv) Deciding what to build, more than the actual building, is becoming a bottleneck. (v) The cost of paying down technical debt is decreasing, since AI can refactor code for you.
I’m excited to explore these and other questions about the future of software engineering at AI Dev. I expect this to be an exciting event. Please join us!
Keep building,
Andrew
A MESSAGE FROM DEEPLEARNING.AI
New course available: Efficient Inference with SGLang: Text and Image Generation. Learn how LLM inference works and how to reduce cost and latency using KV cache and RadixAttention in SGLang. Apply the same principles to accelerate diffusion models and image generation. Enroll now
News
Claude Mythos Preview Raises Security Worries
Anthropic took unusual steps to prepare the world for a forthcoming large language model that it said poses extraordinary risks to cybersecurity.
What’s new: Claude Mythos Preview, which is not generally available, broadly outperforms the two-month-old Claude Opus 4.6, but it’s “strikingly capable” of identifying and exploiting vulnerabilities in existing code, Anthropic said. The company detailed its capabilities in a model card that fills 244 pages — the first time it has published a model card without making the model itself available commercially. Anthropic did not announce plans for a commercial release.
Precautions: To harden existing code against such capabilities, the company assembled a consortium called Project Glasswing that includes Amazon Web Services, Apple, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, and Nvidia along with more than 40 other organizations. Anthropic is funding exclusive access for Glasswing members ($100 million worth of credits at $25/$125 per million input/output tokens) as well as $4 million in donations to organizations that are devoted to maintaining open source projects, so these organizations can discover and patch vulnerabilities in code they control before the model, or another one like it, becomes widely available. Anthropic promised to share what Glasswing does and learns.
Security risks: Anthropic didn’t train Claude Mythos Preview on security-related tasks. The model’s skills arose from training in coding, reasoning, and autonomous behavior.
Performance: Claude Mythos Preview’s reported performance is impressive. In tests conducted by Anthropic, it substantially outperformed Claude Opus 4.6, OpenAI GPT-5.4, and Google Gemini 3.1 Pro on several popular benchmarks. The model card details the team’s efforts to minimize the impact of contamination of the model’s training data with benchmark test sets.
Yes, but: Anthropic’s way of introducing Claude Mythos Preview — promoting safety worries while withholding access from all but a small number of selected parties — comes right out of OpenAI’s early playbook. In 2019, that company promoted GPT-2’s ability to generate plausible text while keeping the model under wraps, citing the danger of its ability to produce disinformation and spam. Of course, the world greeted GPT-3 and subsequent iterations with unprecedented enthusiasm, and society has adjusted to the foibles and limitations of successive large language models. Anthropic’s caution may be justified but, like OpenAI’s product-release strategy, it has elements of a publicity stunt.
Why it matters: As large language models become more capable of coding, they also become more capable of finding bugs and exploiting them. Anthropic says the forthcoming Claude Mythos Preview does this dramatically better than its predecessors, posing risks to critical software that keeps society running. As long as the model outperforms top competitors, the company has little to lose — and potentially much to gain (say, avoiding damage to its brand after making the world less secure, if nothing else) — by creating a buzz while it prepares to deploy the model for commercial use.
We’re thinking: In the long term, as coding agents become more capable, defenders will gain the upper hand, as easy identification of vulnerabilities results in more-secure systems. But navigating this transition will be tricky, since advanced attackers may use tools that defenders have not yet gotten around to using.
Pitfalls in Assistive Models for the Blind
People whose vision is impaired increasingly use AI to assess their own appearance, raising questions about the psychological impact of AI models that are trained on conventional standards of beauty.
What’s new: Milagros Costabel, a blind freelance journalist, wrote about her experiences using a vision-language model as a virtual mirror. Her article on BBC.com explores challenges and potential pitfalls of relying on AI to judge personal qualities that are largely subjective and individual.
How it works: Costabel uses Be My Eyes, a smartphone app that provides a voice chatbot based on GPT-4 Vision. (Users can request to speak with a human volunteer to address critical or difficult issues.) She acknowledges the benefit of greater independence but highlights the challenge for blind people, who have little choice but to trust AI’s interpretation of what it sees. “For many blind people interviewed for this article, the experience feels both empowering and disorienting at once,” she writes.
Behind the news: A number of products aim to use vision-language models to assist visually impaired users. In addition to Be My Eyes and Envision AI, offerings include Microsoft Seeing AI, Aira Explorer, and navigation app Oko. Such apps increasingly connect with wearable devices. For instance, Envision Glasses and Ray-Ban Meta Smart Glasses (you can read a vision-impaired user’s report here) provide hands-free, real-time narration that describes surroundings, reads documents, and identifies specific faces.
Why it matters: AI applications that serve visually impaired users should provide objective, factual interpretations of visual input, to the extent that’s feasible. More broadly, truly accessible AI products must accommodate users who have no way to verify their output. This may require further technological development; meanwhile, keeping humans in the loop (as Be My Eyes, Aira Explorer, and others do) or providing certainty scores that help users modulate their trust in a model’s output can help bridge the gap.
We’re thinking: Building products of any kind requires empathy with users, but building AI products that help people to overcome sensory and other impairments requires exceptional empathy. Extensive testing in the real world and careful revisions based on user feedback will go a long way toward making products that help people both do and feel their best.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered how NeurIPS reversed restrictions on Chinese researchers after a boycott threat, highlighting rising tensions in global AI research. Subscribe today!
Dark DNA Unveiled
An open-weights model could help scientists compare the impact of genetic variations, identify mutations that cause diseases, and develop treatments.
What’s new: AlphaGenome interprets the 98 percent of the human and mouse genomes that don’t code for proteins but regulate gene expression and other functions. It finds properties such as where in a DNA sequence a gene begins and ends; how much RNA it directs a cell to produce; and where, as a cell reads a gene, it skips over parts of the gene sequence, a process in which errors can cause a variety of diseases.
How it works: The authors pretrained 64 models of identical architecture on human and mouse DNA sequences and their properties, drawn from four large public datasets, and then distilled the ensemble’s knowledge into a single model. Thus AlphaGenome learned to match the aggregate predictions of all 64 models.
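A minimal sketch of such an ensemble-distillation step, assuming the student simply regresses toward the teachers’ averaged predictions (the model definitions, averaging scheme, and MSE loss here are illustrative assumptions, not AlphaGenome’s actual recipe):

```python
# Hypothetical distillation step: a student model learns to match the
# averaged predictions of an ensemble of pretrained teacher models.
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, dna_batch, optimizer):
    # Average the teachers' outputs to form a single soft target.
    with torch.no_grad():
        target = torch.stack([t(dna_batch) for t in teachers]).mean(dim=0)
    pred = student(dna_batch)        # student predicts the same gene properties
    loss = F.mse_loss(pred, target)  # match the ensemble's aggregate output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```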
Results: The authors compared AlphaGenome to nine earlier models across two broad evaluations: finding properties of a gene sequence and predicting the effect of mutation (an alteration in the sequence) on those properties.
Why it matters: As recently as 15 years ago, non-coding DNA was widely believed to have no function at all. Since then, probing its functions has required painstaking experimentation. AlphaGenome puts that research into a model that anyone can use to find connections between this genomic netherworld and biological processes. For instance, the model makes it practical to compare functional differences between normal and mutated genes, revealing information that could be valuable in medicine and other biological disciplines.
We’re thinking: The notion that most of the human genome was “junk DNA” always seemed curious, and indeed scientists have discovered that it does essential things. We may be about to learn just how much it can do.
How Liquids and Gases Behave
Simulating complex physical systems through traditional numerical methods is slow and expensive, and simulations based on machine learning are usually specialized for a specific type of system, such as water in a pipe or the atmosphere surrounding a planet. Researchers built a general, transformer-based model for liquids, gases, and plasmas.
What’s new: Michael McCabe and colleagues at Polymathic AI Collaboration, a multi-institution, multi-disciplinary lab for scientific AI, released Walrus, a 1.3 billion-parameter model that simulates how fluids move, interact, and change over time. The model is freely available under an MIT license.
Key insight: Models often fail to simulate chaotic systems, which are highly sensitive to initial conditions, over long time horizons, because errors compound with each step. In transformers, these failures also stem from aliasing, in which errors compound at specific locations over multiple time steps. (The resulting artifacts resemble pixelation in image processing.) Randomly jittering, or time-shifting, the data at each time step before feeding it back into the model reduces these artifacts.
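As a rough illustration, one way such jittering could look in an autoregressive rollout is to shift the input by a small random offset before each prediction and undo the shift afterward, so errors cannot pile up at fixed grid locations. The function below is an assumption-laden sketch, not Walrus’s actual code:

```python
# Hypothetical rollout with random spatial jitter. The shift-and-undo
# scheme, shift size, and context handling are illustrative assumptions.
import torch

def jittered_rollout(model, frames, steps, max_shift=4):
    history = list(frames)  # initial context frames, each an (H, W) grid
    for _ in range(steps):
        x = torch.stack(history[-len(frames):])  # most recent context window
        dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        x = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))  # jitter the input
        nxt = model(x)  # predict the next (shifted) frame
        nxt = torch.roll(nxt, shifts=(-dy, -dx), dims=(-2, -1))  # undo the shift
        history.append(nxt)
    return history
```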
How it works: Walrus predicts the next state of a physical system given a sequence of previous states. It comprises (i) two encoders, one for 2D data like velocity and one for 3D data like volume, that compress previous snapshots of the physical system, or frames, into tokens; (ii) a split attention block that generates tokens that represent the next frame; and (iii) two decoders (2D and 3D) that turn those tokens into the next frame.
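The three-part structure described above might be sketched as follows; every module here is a stand-in, and the layer types, sizes, and the way the 2D and 3D token streams are merged are assumptions rather than Walrus’s implementation:

```python
# Hypothetical skeleton of the encoder / split-attention / decoder design.
import torch
import torch.nn as nn

class WalrusLikeModel(nn.Module):
    def __init__(self, dim=256, feat_2d=64, feat_3d=128):
        super().__init__()
        self.encode_2d = nn.Linear(feat_2d, dim)  # stands in for the 2D encoder
        self.encode_3d = nn.Linear(feat_3d, dim)  # stands in for the 3D encoder
        self.attn = nn.TransformerEncoder(        # stands in for split attention
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.decode_2d = nn.Linear(dim, feat_2d)
        self.decode_3d = nn.Linear(dim, feat_3d)

    def forward(self, tokens_2d, tokens_3d):
        # Encode previous frames' 2D and 3D fields into a shared token space.
        x = torch.cat([self.encode_2d(tokens_2d), self.encode_3d(tokens_3d)], dim=1)
        x = self.attn(x)  # generate tokens representing the next frame
        n2 = tokens_2d.shape[1]
        # Decode the two token streams back into physical fields.
        return self.decode_2d(x[:, :n2]), self.decode_3d(x[:, n2:])
```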
Results: The authors compared Walrus to earlier physics models including MPP-AViT, Poseidon, and DPOT.
Why it matters: Walrus potentially accelerates simulations in fields like climate science, aerospace, and materials science. Moreover, the authors’ jittering technique may improve vision and video generation models by suppressing artifacts that are common to transformer architectures. In fact, the pixel-like artifacts common to vision transformers led them to take this approach.
We’re thinking: Physics’ shift from specialized numerical solvers and special-purpose models to general-purpose transformers mirrors natural language processing’s evolution from task-specific models to LLMs. Just as LLMs learn to read and predict the most likely next words across a wide range of tasks and languages, transformers trained on diverse data appear to be able to predict the behavior of diverse materials in a wide array of domains.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.