The Batch top banner - June 11, 2025

 

 

Dear friends,

 

There’s a new breed of GenAI Application Engineers who can build more-powerful applications faster than was possible before, thanks to generative AI. Individuals who can play this role are highly sought-after by businesses, but the job description is still coming into focus. Let me describe their key skills, as well as the sorts of interview questions I use to identify them.

 

Skilled GenAI Application Engineers meet two primary criteria: (i) They are able to use the new AI building blocks to quickly build powerful applications. (ii) They are able to use AI assistance to carry out rapid engineering, building software systems in dramatically less time than was possible before. In addition, good product/design instincts are a significant bonus. 


AI building blocks. If you own many copies of only a single type of Lego brick, you might be able to build some basic structures. But if you own many types of bricks, you can combine them rapidly to form complex, functional structures. Software frameworks, SDKs, and other such tools are like that. If all you know is how to call a large language model (LLM) API, that's a great start. But if you command a broad range of building block types — such as prompting techniques, agentic frameworks, evals, guardrails, RAG, voice stacks, async programming, data extraction, embeddings/vectorDBs, model fine-tuning, graphDB usage with LLMs, agentic browser/computer use, MCP, reasoning models, and so on — then you can combine them into much richer applications.
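For instance, here is a minimal sketch of how just two of these building blocks (embeddings/vectorDBs and an LLM call) combine into a simple RAG workflow. The `embed`, `vector_db`, and `llm` helpers are hypothetical stand-ins for whatever SDKs you actually use, not any particular library's API.

```python
# A toy combination of building blocks: retrieval (embeddings + vector search) feeding an LLM call.
# `embed`, `vector_db`, and `llm` are hypothetical stand-ins, not a specific SDK's API.

def answer_with_rag(question: str, vector_db, embed, llm, top_k: int = 3) -> str:
    """Retrieve relevant documents, then ask the LLM to answer grounded in them."""
    docs = vector_db.search(embed(question), top_k=top_k)    # retrieval building block
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                                        # LLM-call building block
```

Swap in guardrails, an agentic loop, or a voice stack, and the same few lines grow into a much more capable application.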


The number of powerful AI building blocks continues to grow rapidly. As open-source contributors and businesses make more building blocks available, staying on top of what's available helps you keep expanding what you can build. And even as new building blocks appear, many from 1 to 2 years ago (such as eval techniques or frameworks for using vectorDBs) are still very relevant today.


AI-assisted coding. AI-assisted coding tools enable developers to be far more productive, and such tools are advancing rapidly. GitHub Copilot, first announced in 2021 (and made widely available in 2022), pioneered modern code autocompletion. Shortly after, a new breed of AI-enabled IDEs such as Cursor and Windsurf offered much better code Q&A and code generation. As LLMs improved, the AI-assisted coding tools built on them improved as well.

 

Now we have highly agentic coding assistants such as OpenAI’s Codex and Anthropic’s Claude Code (which I really enjoy using and find impressive in its ability to write code, test, and debug autonomously for many iterations). In the hands of skilled engineers — who don’t just “vibe code” but deeply understand AI and software architecture fundamentals and can steer a system toward a thoughtfully selected product goal — these tools make it possible to build software with unmatched speed and efficiency.

Colorful LEGO bricks labeled for AI concepts: prompting, agentic, guardrails, evals, RAG, fine-tuning, computer use, async programming.

I find that AI-assisted coding techniques become obsolete much faster than AI building blocks, and techniques from 1 or 2 years ago are far from today's best practices. Part of the reason might be that, while AI builders might use dozens (hundreds?) of different building blocks, they aren't likely to use dozens of different coding assistants at once, so the forces of Darwinian competition are stronger among tools. Given the massive investments in this space by Anthropic, Google, OpenAI, and other players, I expect the frenetic pace of development to continue, and keeping up with the latest developments in AI-assisted coding tools will pay off, since each generation is much better than the last.


Bonus: Product skills. In some companies, engineers are expected to take pixel-perfect drawings of a product, specified in great detail, and write code to implement them. But if a product manager has to specify even the smallest detail, this slows down the team. The shortage of AI product managers exacerbates this problem. I see teams move much faster if GenAI Engineers also have some user empathy as well as basic skill at designing products, so that, given only high-level guidance on what to build (“a user interface that lets users see their profiles and change their passwords”), they can make many decisions themselves and build at least a prototype to iterate from.


When interviewing GenAI Application Engineers, I will usually ask about their mastery of AI building blocks and ability to use AI-assisted coding, and sometimes also their product/design instincts. One additional question I've found highly predictive of their skill is, “How do you keep up with the latest developments in AI?” Because AI is evolving so rapidly, someone with good strategies for keeping up — such as reading The Batch and taking short courses 😃, regular hands-on practice building projects, and having a community to talk to — really does stay ahead of the game much better than someone with less-effective strategies (such as relying on social media as their main source of information about AI, which typically doesn't provide the depth needed to keep up).

 

Keep building!

Andrew 

 

 

A MESSAGE FROM DEEPLEARNING.AI

Promo banner for: "Data Analytics Professional Certificate: Data Storytelling"

The Data Analytics Professional Certificate is fully launched! In the “Data Storytelling” course, you’ll choose the right medium to present your analysis, design effective visuals, and learn techniques for aligning data insights with business goals. You’ll also receive guidance to build your portfolio and land a job in data analysis. Sign up

 

News

The FLUX.1 Kontext family of image generators from Black Forest Labs edits images to remove or add objects, apply art styles, and extract details.

More Consistent Characters and Styles

 

Same character, new background, new action. That’s the focus of the latest text-to-image models from Germany’s Black Forest Labs.

 

What’s new: The FLUX.1 Kontext family, which comes in versions dubbed max, pro, and dev, is trained to alter images in controlled ways. The company plans to release the weights for FLUX.1 Kontext dev but has not yet specified the licensing terms.

  • Input/output: text, image in; image out
  • Architecture: Unspecified text encoders, convolutional neural network image encoder-decoder, transformer. FLUX.1 Kontext dev 12 billion parameters, other parameter counts undisclosed
  • Features: Character consistency, local and global alterations
  • Availability/price: FLUX.1 Kontext max and FLUX.1 Kontext pro available via FLUX Playground and various partners; $0.08 per image (FLUX.1 Kontext max) and $0.04 per image (FLUX.1 Kontext pro) via Fal, an image-generation platform.
  • Undisclosed: Parameter counts of FLUX.1 Kontext max and FLUX.1 Kontext pro, architecture of text encoders, training data, evaluation protocol, open-weights license

How it works: The FLUX.1 Kontext models include encoders that embed input text and/or images, a transformer that processes them, and an image decoder that generates images. The current technical report doesn't describe how the team trained them for character consistency and image editing.

  • The team trained the convolutional neural network encoder-decoder to reproduce images and to fool a discriminator (architecture and training unspecified) into classifying them as real.
  • Having frozen the encoders, they trained the transformer — given a time step, embedding of a text prompt, embedding of a reference image, and noisy image embedding — to remove the noise over a series of steps.
  • They further trained the transformer to encourage it to produce noise-free embeddings that a second discriminator would classify as representing real images. This process, a variant of adversarial diffusion distillation, helps reduce the number of steps needed to produce a good image embedding. (The underlying denoising step is sketched in code below.)
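For readers who want a concrete picture, here is a rough sketch of the denoising training step described above. It's a generic diffusion-style objective, not Black Forest Labs' actual code; the transformer's call signature, the noise schedule, and the loss are assumptions, since the report doesn't specify them.

```python
# Rough sketch of one denoising training step with frozen encoders.
# Everything here is an assumption: FLUX.1 Kontext's actual objective, schedule,
# and conditioning interface are not described in the technical report.
import torch
import torch.nn.functional as F

def denoising_step(transformer, text_emb, ref_img_emb, clean_img_emb):
    """Train the transformer to predict the noise added to the target image embedding,
    conditioned on a time step, the text-prompt embedding, and the reference-image embedding."""
    b = clean_img_emb.shape[0]
    t = torch.rand(b, device=clean_img_emb.device)              # random time step per example
    noise = torch.randn_like(clean_img_emb)
    alpha = t.view(b, *([1] * (clean_img_emb.dim() - 1)))       # broadcast t over embedding dims
    noisy = (1 - alpha) * clean_img_emb + alpha * noise         # interpolate toward pure noise
    pred_noise = transformer(noisy, t, text_emb, ref_img_emb)   # hypothetical call signature
    return F.mse_loss(pred_noise, noise)                        # learn to recover the noise
```

At inference, the transformer runs repeatedly from pure noise toward a clean embedding, which the image decoder turns into pixels; the distillation step in the last bullet reduces how many of those iterations are needed.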

Results: The team compared the output of FLUX.1 Kontext models with that of five competing models including OpenAI GPT Image 1 (at three different quality levels) and Google Gemini 2.0 Flash native image generation. An undisclosed number of people evaluated the models according to a proprietary benchmark that highlights altering local and global aspects of an image, editing generated text within an image, maintaining consistent characters, and generating an image according to a reference style. The dataset included roughly 1,000 crowd-sourced pairs of text prompts and reference images.

  • FLUX.1 Kontext max and FLUX.1 Kontext pro outperformed all competing models.
  • FLUX.1 Kontext dev outperformed all except other family members and GPT Image 1 set to high or medium quality.

Behind the news: Character consistency, also known as personalization, has come a long way since text-to-image generators became popular. In 2022, Textual Inversion showed how to learn an embedding of a character and use that embedding to produce further images. In 2023, DreamBooth showed how to get good results by fine-tuning a model on a few images of the character to be portrayed in a new situation. Since then, image-editing models have improved in quality and generality, including Meta Emu-Edit, OmniGen, and OpenAI gpt-image-1.

 

Why it matters: Consistency and precise editing enable artists to craft stories around specific characters. Such models have become better at generating consistent details across images, but they remain finicky, sometimes changing minute details or entire characters and backgrounds. The more faithfully they help users express their ideas, the more firmly embedded in the creative toolkit they’ll become.

 

We’re thinking: Black Forest Labs announced plans to publish its proprietary benchmark. There’s a real need for common benchmarks to evaluate image generation, and we hope other developers will give it due consideration.

 

Charts drawn from Mary Meeker’s inaugural “Trends — Artificial Intelligence” report show global growth of AI users, rising AI job demand vs. decline in non-AI jobs, and AI revenue lagging compute costs.

AI Market Trends in Charts and Graphs

 

Renowned investment analyst Mary Meeker is back with a report on the AI market, six years after publishing her last survey of the internet.

 

What’s new: Meeker, co-founder of the venture capital firm Bond, who formerly analyzed technology portfolios for Merrill Lynch, Salomon Brothers, and Morgan Stanley, published “Trends — Artificial Intelligence (May ‘25).” The report, which spans 340 graph-packed pages, revives and updates a series that chronicled the rise of the internet nearly every year from 1995 through 2019.

 

How it works: The new report focuses on a handful of themes that arise from the unprecedented growth and capabilities of deep learning. As Meeker told Axios, AI is an arena for “intense competition the likes of which we’ve never seen before,” and that makes the present time “a period for lots of wealth creation and wealth destruction.”

  • Rapid growth: Change in AI is happening faster than ever. Users of ChatGPT reached 1 million in 5 days — compared to the iPhone’s 74 days — and since then have rocketed to 800 million. Total capital expenditures of the six biggest technology companies (largely driven by AI) rose 63 percent to $212 billion between 2023 and 2024. Training datasets are growing 260 percent per year, processing power devoted to training is growing 360 percent per year, and effective processing power is growing 200 percent annually.
  • Revenues and costs: The economics of this new world are not straightforward. On one hand, revenue is soaring at giants like Amazon, Google, and Nvidia as well as startups like Scale AI. On the other hand, the cost of computation is rising steadily even as the cost per token of output falls precipitously. Meanwhile, rapid turnover of models and proliferation of open-source alternatives are wild cards for AI-powered businesses.
  • Rising performance: AI performance continues to increase. On the MMLU benchmark of language understanding, AI outstripped human performance last year. This year, 73 percent of human testers classified responses generated by an LLM as human, according to one study. Synthetic images, video, and speech are all increasingly capable of fooling human testers.
  • Emerging capabilities: Today’s AI is capable of writing and editing, tutoring, brainstorming, automating repetitive work, and providing companionship. Meeker projects that within five years, it will generate code as well as humans, create films and games, operate humanlike robots, and drive scientific discovery. Within 10 years, she forecasts, AI will conduct scientific research, design advanced technologies, and build immersive digital worlds.
  • Workforce implications: Industries most likely to be affected by AI include knowledge work, content creation, legal services, software development, financial services, customer service, drug discovery, and manufacturing. Employers are adopting AI for a boost in workforce productivity that Stanford researchers estimate averages 14 percent. Companies like Box, Duolingo, and Shopify are adopting an AI-first orientation, while AI-related job titles have risen 200 percent in the past two years.
  • AI gets physical: AI is having a profound impact on the physical world. Lyft’s and Uber’s market share fell around 15 percent while Waymo’s gained 27 percent over the past 18 months. AI-driven mineral exploration is boosting mine efficiency, and AI-powered agriculture is cutting the use of pesticides. And, sadly, AI-equipped attack drones are wreaking destruction in Ukraine and elsewhere, even as they play a critical role in defense.

Behind the news: Meeker published her first “Internet Trends” report in 1995, anticipating the coming online boom, and she issued new editions annually throughout the 2000s and much of the following decade. Her final internet report arrived in 2019, the year after she founded Bond; it highlighted the rise of visual social media like Instagram, wearable technology, and digital payments.

 

Why it matters: “Trends — Artificial Intelligence” offers a wealth of market data culled from analyst reports, consumer surveys, and academic studies. The AI community has a number of excellent annual surveys, including Stanford’s AI Index and Air Street Capital’s State of AI. Meeker, who has been watching technology markets since the dawning of the web, adds another valuable perspective.

 

We’re thinking: One implication of the report: There has never been a better time to build software applications. For developers, it’s time to hone and update skills. For tech companies, it’s time to cast the net for talent. As Meeker said in her interview with Axios, “Companies that get the best developers often win.”

 

Office team analyzing data and reports, with computers displaying graphs and charts in a modern workspace setting.

Learn More About AI With Data Points!

 

AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered the June preview release of Gemini 2.5 Pro. Subscribe today!

 

Bar chart that compares the costs to run a suite of popular benchmarks on reasoning and non-reasoning AI models.

Benchmarking Costs Climb

 

An independent AI test lab detailed the rising cost of benchmarking reasoning models.

 

What’s new: Artificial Analysis, an organization that tracks model performance and cost, revealed its budgets for evaluating a few recent models that improve their output by producing chains of thought, which use extra computation and thus boost the cost of inference. The expense is making it difficult for startups, academic labs, and other organizations that have limited resources to reproduce results reported by model developers, TechCrunch reported. (Disclosure: Andrew Ng is an investor in Artificial Analysis.)

 

How it works: Artificial Analysis tested reasoning and non-reasoning models on popular benchmarks that gauge model performance in responding to queries that require specialized knowledge or multi-step reasoning, solving math problems, generating computer programs, and the like.

  • Running a group of seven popular benchmarks, OpenAI o1 (which produces chains of thought) produced more than 44 million tokens, while GPT-4o (which doesn’t take explicit reasoning steps) produced around 5.5 million tokens.
  • Benchmarking o1 cost $2,767, while benchmarking Anthropic Claude 3.7 Sonnet (which allows users to allocate a number of reasoning tokens per query; TechCrunch doesn’t provide the number in this case) cost $1,485. Smaller reasoning models are significantly less expensive: o3-mini (at high effort, which uses the highest number of reasoning tokens per query) cost $345, and o1-mini cost $141.
  • Non-reasoning models are less expensive to test: Evaluating GPT-4o cost $109, and evaluating Claude 3.5 Sonnet cost $81.
  • Artificial Analysis spent around $5,200 to test 12 reasoning models versus around $2,400 to test more than 80 non-reasoning models.

Behind the news: Generally, the cost per token of using AI models has been falling even as their performance has been rising. However, two factors complicate that trend. (i) Reasoning models produce more tokens and thus cost more to run, and (ii) developers are charging higher per-token prices to use their latest models. For example, o1-pro and GPT-4.5 (a non-reasoning model), both released in early 2025, cost $600 per million output tokens, while Claude 3.5 Sonnet (released in July 2024) costs $15 per million tokens of output. Emerging techniques that allow users to allocate numbers of tokens to reasoning (whether “high” or “low” or a specific tally) also make benchmarking more costly and complicated.
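To see how these factors compound, here's a back-of-the-envelope calculation in the spirit of the figures above. The token counts come from Artificial Analysis' tally; the per-token prices (roughly $60 and $10 per million output tokens for o1 and GPT-4o) are assumed list prices rather than numbers from the article, and input-token costs are ignored.

```python
# Back-of-the-envelope benchmark cost from output tokens alone.
# Prices are assumed list prices, not figures reported by Artificial Analysis,
# and input-token costs are ignored, so both estimates run low.

def output_cost(output_tokens: float, usd_per_million_tokens: float) -> float:
    return output_tokens / 1e6 * usd_per_million_tokens

o1_estimate = output_cost(44e6, 60.0)      # ~$2,640, in the ballpark of the reported $2,767
gpt4o_estimate = output_cost(5.5e6, 10.0)  # ~$55, versus the reported $109 total
print(f"o1 ~ ${o1_estimate:,.0f}   GPT-4o ~ ${gpt4o_estimate:,.0f}")
```

Roughly 8 times as many output tokens at several times the per-token price multiplies the bill by more than an order of magnitude.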

 

Why it matters: Benchmarks aren’t entirely sufficient for evaluating models, but they are a critical indicator of relative performance, and independent benchmarking helps to ensure that tests are run in a fair and consistent way. As the cost of benchmarking climbs, fewer labs are likely to confirm or challenge results obtained by the original developer, making it harder to compare models and recognize progress.

 

We’re thinking: Verifying performance claims in independent, open, fair tests is essential to marking progress in general and choosing the right models for particular projects. It's time for the industry to support independent benchmarking organizations.

 

Overview of the STORM pipeline: Mamba layers between the image encoder and LLM bridge the gap between visual and language representation while injecting the temporal information into the tokens. The processed tokens capture temporal history. This enables the system to reduce the number of image tokens without sacrificing essential information.

Better Video, Fewer Tokens

 

Researchers reduced the number of tokens needed to represent video frames before they're fed to a transformer.

 

What’s new: Jindong Jiang, Xiuyu Li, and collaborators at Nvidia, Rutgers University, UC Berkeley, Massachusetts Institute of Technology, Nanjing University, and Korea Advanced Institute of Science and Technology built STORM, a text-video system that performs well in tests of video understanding while processing fewer tokens.

 

Key insight: In a multimodal system, a large language model (LLM) that receives video tokens may struggle to process long videos. However, sequences of video frames often contain lots of redundancy, since few pixels may change from one frame to the next. Instead of forcing the LLM to process long sequences of redundant video tokens, mamba layers can enrich the token embeddings that represent one frame with information from other frames in the same clip. That way, the system can average token embeddings across frames without losing crucial information, making it possible to feed fewer tokens to the LLM without compromising performance.

 

How it works: The authors built STORM by training three components: (1) a pretrained SigLIP vision transformer, (2) untrained mamba layers, and (3) the pretrained large language model (LLM) from Qwen2-VL. They trained the system to predict the next token in image-text pairs, in video-text pairs with 32-frame videos, and in video-text pairs with 128-frame videos.

  • SigLIP learned to turn each video frame into 256 image tokens.
  • Given a sequence of image tokens, mamba layers learned to process them in both directions – left-to-right and right-to-left – so each output token embedding encoded information from the entire video.
  • The system averaged the token embeddings of 4 consecutive frames, reducing by a factor of 4 the number of tokens processed by Qwen2-VL’s LLM.
  • Given the averaged token embeddings, Qwen2-VL LLM learned to predict the next word in the video’s associated text. 
  • At inference, the system fed the LLM the tokens that represented every second frame (a process the authors call temporal sampling), which further halved the input to the LLM, as shown in the sketch below.
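The sketch below illustrates the token-reduction arithmetic (temporal sampling plus 4-frame averaging). It uses assumed shapes and treats the mamba layers' output as given; it is an illustration of the idea, not the authors' code.

```python
# Simplified sketch of STORM's token reduction (assumed shapes; not the authors' code).
import torch

def reduce_video_tokens(frame_tokens, temporal_pool=4, frame_stride=2, training=False):
    """frame_tokens: [frames, tokens_per_frame, dim], already enriched with temporal context
    by the bidirectional mamba layers, so averaging across frames loses little information."""
    if not training:
        frame_tokens = frame_tokens[::frame_stride]        # temporal sampling: keep every 2nd frame
    t, n, d = frame_tokens.shape
    t = (t // temporal_pool) * temporal_pool               # drop any ragged tail
    groups = frame_tokens[:t].reshape(t // temporal_pool, temporal_pool, n, d)
    pooled = groups.mean(dim=1)                            # average each run of 4 consecutive frames
    return pooled.reshape(-1, d)                           # flatten into the LLM's input sequence

# 128 frames x 256 tokens = 32,768 tokens; sampling and pooling leave 16 x 256 = 4,096 (an 8x cut).
video = torch.randn(128, 256, 1152)                        # 1152 = assumed embedding width
print(reduce_video_tokens(video).shape)                    # torch.Size([4096, 1152])
```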

Results: STORM outperformed proprietary and open models on measures of video understanding.

  • On MVBench, which asks multiple-choice questions about actions, object interactions, and scene transitions in 16-second videos, STORM achieved 70.6 percent accuracy. That’s better than GPT-4o (64.6 percent accuracy) and Qwen2-VL (67.0 percent accuracy). A baseline system (STORM’s SigLIP and Qwen2-VL LLM without mamba layers, averaging image tokens, and temporal sampling) achieved 69.5 percent.
  • On MLVU, which asks multiple-choice and open-ended questions about videos that range from 3 minutes to over 2 hours long, STORM reached 72.9 percent accuracy, topping GPT-4o (66.2 percent accuracy). The baseline model achieved 70.2 percent.

Why it matters: STORM compresses video at the input to the LLM, so the LLM processes 1/8 as many video tokens and uses 1/8 as much compute to process them. This enables the system to work more than 3 times faster than the baseline while performing better.

 

We’re thinking: Initial work on the mamba architecture positioned it as a replacement for the transformer, but this work, along with other projects, combines them to get the benefits of both.

 

A MESSAGE FROM DEEPLEARNING.AI

Promo banner for: "Orchestrating Workflows for GenAI Applications"

In “Orchestrating Workflows for GenAI Applications” you’ll learn to orchestrate generative AI workflows using Apache Airflow 3.0. You’ll build and schedule RAG pipelines, run tasks in parallel, and add retries and alerts for reliability. No prior Airflow experience is needed! Enroll for free

 

Work With Andrew Ng

 

Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.

 


 

Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.

 
