Dear friends,
There’s a new breed of GenAI Application Engineers who can build more-powerful applications faster than was possible before, thanks to generative AI. Individuals who can play this role are highly sought after by businesses, but the job description is still coming into focus. Let me describe their key skills, as well as the sorts of interview questions I use to identify them.
Skilled GenAI Application Engineers meet two primary criteria: (i) They are able to use the new AI building blocks to quickly build powerful applications. (ii) They are able to use AI assistance to carry out rapid engineering, building software systems in dramatically less time than was possible before. In addition, good product/design instincts are a significant bonus.
Now we have highly agentic coding assistants such as OpenAI’s Codex and Anthropic’s Claude Code (which I really enjoy using and find impressive in its ability to write code, test, and debug autonomously for many iterations). In the hands of skilled engineers — who don’t just “vibe code” but deeply understand AI and software architecture fundamentals and can steer a system toward a thoughtfully selected product goal — these tools make it possible to build software with unmatched speed and efficiency.
I find that AI-assisted coding techniques become obsolete much faster than AI building blocks, and techniques from 1 or 2 years ago are far from today's best practices. Part of the reason for this might be that, while AI builders might use dozens (hundreds?) of different building blocks, they aren’t likely to use dozens of different coding assistance tools at once, and so the forces of Darwinian competition are stronger among tools. Given the massive investments in this space by Anthropic, Google, OpenAI, and other players, I expect the frenetic pace of development to continue, but keeping up with the latest developments in AI-assisted coding tools will pay off, since each generation is much better than the last.
Keep building! Andrew
A MESSAGE FROM DEEPLEARNING.AI
The Data Analytics Professional Certificate is fully launched! In the “Data Storytelling” course, you’ll choose the right medium to present your analysis, design effective visuals, and learn techniques for aligning data insights with business goals. You’ll also receive guidance to build your portfolio and land a job in data analysis. Sign up
News
More Consistent Characters and Styles
Same character, new background, new action. That’s the focus of the latest text-to-image models from Germany’s Black Forest Labs.
What’s new: The FLUX.1 Kontext family, which comes in versions dubbed max, pro, and dev, is trained to alter images in controlled ways. The company plans to release the weights for FLUX.1 Kontext dev but has not yet specified the licensing terms.
How it works: The FLUX.1 Kontext models include encoders that embed input text and/or images, a transformer that processes them, and an image decoder that generates images. The current technical report doesn’t describe how the models were trained for character consistency and image editing.
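The report describes the architecture only at this level, but the data flow might look roughly like the sketch below. The module names are illustrative placeholders, not Black Forest Labs’ actual code or API.

```python
# Rough sketch of the data flow described above (illustrative only; the
# component names are placeholders, not Black Forest Labs' implementation).
import torch
import torch.nn as nn

class KontextStylePipeline(nn.Module):
    def __init__(self, text_encoder, image_encoder, transformer, image_decoder):
        super().__init__()
        self.text_encoder = text_encoder    # embeds the editing instruction
        self.image_encoder = image_encoder  # embeds the reference image
        self.transformer = transformer      # processes both embeddings jointly
        self.image_decoder = image_decoder  # generates the edited image

    def forward(self, prompt_tokens, reference_image):
        text_emb = self.text_encoder(prompt_tokens)
        image_emb = self.image_encoder(reference_image)
        # Concatenate along the sequence dimension so the transformer can attend
        # to the instruction and the reference image together.
        joint = torch.cat([text_emb, image_emb], dim=1)
        hidden = self.transformer(joint)
        return self.image_decoder(hidden)
```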
Results: The team compared the output of FLUX.1 Kontext models with that of five competing models including OpenAI GPT Image 1 (at three different quality levels) and Google Gemini 2.0 Flash native image generation. An undisclosed number of people evaluated the models according to a proprietary benchmark that highlights altering local and global aspects of an image, editing generated text within an image, maintaining consistent characters, and generating an image according to a reference style. The dataset included roughly 1,000 crowd-sourced pairs of text prompts and reference images.
Behind the news: Character consistency, also known as personalization, has come a long way since text-to-image generators became popular. In 2022, Textual Inversion showed how to learn an embedding of a character and use that embedding to produce further images. In 2023, DreamBooth showed how to get good results by fine-tuning a model on a few images of the character to be portrayed in a new situation. Since then, image-editing models have improved in quality and generality, including Meta Emu-Edit, OmniGen, and OpenAI gpt-image-1.
Why it matters: Consistency and precise editing enable artists to craft stories around specific characters. Such models have become better at generating consistent details across images, but they remain finicky, sometimes changing minute details or entire characters and backgrounds. The more faithfully they help users express their ideas, the more firmly embedded in the creative toolkit they’ll become.
We’re thinking: Black Forest Labs announced plans to publish its proprietary benchmark. There’s a real need for common benchmarks to evaluate image generation, and we hope other developers will give it due consideration.
AI Market Trends in Charts and Graphs
Renowned investment analyst Mary Meeker is back with a report on the AI market, six years after publishing her last survey of the internet.
What’s new: Meeker, co-founder of the venture capital firm Bond who formerly analyzed technology portfolios for Merrill Lynch, Salomon Brothers, and Morgan Stanley, published “Trends — Artificial Intelligence (May ‘25).” The report, which spans 340 graph-packed pages, revives and updates a series that chronicled the rise of the internet nearly every year from 1995 through 2019.
How it works: The new report focuses on a handful of themes that arise from the unprecedented growth and capabilities of deep learning. As Meeker told Axios, AI is an arena for “intense competition the likes of which we’ve never seen before,” and that makes the present time “a period for lots of wealth creation and wealth destruction.”
Behind the news: Meeker published her first “Internet Trends” report in 1995, anticipating the coming online boom, and she issued new editions annually throughout the 2000s and much of the 2010s. Her final internet report arrived in 2019, the year after she founded Bond; it highlighted the rise of visual social media like Instagram, wearable technology, and digital payments.
Why it matters: “Trends — Artificial Intelligence” offers a wealth of market data culled from analyst reports, consumer surveys, and academic studies. The AI community has a number of excellent annual surveys, including Stanford’s AI Index and Air Street Capital’s State of AI. Meeker, who has been watching technology markets since the dawning of the web, adds another valuable perspective.
We’re thinking: One implication of the report: There has never been a better time to build software applications. For developers, it’s time to hone and update skills. For tech companies, it’s time to cast the net for talent. As Meeker said in her interview with Axios, “Companies that get the best developers often win.”
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered the June preview release of Gemini 2.5 Pro. Subscribe today!
Benchmarking Costs Climb
An independent AI test lab detailed the rising cost of benchmarking reasoning models.
What’s new: Artificial Analysis, an organization that tracks model performance and cost, revealed its budgets for evaluating a few recent models that improve their output by producing chains of thought, which use extra computation and thus boost the cost of inference. The expense is making it difficult for startups, academic labs, and other organizations that have limited resources to reproduce results reported by model developers, TechCrunch reported. (Disclosure: Andrew Ng is an investor in Artificial Analysis.)
How it works: Artificial Analysis tested reasoning and non-reasoning models on popular benchmarks that gauge model performance in responding to queries that require specialized knowledge or multi-step reasoning, solving math problems, generating computer programs, and the like.
Behind the news: Generally, the cost per token of using AI models has been falling even as their performance has been rising. However, two factors complicate that trend. (i) Reasoning models produce more tokens and thus cost more to run, and (ii) developers are charging higher per-token prices to use their latest models. For example, o1-pro and GPT-4.5 (a non-reasoning model), both released in early 2025, cost $600 per million output tokens, while Claude 3.5 Sonnet (released in July 2024) costs $15 per million output tokens. Emerging techniques that let users set a reasoning budget (a “high” or “low” setting or a specific token count) also make benchmarking more costly and complicated.
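To see how the arithmetic compounds, here is a back-of-the-envelope sketch. The per-million-token prices come from the figures above; the question counts and tokens per answer are illustrative assumptions, not Artificial Analysis’ actual numbers.

```python
# Back-of-the-envelope benchmark-cost estimate (output tokens only).
# Prices ($ per million output tokens) come from the paragraph above;
# question counts and tokens per answer are illustrative assumptions.

PRICE_PER_MILLION_OUTPUT = {
    "o1-pro": 600.0,            # reasoning model
    "claude-3.5-sonnet": 15.0,  # non-reasoning model
}

def benchmark_cost(model, num_questions, avg_output_tokens_per_answer):
    """Estimated cost of one benchmark run, ignoring input-token charges."""
    total_tokens = num_questions * avg_output_tokens_per_answer
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT[model]

# Hypothetical 2,000-question benchmark: a reasoning model that emits ~10,000
# tokens per answer (chain of thought included) versus a non-reasoning model
# that emits ~1,000 tokens per answer.
print(benchmark_cost("o1-pro", 2_000, 10_000))            # 12000.0 -> $12,000
print(benchmark_cost("claude-3.5-sonnet", 2_000, 1_000))  # 30.0    -> $30
```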
Why it matters: Benchmarks aren’t sufficient on their own for evaluating models, but they are a critical indicator of relative performance, and independent benchmarking helps to ensure that tests are run in a fair and consistent way. As the cost of benchmarking climbs, fewer labs are likely to confirm or challenge results obtained by the original developer, making it harder to compare models and recognize progress.
We’re thinking: Verifying performance claims in independent, open, fair tests is essential to marking progress in general and choosing the right models for particular projects. It's time for the industry to support independent benchmarking organizations.
Better Video, Fewer Tokens
Researchers reduced the number of tokens needed to represent video frames before they’re fed to a transformer.
What’s new: Jindong Jiang, Xiuyu Li, and collaborators at Nvidia, Rutgers University, UC Berkeley, Massachusetts Institute of Technology, Nanjing University, and Korea Advanced Institute of Science and Technology built STORM, a text-video system that performs well in tests of video understanding while processing fewer tokens.
Key insight: In a multimodal system, a large language model (LLM) that receives video tokens may struggle to process long videos. However, sequences of video frames often contain lots of redundancy, since few pixels may change from one frame to the next. Instead of forcing the LLM to process long sequences of redundant video tokens, mamba layers can enrich the token embeddings that represent one frame with information from other frames in the same clip. That way, the system can average token embeddings across frames without losing crucial information, making it possible to feed fewer tokens to the LLM without compromising performance.
How it works: The authors built STORM by training three components: (1) a pretrained SigLIP vision transformer, (2) untrained mamba layers, and (3) the pretrained LLM from Qwen2-VL. They trained the system to predict the next token on image-text pairs, video-text pairs with 32-frame videos, and video-text pairs with 128-frame videos.
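Here is a rough sketch of the token-compression idea, not the authors’ implementation: a temporal mixing layer (a GRU below, standing in for the mamba layers) propagates information across frames, and averaging groups of frames shrinks the sequence before it reaches the LLM. The shapes and module choices are illustrative assumptions.

```python
# Illustrative sketch of STORM-style token compression (not the authors' code).
# A GRU stands in for the mamba layers; any module that mixes information
# across frames plays the same role here.
import torch
import torch.nn as nn

class TemporalCompressor(nn.Module):
    def __init__(self, dim, pool_frames=8):
        super().__init__()
        self.temporal_mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in for mamba layers
        self.pool_frames = pool_frames  # frames averaged into one set of tokens

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, tokens_per_frame, dim) from a vision encoder
        b, f, t, d = frame_tokens.shape
        # Mix information across frames separately for each spatial token position.
        x = frame_tokens.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x, _ = self.temporal_mixer(x)
        x = x.reshape(b, t, f, d).permute(0, 2, 1, 3)  # back to (batch, frames, tokens, dim)
        # Average each group of pool_frames frames: because the enriched tokens
        # already carry cross-frame information, redundant frames collapse into
        # one set of tokens, and the LLM sees 1/pool_frames as many video tokens.
        x = x.reshape(b, f // self.pool_frames, self.pool_frames, t, d).mean(dim=2)
        return x.reshape(b, -1, d)  # flattened sequence of compressed video tokens

# Example: 32 frames of 196 tokens each, pooled 8x -> 4 * 196 = 784 tokens for the LLM.
compressor = TemporalCompressor(dim=1024, pool_frames=8)
video_tokens = torch.randn(1, 32, 196, 1024)
print(compressor(video_tokens).shape)  # torch.Size([1, 784, 1024])
```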
Results: STORM outperformed proprietary and open models on measures of video understanding.
Why it matters: STORM compresses video at the input to the LLM, so the LLM processes 1/8 as many video tokens and uses 1/8 as much compute to process them. This enables the system to work more than 3 times faster than the baseline while performing better.
We’re thinking: Initial work on the mamba architecture positioned it as a replacement for the transformer, but this work, along with other projects, combines them to get the benefits of both.
A MESSAGE FROM DEEPLEARNING.AI
In “Orchestrating Workflows for GenAI Applications,” you’ll learn to orchestrate generative AI workflows using Apache Airflow 3.0. You’ll build and schedule RAG pipelines, run tasks in parallel, and add retries and alerts for reliability. No prior Airflow experience is needed! Enroll for free
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.