Dear friends,
On Father’s Day last weekend, I sat with my daughter to help her practice solving arithmetic problems. To give her practice problems, I used OpenDevin, an open-source agentic coding framework, to write a Python script that generated questions that she enjoyed answering at her own pace. OpenDevin wrote the code much faster than I could have and genuinely improved my and my daughter’s day.
How can we test the code without requiring the user to write test cases? In a multi-agent system, each “agent” is an LLM prompted to play a particular role. An interesting result from AgentCoder shows that having separate agents for writing code and generating tests results in better performance than letting a single agent do both tasks. This is presumably because, if the agent writing the code is also responsible for writing the tests, the tests might be influenced by the code and fail to consider corner cases that the code does not cover.
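Here is a minimal sketch of that separation of roles. The `llm(prompt)` helper is hypothetical (a stand-in for whatever model you call), and AgentCoder's actual prompts and orchestration are more involved; the point is only that the tester never sees the coder's output.

```python
# A minimal sketch of separating the coder and tester roles. `llm` is a
# hypothetical helper (not a real library call) that sends a prompt to
# whatever model you use and returns its text response.

def llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM of choice.")

def write_code(task: str) -> str:
    # The coder agent sees only the task description.
    return llm(
        f"Write a Python function that solves this task:\n{task}\n"
        "Return only the code."
    )

def write_tests(task: str) -> str:
    # The tester agent also sees only the task description -- never the
    # generated code -- so its tests aren't biased by the implementation.
    return llm(
        f"Write assert-based test cases for a function that solves this task:\n{task}\n"
        "Cover typical inputs and corner cases. Return only the code."
    )

task = "Return the n-th Fibonacci number, with fib(0) == 0."
code, tests = write_code(task), write_tests(task)
# The code and tests can then be executed together (ideally in a sandbox)
# to decide whether the coder agent needs another revision round.
```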
When people think of testing code, many initially think of output testing, in which we check whether the code produces the correct outputs for a specific set of test inputs. If the code fails a test, an LLM can be prompted to reflect on why the code failed and then to try to fix it (a minimal version of such a loop is sketched below).
In addition to testing the output, the LDB method is helpful. LDB steps through the code and presents to the LLM the values of the variables during intermediate steps of execution, to see if the LLM can spot exactly where the error is. This mimics how a human developer might step through the code to see where one of the computational steps went wrong, and so pinpoint and fix the problem (the second sketch below shows the core mechanism).
A lot of agentic workflows mimic human workflows. Similar to other work in machine learning, if humans can do a task, then trying to mimic humans makes development much easier compared to inventing a new process. However, the authors of SWE-agent noticed that many tools that humans use for coding are very inefficient for agents. For example, giving an agent access to a bash shell and having it find a piece of code by executing numerous cd, ls, and cat commands is inefficient, even though humans can do this rapidly. Similarly, visual coding editors like VSCode, emacs, and vim are easy for humans to use, but hard for LLMs (or LMMs) to navigate. Because agents interact with computers differently than humans do, the authors found that building special-purpose tools (functions) to let an agent search, view, and edit codebases resulted in better performance.
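The output-test-and-reflect loop mentioned above might look like this in its simplest form. This is not any particular framework's API; it reuses the hypothetical `llm` helper from the earlier sketch, assumes the tests are a string of assert statements, and exec's the code directly for brevity (a real agent would sandbox it).

```python
# A minimal sketch of an output-test-and-reflect loop.
import traceback

def llm(prompt: str) -> str: raise NotImplementedError  # hypothetical helper, as above

def run_tests(code: str, tests: str) -> str | None:
    """Return None if all tests pass, otherwise a traceback to show the LLM."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(tests, namespace)  # run assert-based tests against it
        return None
    except Exception:
        return traceback.format_exc()

def code_with_reflection(task: str, tests: str, max_rounds: int = 3) -> str:
    code = llm(f"Write Python code for this task:\n{task}\nReturn only the code.")
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if error is None:
            return code  # all tests passed
        # Ask the model to reflect on the failure, then revise the code.
        code = llm(
            f"This code:\n{code}\nfailed its tests with:\n{error}\n"
            "Explain the likely cause, then return a corrected version."
        )
    return code
```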
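And here is a rough sketch of the core mechanism behind the LDB idea: run the code under a trace function, log variable values line by line, and hand that trace to the LLM so it can localize the faulty step. LDB itself segments code into basic blocks and uses richer runtime state; the example function and prompt below are illustrative only.

```python
# Run code under sys.settrace and record local variables after each line.
import sys

def trace_execution(code: str) -> list[str]:
    """Execute `code` and record local variable values at each traced line."""
    log: list[str] = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<sketch>":
            state = {k: v for k, v in frame.f_locals.items() if not k.startswith("__")}
            log.append(f"{frame.f_code.co_name} line {frame.f_lineno}: {state}")
        return tracer

    sys.settrace(tracer)
    try:
        exec(compile(code, "<sketch>", "exec"), {})
    finally:
        sys.settrace(None)
    return log

buggy = """
def mean(xs):
    total = 0
    for x in xs:
        total = x  # bug: overwrites instead of accumulating
    return total / len(xs)

result = mean([2, 4, 6])
"""

trace = "\n".join(trace_execution(buggy))
# The trace (which shows `total` jumping to 2, 4, 6 instead of summing) would
# go into a debugging prompt, e.g.:
#   llm(f"Code:\n{buggy}\nExecution trace:\n{trace}\nWhich line is wrong and why?")
```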
Keep coding!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Develop an AI agent that interacts with tabular data and SQL databases using natural language prompts to simplify querying and extracting insights! Start learning for free
News

More New Open Models
A trio of powerful open and semi-open models gives developers new options for both text and image generation.
What's new: Nvidia and Alibaba released high-performance large language models (LLMs), while Stability AI released a slimmed-down version of its flagship text-to-image generator.
Why it matters: AI models that come with published weights are proliferating, and this week's crop further extends the opportunity to build competitive AI applications. Nemotron-4 340B is exceptionally large among open LLMs. Among smaller models, Qwen2-72B poses stiff competition for Llama 3-70B, which has energized the developer community since its April release. And Stable Diffusion 3 puts Stability AI's image generation technology into the hands of developers working on edge devices.
We're thinking: Given the difficulty of acquiring high-quality data to train LLMs, and that the terms of service for many leading models prohibit generating data to train other models, Nvidia's choice to equip Nemotron-4 to generate synthetic data is especially welcome. And it makes sense from a business perspective: Making it easier for developers to train their own LLMs may be good for GPU sales.
Private Benchmarks for Fairer Tests
Scale AI offers new leaderboards based on its own benchmarks.
What's new: Scale AI, which helps companies prepare and manage training data, introduced the Safety, Evaluations and Alignment Lab (SEAL) Leaderboards. Four leaderboards test models' abilities to (i) generate code, (ii) work on Spanish-language inputs and outputs, (iii) follow detailed instructions, and (iv) solve fifth-grade math problems. The company currently tests 11 models from Anthropic, Google, Meta, Mistral, and OpenAI. Developers who want to have their model ranked can contact Scale AI via email.
Results: As of this writing, GPT-4 Turbo tops the Coding leaderboard with GPT-4o a very close second. GPT-4o tops the Spanish and Instruction Following leaderboards, just ahead of Gemini 1.5 Pro in Spanish and GPT-4 Turbo in Instruction Following. On the Math leaderboard, Claude 3 Opus holds a narrow lead over GPT-4 Turbo (second) and GPT-4o (third).
Behind the news: As more models are trained on data scraped from the web, leakage of test data into training sets has made it more difficult to evaluate their performance on common benchmarks. Earlier this year, researchers at Shanghai Jiao Tong University evaluated 31 open-source large language models and found that several had a high probability of inaccurate benchmark results due to data leakage. Scale AI built the GSM1k math dataset partly to show that some high-profile language models show evidence of overfitting to the common math benchmark GSM8k.
Why it matters: Traditionally, benchmarks have been open-source efforts. But proprietary benchmarks are emerging to help developers evaluate their models and applications with greater confidence. By keeping their datasets under wraps, companies like Scale AI and Vals AI ensure that models haven't been exposed to test questions and answers previously, making evaluations more reliable. However, private benchmarks lack the transparency of their open counterparts. A mix of public, private, and internal evals may be necessary to get a well-rounded picture of a given model's capabilities.
From Clip to Composition
Is your song's verse in need of a chorus? A popular text-to-music generator can extend existing recordings while maintaining their musical character.
What's new: Paying users of Udio, a web service that generates pop-song productions from prompts, can upload audio clips and extend or alter them according to a text description. The service also increased its context window from 30 seconds to 2 minutes for more coherent output. You can hear the new capability here. Subscriptions start at $10 per month.
Behind the news: Udio competes with Suno, whose service also generates audio output with vocals, lyrics, and song structures. Also in the mix is Stability AI, whose Stable Audio 2.0 enables users to upload and extend brief instrumental recordings to a length of around three minutes.
Why it matters: Udio is quickly becoming not just a song generator, but a song editor and builder. Just as the ability of text-to-image generators to edit, extend, and infill existing images made those applications more useful in a variety of creative situations, Udio's audio-to-audio capabilities give composers and producers new horizons for enhancing, orchestrating, and structuring their own productions.
We're thinking: Udio offers impressive capabilities for musicians (and wanna-be musicians), but its developer tools are lacking. A public-facing API would enable producers to automate the service and integrate it with other applications.
For Faster Diffusion, Think a GAN
Generative adversarial networks (GANs) produce images quickly, but they're of relatively low quality. Diffusion image generators typically take more time, but they produce higher-quality output. Researchers aimed to achieve the best of both worlds.
What's new: Axel Sauer and colleagues at Stability AI accelerated a diffusion model using a method called adversarial diffusion distillation (ADD). As the name implies, ADD combines diffusion with techniques borrowed from GANs and teacher-student distillation.
Key insight: GANs are fast because they produce images in a single step. Diffusion models are slower because they remove noise from a noisy image over many steps. A diffusion model can learn to generate images in a single denoising step if, like a GAN, it learns to fool a discriminator, while the discriminator learns to identify generated output. The resulting one-step output doesn't match the quality of multi-step diffusion, but distillation can improve it: While learning to fool the discriminator, the diffusion model (the student) can simultaneously learn to emulate the output of a different pretrained diffusion model (the teacher).
How it works: The authors paired a pretrained Stable Diffusion XL (SDXL) generator (the student) with a pretrained DINOv2 vision transformer discriminator. The teacher was another pretrained Stable Diffusion XL with frozen weights. They didn't specify the training dataset.
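To make the combined objective concrete, here is a schematic sketch of the student's loss in PyTorch-style code. The names (`student`, `teacher`, `discriminator`, `noisy_latents`, `prompt_emb`, `distill_weight`) are placeholders rather than the paper's API, the loss terms are simplified relative to the published method, and the separate step that updates the discriminator is omitted.

```python
# Schematic sketch of an adversarial-plus-distillation objective for the
# student; not a faithful reimplementation of ADD.
import torch
import torch.nn.functional as F

def add_losses(student, teacher, discriminator, noisy_latents, prompt_emb,
               distill_weight=1.0):
    # Student produces a denoised image in a single step.
    student_out = student(noisy_latents, prompt_emb)

    # Adversarial term: push the discriminator's score on the student's
    # output toward "real" (the discriminator is trained separately to
    # tell generated images from real ones).
    adv_loss = -discriminator(student_out, prompt_emb).mean()

    # Distillation term: match the frozen teacher's denoised estimate of
    # the same noisy latents.
    with torch.no_grad():
        teacher_out = teacher(noisy_latents, prompt_emb)
    distill_loss = F.mse_loss(student_out, teacher_out)

    # The student minimizes both terms; a weight balances them.
    return adv_loss + distill_weight * distill_loss
```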
Results: The authors tested their method using 100 prompts from PartiPrompts. They compared the student's output after either one or four denoising steps to a pretrained SDXL after 50 denoising steps. Human judges were asked which they preferred with respect to (i) image quality and (ii) alignment with the prompt. They preferred the student's four-step images about 57 percent of the time for image quality and about 55 percent of the time for alignment with the prompt. They preferred SDXL to the student's one-step images around 58 percent of the time for image quality and 52 percent of the time for alignment with the prompt.
Why it matters: In this work, the key steps — having a student model learn from a teacher model, and training a generator against a discriminator — are established techniques in their own right. Combining them conferred upon the student model the advantages of both.
We're thinking: With the growing popularity of diffusion models, how to reduce the number of steps they take while maintaining their performance is a hot topic. We look forward to future advances.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.