Dear friends,
After a recent price reduction by OpenAI, GPT-4o tokens now cost $4 per million tokens (using a blended rate that assumes 80% input and 20% output tokens). GPT-4 cost $36 per million tokens at its initial release in March 2023. This price reduction over 17 months corresponds to about a 79% drop in price per year: 4/36 = (1 - p)^(17/12). (OpenAI charges a lower price, just $2 per million tokens, for using a new Batch API that takes up to 24 hours to respond to a batch of prompts. That’s an 87% drop in price per year.)
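For readers who want to check the arithmetic, here is a small Python snippet that solves for the annualized rate of decline implied by a price change over a given number of months.

```python
# Solve new/old = (1 - p)^(months/12) for p, the annual rate of price decline.

def annualized_drop(new_price, old_price, months):
    """Return the implied annual rate of price decline p."""
    ratio = new_price / old_price
    return 1 - ratio ** (12 / months)

print(annualized_drop(4, 36, 17))   # ~0.79 -> about a 79% drop per year
print(annualized_drop(2, 36, 17))   # ~0.87 -> about an 87% drop per year (Batch API rate)
```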
Further, hardware innovations by companies such as Groq (a leading player in fast token generation), SambaNova (which serves Llama 3.1 405B tokens at an impressive 114 tokens per second), and wafer-scale computation startup Cerebras (which just announced a new offering), as well as the semiconductor giants NVIDIA, AMD, Intel, and Qualcomm, will drive further price cuts.
This means that even if an agentic workload you build isn’t quite economical today, falling token prices may well make it economical down the road. As I wrote previously, being able to process many tokens is particularly important for agentic workloads, which must call a model many times before generating a result. Further, even agentic workloads are already quite affordable for many applications. Say you build an application to assist a human worker, and it uses 100 tokens per second continuously: at $4 per million tokens, you’d be spending only $1.44/hour, which is significantly lower than the minimum wage in the U.S. and many other countries.
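Here is that back-of-the-envelope calculation in code, assuming the application streams tokens continuously at the stated rate.

```python
# Hourly cost of an assistant that consumes tokens continuously.

def hourly_cost(tokens_per_second, dollars_per_million_tokens):
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour * dollars_per_million_tokens / 1e6

print(hourly_cost(100, 4.0))  # 1.44 dollars/hour at $4 per million tokens
```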
So how can AI companies prepare?
Because multiple providers now host Llama 3.1 and other open-weight models, if you use one of these models, it might be possible to switch between providers without too much testing (though implementation details, particularly quantization, mean that different offerings of the same model can differ in performance). Unfortunately, a major barrier to switching models is still the difficulty of implementing evals, so carrying out regression testing to make sure your application still performs well after you swap in a new model can be challenging. However, as the science of carrying out evals improves, I’m optimistic that this will become easier.

Keep learning!
Andrew
A MESSAGE FROM DEEPLEARNING.AI

In our short course “Large Multimodal Model Prompting with Gemini,” you’ll learn how to build systems that reason across text, images, and video, and how prompting multimodal models differs from prompting text-only LLMs. You’ll also optimize LMM systems and output. Enroll today!
News
A Lost Voice Regained

A man who lost the ability to speak four years ago is sounding like his earlier self, thanks to a collection of brain implants and machine learning models.

What’s new: Researchers built a system that decodes speech signals from the brain of a man who lost the ability to speak clearly due to amyotrophic lateral sclerosis (ALS) and enables him to speak through a synthetic version of his former voice. At the start of the study, his efforts to speak were intelligible only to his personal caregiver. Now he converses regularly with family and friends, The New York Times reported. Nicholas Card built the system with colleagues at the University of California, Davis; Stanford University; Washington University; Brown University; VA Providence Healthcare; and Harvard Medical School.

How it works: The authors surgically implanted four electrode arrays into areas of the brain that are responsible for speech. The system learned to decode the patient’s brain signals, decide the most likely phonemes he intended to speak, determine the words those phonemes express, and display and speak the words aloud using a personalized speech synthesizer.
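As a rough illustration of the decoding step described above (and of the GRU decoder mentioned in the results below), here is a minimal PyTorch sketch that maps a sequence of neural-feature vectors to per-timestep phoneme probabilities. The feature dimension, layer sizes, and phoneme count are illustrative assumptions, not the authors’ values.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a GRU that maps a sequence of neural-feature vectors
# (e.g., binned activity from the electrode arrays) to per-timestep phoneme logits.
# All dimensions below are assumptions for demonstration, not the paper's values.

N_FEATURES = 256   # assumed neural features per time bin
N_PHONEMES = 41    # assumed phoneme inventory size

class PhonemeDecoder(nn.Module):
    def __init__(self, hidden=512, layers=3):
        super().__init__()
        self.gru = nn.GRU(N_FEATURES, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, N_PHONEMES)

    def forward(self, x):              # x: (batch, time, N_FEATURES)
        h, _ = self.gru(x)
        return self.head(h)            # (batch, time, N_PHONEMES) logits

decoder = PhonemeDecoder()
features = torch.randn(1, 100, N_FEATURES)   # 100 time bins of synthetic data
phoneme_logits = decoder(features)
print(phoneme_logits.shape)                  # torch.Size([1, 100, 41])
# A language model would then map the decoded phoneme sequence to words
# for display and speech synthesis, as described above.
```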
Results: After two hours of recording the patient’s brain signals and training on that data, the system achieved 90.2 percent accuracy in a copying task. By the final session, the system achieved 97.5 percent accuracy and enabled the patient to speak on average 31.6 words per minute using a vocabulary of 125,000 words.

Behind the news: Previous work either had much lower accuracy or generated a limited vocabulary. The new work improved upon a 2023 study that enabled ALS patients to speak with 76.2 percent accuracy using a vocabulary of equal size.

Why it matters: Relative to the 2023 study on which this one was based, the authors changed the positions of the electrodes in the brain and continued to update the GRU (gated recurrent unit) decoder throughout the recording/training sessions. It’s unclear which changes contributed most to the improved outcome. As language models improve, new models potentially could act as drop-in replacements for the models in the authors’ system, further improving accuracy. Likewise, improvements in text-to-speech systems could increase the similarity between the synthetic voice and the patient’s former voice.

We’re thinking: Enabling someone to speak again restores agency. Enabling someone to speak again in their own voice restores identity.
Agentic Coding Strides Forward

An agentic coding assistant boosted the state of the art on an important benchmark to more than 30 percent.

What’s new: Cosine, a startup based in London, unveiled Genie, a coding assistant that achieves top performance on SWE-bench, which tests a model’s ability to solve GitHub issues. The company has yet to announce pricing and availability, but a waitlist is available.

How it works: Genie is a fine-tuned version of GPT-4o with a larger context window of undisclosed size. It works similarly to agentic coding tools like Devin, Q, OpenDevin, and SWE-agent. Its agentic workflow loops through four processes: retrieving information, planning, writing code, and running it. It was trained on a proprietary training set that captures software engineers’ processes for reasoning, gathering information, and making decisions. It edits lines of code in place rather than rewriting entire sections or files from scratch.
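Cosine hasn’t disclosed its implementation, but the four-stage loop described above follows a common agentic pattern. Below is a minimal, hypothetical sketch of such a loop; the helper functions are placeholders for retrieval, model calls, and test execution, not Cosine’s API.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    log: str

# Placeholder stubs: a real system would do code retrieval, in-place editing,
# model inference, and test execution here.
def retrieve(repo, issue): return f"(files from {repo} relevant to: {issue})"
def llm(prompt): return f"(model response to: {prompt[:40]}...)"
def run_tests(repo, patch): return RunResult(passed=False, log="(test output)")

def solve_issue(issue, repo, max_iterations=3):
    """Loop through retrieve -> plan -> write code -> run until tests pass."""
    context = retrieve(repo, issue)
    for _ in range(max_iterations):
        plan = llm(f"Plan a fix for:\n{issue}\n\nContext:\n{context}")
        patch = llm(f"Edit the code in place to implement:\n{plan}")
        result = run_tests(repo, patch)
        if result.passed:
            return patch
        context += f"\nTest output:\n{result.log}"   # feed failures back into the loop
    return None

print(solve_issue("Fix off-by-one error in pagination", "example/repo"))
```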
Results: Tested on SWE-bench Full (2,294 issue-commit pairs across 12 Python repositories), Genie solved 30.1 percent of problems, far ahead of the next closest competitor, Amazon Q, at 19.75 percent. Genie solved 50.7 percent of SWE-bench Lite (a subset winnowed to 300 issue-commit pairs to save computation), beating CodeStory Aide combined with other models at 43 percent. (Genie’s results don’t appear on the official SWE-bench leaderboard. The leaderboard requires that models document their workings, which Cosine declined to do in order to avoid revealing proprietary information. Cosine released Genie’s solution sets so its performance can be verified.)

Behind the news: SWE-bench’s creators recently collaborated with OpenAI to produce a new version, SWE-bench Verified. They eliminated extremely difficult and poorly configured problems, leaving 500 human-verified issue-commit pairs. Cosine has yet to publish Genie’s performance on SWE-bench Verified. As of this writing, Amazon Q ranks in first place with 38.8 percent.

Why it matters: Some developers of AI coding assistants train models to follow human-style procedures while others are building AI-native methods. Genie takes a distinct step forward by mimicking software engineers. Competition between the two approaches, along with longer context windows, faster inference, and increasingly sophisticated agentic workflows, is driving improvement of coding assistants at a rapid pace.

We’re thinking: We’re glad this Genie escaped the bottle!
AI Lobby Expands

AI is a red-hot topic for lobbyists who aim to influence government policies in the United States.
Yes, but: The lobbying disclosure forms show who is spending money to influence policy, but they provide only a limited view. For instance, they reveal only that an organization aimed to influence AI policy, not the direction in which it aimed to influence that policy. Similarly, the disclosures shed no light on other efforts to influence laws and regulations, such as advertising or campaign contributions. They also don’t reveal how much an organization discussed AI relative to other topics and concerns. For instance, last year the American Medical Association spent $21.2 million on lobbying, including lobbying related to AI, but, given the wide range of policy issues involved in medicine, AI likely accounted for a small portion of the total.
Multimodal to the Max

Researchers introduced a model that handles an unprecedented number of input and output types, including many related to performing computer vision tasks.

What’s new: Roman Bachmann, Oguzhan Fatih Kar, David Mizrahi, and colleagues at EPFL and Apple built 4M-21, a system that works with 21 input and output types. These include modalities related to images, geometry, and text along with metadata and embeddings produced by other models.

Key insight: The authors followed and extended their insight from the earlier 4M, which handles seven input and output types, as well as work such as Unified-IO 2, which handles 11. The key to training a model to handle multiple types of data input is to ensure that the training data takes the same format with the same-sized embedding across all input types. Using the transformer architecture, tokens suffice.

How it works: 4M-21 comprises a large transformer and several encoder-decoders that convert different data types into tokens and back. The authors repeated their training strategy for 4M, but they increased the transformer’s size from 303 million parameters to 3 billion parameters, boosted the training dataset size from 400 million examples to 500 million examples, and incorporated new input types.
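As a rough illustration of the key insight (every modality becomes tokens in one shared format), the sketch below maps three modalities into disjoint slices of a common vocabulary so a single transformer could train on the combined sequence. The tokenizer functions are simplified stand-ins, not the authors’ encoder-decoders.

```python
# Illustrative sketch of the "everything becomes tokens" idea behind 4M-21.
# The modality-specific tokenizers below are toy stand-ins for illustration only.

def tokenize_text(caption):            # stand-in for a subword text tokenizer
    return [hash(w) % 1000 for w in caption.split()]

def tokenize_image(pixels):            # stand-in for a learned image tokenizer
    return [1000 + (p % 1000) for p in pixels]

def tokenize_depth(depth_map):         # stand-in for a depth-map tokenizer
    return [2000 + int(d * 10) % 1000 for d in depth_map]

# Each modality gets its own slice of a shared vocabulary, so a single transformer
# can consume any mix of modalities as one token sequence and be trained to
# predict the tokens of any other modality.
sequence = (
    tokenize_text("a red car parked on a street")
    + tokenize_image([12, 250, 87, 199])
    + tokenize_depth([0.5, 1.2, 3.4])
)
print(sequence)
```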
Results: 4M-21 demonstrated strong zero-shot performance in a variety of vision tasks. For instance, in estimating surface normals for each point in an image, 4M-21 achieved a 20.8 L1 score (average absolute difference between predicted and true values, lower is better), while the multimodal model UnifiedIO 2-XL achieved 34.8 L1. In estimating an image’s depth map, 4M-21 achieved 0.68 L1, while UnifiedIO 2-XL achieved 0.86 L1. In semantic segmentation, 4M-21 reached 48.1 percent mean intersection over union (overlap between predicted and ground-truth segments divided by their union, higher is better), while UnifiedIO 2-XL achieved 39.7 percent.

Why it matters: Since 4M-21 learned to predict tokens of several modalities using tokens from other modalities, it isn’t limited to a single modality as input. The authors demonstrate that it can generate new images conditioned on a combination of a caption and 3D human poses, edges, or metadata.

We’re thinking: The authors say 4M-21 can take as input any combination of the modalities it’s trained to handle and output any of them. The limits of this capability aren’t clear, but it opens the door to fine control over the model’s output. The authors explain how they extracted the various modalities; presumably users can do the same to prompt the model for the output they desire. For instance, a user could request an image by entering not only a prompt but also a color palette, edges, and a depth map extracted from another image, and receive output that integrates those elements.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.