Dear friends,
A barrier to faster progress in generative AI is evaluations (evals), particularly of custom AI applications that generate free-form text. Let’s say you have a multi-agent research system that includes a researcher agent and a writer agent. Would adding a fact-checking agent improve the results? If we can’t efficiently evaluate the impact of such changes, it’s hard to know which changes to keep.
The cost of running evals poses an additional challenge. Let’s say you’re using an LLM that costs $10 per million input tokens, and a typical query has 1000 tokens. Each user query therefore costs only $0.01. However, if you iteratively work to improve your algorithm based on 1000 test examples, and if in a single day you evaluate 20 ideas, then your cost will be 20*1000*0.01 = $200. For many projects I’ve worked on, the development costs were fairly negligible until we started doing evals, whereupon the costs suddenly increased. (If the product turned out to be successful, then costs increased even more at deployment, but that was something we were happy to see!)
Keep learning! Andrew
A MESSAGE FROM DEEPLEARNING.AILearn how to build and customize multi-agent systems in “AI Agentic Design Patterns with AutoGen,” made in collaboration with Microsoft and Penn State University. Use the AutoGen framework and implement four agentic design patterns: Reflection, Tool Use, Planning, and Multi-Agent Collaboration. Sign up for free
NewsHeart-Risk Model Saves LivesA deep learning model significantly reduced deaths among critically ill hospital patients. What’s new: A system built by Chin-Sheng Lin and colleagues at Taiwan’s National Defense Medical Center analyzed patients’ heart signals and alerted physicians if it detected a high risk of death. It reduced deaths of high-risk patients by 31 percent in a randomized clinical trial. How it works: Researchers trained a convolutional neural network, given an electrocardiogram (a measurement of the heart’s electrical activity), to estimate a risk score. The system compares a patient’s risk score against those of other patients. Scores that rank in the 95th percentile or higher are considered high risk of death within 90 days.
Results: 8.6 percent of patients in the control group and 8.9 percent of patients in the experimental group raised a high-risk alert during the trial. In the experimental group, 16 percent of high-risk patients died; in the control group, 23 percent of high-risk patients died. Overall, in the experimental group, 3.6 percent of patients died; in the control group, 4.3 percent of patients died. The model was trained to predict mortality from all causes, but it showed unusually strong predictive capability for heart-related deaths. Examining causes of death, the authors found that 0.2 percent of patients in the experimental group died from heart-related conditions such as cardiac arrest versus 2.4 percent in the control group. We’re thinking: This relatively low-cost AI system unambiguously saved lives over three months at different hospitals! We look forward to seeing it scale up.
Self-Driving on Indian RoadsFew makers of self-driving cars have braved the streets of India. Native startups are filling the gap. What’s new: Indian developers are testing autonomous vehicles on their nation’s disorderly local roads. To cope with turbulent traffic, their systems use different technology from their Western and East Asian counterparts, IEEE Spectrum reported. How it works: In Indian cities, two-, three-, and four-wheelers share the road with trucks, pedestrians, and animals. Drivers often contend with debris and potholes, and many don’t follow rules. These conditions demand vehicles outfitted with technology that’s more flexible (and less expensive) than the interwoven sensors, models, and 3D maps employed by self-driving cars designed for driving conditions like those found in the United States.
Behind the news: Bringing self-driving cars to India has political as well as technical dimensions. Many Indians hire full-time drivers, and the country’s minister of roads and highways has resisted approving the technology because of its potential impact on those jobs. Drivers cost as little as $150 per month, which puts self-driving car makers under pressure to keep their prices very low. Moreover, India’s government insists that vehicles sold there must be manufactured locally, posing a barrier to foreign makers of self-driving cars. Why it matters: Rather than starting with an assumption that traffic follows orderly patterns with many edge cases, Indian developers assume that traffic is essentially unpredictable. For them, events that most developers would consider outliers — vehicles approaching in the wrong lanes, drivers who routinely play chicken, domestic animals in the way — are common. This attitude is leading them to develop robust self-driving systems that not only may be better suited to driving in complex environments but also may respond well to a broader range of conditions. We’re thinking: Former Uber CEO Travis Kalanick said that India would be “the last one” to get autonomous cars. These developers may well prove him wrong!
Knowledge Workers Embrace AIAI could offer paths to promotion and relief from busywork for many knowledge workers. What’s new: 75 percent of knowledge workers worldwide use AI even if they need to supply their own tools, according to survey conducted by Microsoft and Linkedin. How it works: The authors questioned 3,800 workers in 31 countries throughout the Americas, Europe, Asia, and Australia, asking whether and how they used consumer-grade generative systems like Microsoft Copilot and OpenAI ChatGPT. Majorities of all age groups used AI at work, including 85 percent of respondents 28 or younger and 73 percent of those 58 or older.
Behind the news: The survey results agree with those of other studies of AI’s impact on the workplace. In January, the International Monetary Fund projected that AI would affect 40 percent of all jobs worldwide (either complementing or replacing them), including 60 percent of jobs in countries like the UK and U.S. that have greater percentages of knowledge workers. A 2023 research paper argued that white-collar occupations were most likely to be affected by generative AI, in contrast to previous waves of automation that primarily affected blue-collar jobs. Automation driven by AI increased overall employment, evidence gathered by the European Central Bank shows. Why it matters: AI is transforming work from the bottom up. Executives and managers want employees who know how to use the technology, but only 39 percent of the people who already do so received training from their employers. Company-wide encouragement to experiment with and take advantage of AI leads to the best outcomes. We’re thinking: Knowing how to use AI tools is a plus in the current job market. Knowing how to build applications using AI opens another world of doors.
Richer Context for RAGText excerpts used in retrieval augmented generation (RAG) tend to be short. Researchers used summarization to pack more relevant context into the same amount of text. What’s new: Parth Sarthi and colleagues at Stanford built Recursive Abstractive Processing for Tree-Organized Retrieval (RAPTOR), a retrieval system for LLMs. RAPTOR can choose to deliver original text or summaries at graduated levels of detail, depending on the LLM’s maximum input length. Key insight: RAG improves the output of large language models by gathering from documents and/or web pages excerpts that are relevant to a user’s prompt. These excerpts tend to be brief to avoid exceeding an LLM’s maximum input length. For instance, Amazon Bedrock’s default excerpt length is 200 tokens (words or parts of a word). But important details may be scattered throughout longer passages, so short excerpts can miss them. A summarizer can condense longer passages into shorter ones, and summarizing summaries can condense large amounts of text into short passages. How it works: RAPTOR retrieved material from QASPER, a question answering corpus that contains around 1,600 research papers on natural language processing. The authors processed QASPER through an iterative cycle of summarizing, embedding, and clustering. The result was a graduated series of summaries at ever higher levels of abstraction.
Results: Paired with a variety of LLMs, RAPTOR exceeded other retrievers in RAG performance on QASPER’s test set. Paired with the UnifiedQA LLM, RAPTOR achieved 36.7 percent F1 score (here, the percentage of tokens in common between the output and ground truth), while SBERT (with access to only the 100-token excerpts) achieved 36.23 percent F1 score. Paired with GPT-4, RAPTOR achieved 55.7 percent F1 score (setting a new state of the art for QASPER), DPR achieved 53.0 percent F1 score, and providing paper titles and abstracts achieved 22.2 percent F1 score. Why it matters: Recent LLMs can process very long inputs, notably Gemini 1.5 (up to 2 million tokens) and Claude 3 (200,000 tokens). But it takes time to process so many tokens. Further, prompting with long inputs can be expensive, approaching a few dollars for a single prompt in extreme cases. RAPTOR enables models with tighter input limits to get more context from fewer tokens. We’re thinking: This may be the technique that developers who struggle with input context length have been long-ing for!
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.
|