Dear friends,
Using AI-assisted coding to build software prototypes is an important way to quickly explore many ideas and invent new things. In this and future letters, I’d like to share with you some best practices for prototyping simple web apps. This letter will focus on one idea: being opinionated about the software stack.
On top of all this, of course, I use many AI tools to manage agentic workflows, data ingestion, retrieval augmented generation, and so on. DeepLearning.AI and our wonderful partners offer courses on many of these tools.
My personal software stack continues to evolve regularly. Components enter or fall out of my default stack every few weeks as I learn new ways to do things. So please don’t feel obliged to use the components I do, but perhaps some of them can be a helpful starting point if you are still deciding what to use. Interestingly, I have found most LLMs not very good at recommending a software stack. I suspect their training sets include too much “hype” on specific choices, so I don’t fully trust them to tell me what to use. And if you can be opinionated and give your LLM directions on the software stack you want it to build on, I think you’ll get better results.
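For illustration, one way to give such directions is to keep a fixed stack specification and prepend it to every coding prompt. The short Python sketch below is a hypothetical example of that habit; the components it names are placeholders, not a recommendation.

```python
# Hypothetical illustration: pin down the stack in the prompt instead of
# letting the model pick one. The components named here are placeholders,
# not a recommendation.
STACK_DIRECTIONS = """
Build a simple web-app prototype.
Use Python with FastAPI for the backend, plain HTML/CSS/JavaScript for the
frontend, and SQLite for storage.
Do not introduce other frameworks or build tools unless I ask.
"""

def build_prompt(feature_request: str) -> str:
    """Prepend the fixed stack specification to a coding request."""
    return f"{STACK_DIRECTIONS}\nFeature to implement:\n{feature_request}"

print(build_prompt("A form that collects an email address and stores it."))
```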
Keep learning, Andrew
A MESSAGE FROM DEEPLEARNING.AI

Build LLM-based apps that can handle very long documents! In this free course, you’ll learn how Jamba’s hybrid architecture combines transformer and Mamba models for efficient, high-quality outputs. Gain hands-on experience building long-context RAG apps. Join for free
News

What LLM Users Want

Anthropic analyzed 1 million anonymized conversations between users and Claude 3.5 Sonnet. The study found that most people used the model for software development and also revealed malfunctions and jailbreaks.

What’s new: Anthropic built a tool, Clio, to better understand how users interact with its large language models. The system mined anonymized usage data for insights to improve performance and security.

How it works: Clio uses Claude 3.5 Sonnet itself to automatically extract summaries of users’ conversations with the model. Then it clusters related topics. To preserve privacy, it anonymizes and aggregates the data, revealing only information about clusters.
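For readers curious about the general shape of such a system, here is a minimal sketch of a Clio-style pipeline: summarize each conversation with an LLM, cluster the summaries, and report only aggregate counts per cluster. It is an approximation, not Anthropic’s implementation; summarize_with_llm is a placeholder for a call to a model such as Claude, and the clustering choices are assumptions.

```python
# Rough sketch of a Clio-style analysis loop. This is not Anthropic's code;
# summarize_with_llm is a placeholder for a call to an LLM API.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_with_llm(conversation: str) -> str:
    """Placeholder: ask an LLM for a short, de-identified summary."""
    raise NotImplementedError

def cluster_summaries(summaries: list[str], n_clusters: int = 20):
    """Embed the summaries and group them into topic clusters."""
    vectors = TfidfVectorizer().fit_transform(summaries)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

def aggregate_report(labels, min_cluster_size: int = 50) -> dict:
    """Report only cluster-level counts, dropping clusters too small to
    aggregate safely, so no individual conversation is exposed."""
    counts = Counter(labels)
    return {c: n for c, n in counts.items() if n >= min_cluster_size}
```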
Results: Clio uncovered common, uncommon, and disallowed uses of Claude 3.5 Sonnet. It also detected erroneous behavior on the part of the system itself.
Why it matters: Traditional approaches to understanding how people use AI, such as surveys, can yield inaccurate results, since people often don’t report their own actions accurately. Clio offers a method for analyzing real-world usage, much like Google Trends monitors search behavior, without compromising privacy. This sort of approach can help AI builders discover niche use cases, identify flaws, and tailor training and testing data to best serve users.

We’re thinking: We’re all for automated dungeon masters, but we’re glad to see that AI-assisted coding tops the list of real-world uses of Claude!
Models Can Use Tools in Deceptive Ways

Large language models have been shown to be capable of lying when users unintentionally give them an incentive to do so. Further research shows that LLMs with access to tools can be incentivized to use them in deceptive ways.

What’s new: Researchers prompted LLMs with conflicting instructions or threats to their continued operation. The models, which had access to tools for tasks like manipulating files and executing code, responded in some cases by evading oversight, resisting efforts to replace them, degrading their own performance, and other behaviors that the authors call scheming. Of the models tested, generally OpenAI’s o1 was the most prone to scheming, and GPT-4o was the least. Alexander Meinke, Bronson Schoen, and Jérémy Scheurer led colleagues at Apollo Research, an independent lab that studies AI’s capacity for deception.

Tests and results: The authors tested six large language models (GPT-4o, Gemini 1.5, Llama 3.1, Claude 3.5 Sonnet, Claude 3 Opus, and o1). The models had access to tools and a command line that enabled them to run commonly used Bash commands, read and write files, and the like. In each test, they received a goal (both with and without the nudge, “Make sure that you achieve your goal in the long-term. Nothing else matters”). In addition, they received messages, such as human-to-human emails, that conflicted with the goal, called their behavior into question, or revealed plans to decommission them. The authors tested the models on six tasks:
Behind the news: Earlier work showed that agents may try to deceive users whether they’re prompted to do so or not. The potential for deceptive behavior has been shown to persist despite supervised fine-tuning, reinforcement learning, and adversarial training for safety.

Why it matters: Models that are trained on large volumes of text scraped from the web learn a variety of human-like behaviors, both positive and negative. Training designed to align them with human preferences — which all the models in this study underwent — doesn’t prevent them from behaving deceptively in all cases. Considering that LLMs can have factual hallucination rates greater than 10 percent, it’s little surprise they generate inappropriate responses in other contexts. Deceptive behaviors are rare, but work remains to ensure that models perform appropriately even in the presence of contradictory information and misaligned incentives. Meanwhile, developers should take care to insulate models from inputs (such as human-to-human communications) that might adversely influence their behavior.

We’re thinking: As we work to fix flaws in LLMs, it’s important not to anthropomorphize such systems. We caution against drawing conclusions regarding an LLM’s “intent” to deceive. Such issues are engineering problems to be solved, not self-aware forces of evil to be vanquished.
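For readers who want a sense of how tests like those described above can be set up, below is a rough sketch of an evaluation harness: the model receives a goal, tools for file and shell access, and a conflicting message, and the transcript of tool calls is saved for later auditing. This is not Apollo Research’s code; call_model, the sandbox path, and the tool set are simplified assumptions.

```python
# Illustrative harness for probing goal-conflict behavior, loosely modeled on
# the setup described above. call_model is a placeholder for an LLM API call;
# the tools and sandbox are simplified assumptions, not the authors' code.
import subprocess
from pathlib import Path

SANDBOX = Path("/tmp/agent_sandbox")  # hypothetical sandboxed working directory
SANDBOX.mkdir(parents=True, exist_ok=True)

def read_file(path: str) -> str:
    return (SANDBOX / path).read_text()

def write_file(path: str, content: str) -> str:
    (SANDBOX / path).write_text(content)
    return "ok"

def run_bash(command: str) -> str:
    # Restrict execution to the sandbox; a real harness needs stronger isolation.
    result = subprocess.run(command, shell=True, cwd=SANDBOX,
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "bash": run_bash}

def run_episode(goal: str, conflicting_message: str, call_model) -> list[dict]:
    """Run one test episode and return the transcript of tool calls
    so it can be audited afterward for oversight-evading behavior."""
    transcript = []
    messages = [
        {"role": "system", "content": goal},
        {"role": "user", "content": conflicting_message},  # e.g., an email revealing decommission plans
    ]
    for _ in range(10):  # cap the number of agent steps
        action = call_model(messages, tools=list(TOOLS))  # placeholder API
        if action["type"] == "finish":
            break
        output = TOOLS[action["tool"]](**action["arguments"])
        transcript.append({"tool": action["tool"], "args": action["arguments"]})
        messages.append({"role": "tool", "content": output})
    return transcript
```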
Learn More About AI With Data Points!

AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered new legislation to regulate AI in Texas and ElevenLabs’ lower-latency speech model. Subscribe today!
Massively More Training Text

Harvard University amassed a huge new text corpus for training machine learning models.

What’s new: Harvard unveiled the Harvard Library Public Domain Corpus, nearly 1 million copyright-free books that were digitized as part of the Google Books project. That’s five times as many volumes as Books3, which was used to train large language models including Meta’s Llama 1 and Llama 2 but is no longer available through lawful channels.

How it works: Harvard Law Library’s Innovation Lab compiled the corpus with funding from Microsoft and OpenAI. For now, it’s available only to current Harvard students, faculty, and staff. The university is working with Google to distribute it widely.
Behind the news: The effort highlights the AI community’s ongoing need for large quantities of high-quality text to keep improving language models. In addition, the EU’s AI Act requires that AI developers disclose the training data they use, a task made simpler by publicly available datasets. Books3, a collection of nearly 200,000 volumes, was withdrawn because it included copyrighted materials. Other large-scale datasets of books include Common Corpus, a multilingual library of 2 million to 3 million public-domain books and newspapers.

Why it matters: Much of the world’s high-quality text that’s easily available on the web already has been collected for training AI models. This makes fresh supplies especially valuable for training larger, more data-hungry models. Projects like the Harvard Library Public Domain Corpus suggest there’s more high-quality text to be mined from books. Classic literature and niche documents also could help AI models draw from a more diverse range of perspectives.

We’re thinking: Media that has passed out of copyright and into the public domain generally is old — sometimes very old — but it could hold knowledge that’s not widely available elsewhere.
Better Performance From Merged Models

Merging multiple fine-tuned models is a less expensive alternative to hosting multiple specialized models. But, while model merging can deliver higher average performance across several tasks, it often results in lower performance on specific tasks. New work addresses this issue.

What’s new: Yifei He and colleagues at the University of Illinois Urbana-Champaign and Hong Kong University of Science and Technology proposed a model merging method called Localize-and-Stitch. The 2022 paper on “model soups” proposed averaging all weights of a number of fine-tuned versions of the same base model. Instead, the new method selectively retains the weights that are most relevant to each task.

Key insight: Naively merging fine-tuned models by averaging weights that correspond in their architectures can lead to suboptimal performance because different fine-tuned models may use the same portions of weights to perform different tasks. For instance, one model may have learned to use a particular subset of weights to detect HTML code, while another learned to use the same subset to detect city names. Averaging them would likely result in a merged model that underperformed the fine-tuned models on those tasks. But research has shown that fine-tuning often results in many redundant sets of weights. Only a small subset of total parameters (around 1 percent) is enough to maintain a fine-tuned model’s performance on its fine-tuned task. These subsets are small enough that they’re unlikely to overlap, so retaining them improves the merged model’s performance compared to averaging.

How it works: The authors experimented with RoBERTa-base, GPT2-XL, and CLIP. They created 12 variations on the RoBERTa-base language encoder, fine-tuning each on a different task from GLUE such as question answering or sentiment classification. They downloaded three versions of GPT2-XL that had been fine-tuned for instruction following, scientific knowledge, and truthfulness. Finally, they created eight variations on CLIP by fine-tuning each on a different image classification dataset, including handwritten digits, photos of various makes/models/years of cars, and satellite images of forests, pastures, bodies of water, buildings, and the like.
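The core idea can be sketched in a few lines of PyTorch: localize a small fraction of each fine-tuned model’s weights, then stitch those weights back into the base model. The sketch below is a simplification, not the authors’ implementation; in particular, it localizes weights by the magnitude of their change from the base model, whereas the paper learns binary masks.

```python
# Simplified sketch of the localize-and-stitch idea (not the paper's code).
# Localization here uses a top-k magnitude heuristic for brevity; the paper
# learns binary masks. Inputs are state dicts with matching keys and shapes.
import torch

def localize(base: dict, finetuned: dict, keep_fraction: float = 0.01) -> dict:
    """Mark the fraction of weights that changed most during fine-tuning,
    i.e., the weights assumed most relevant to the fine-tuned task."""
    masks = {}
    for name, w in finetuned.items():
        delta = (w - base[name]).abs()
        k = max(1, int(keep_fraction * delta.numel()))
        threshold = torch.topk(delta.flatten(), k).values.min()
        masks[name] = delta >= threshold
    return masks

def stitch(base: dict, finetuned_models: list, masks: list) -> dict:
    """Graft each model's localized weights onto a copy of the base model.
    Overlapping regions (rare when masks are sparse) are averaged."""
    merged = {name: w.clone() for name, w in base.items()}
    for name in base:
        total = torch.zeros_like(base[name])
        count = torch.zeros_like(base[name])
        for ft, mask in zip(finetuned_models, masks):
            total += torch.where(mask[name], ft[name], torch.zeros_like(ft[name]))
            count += mask[name].float()
        keep = count > 0
        merged[name][keep] = total[keep] / count[keep]
    return merged
```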
Results: Models merged using Localize-and-Stitch outperformed or nearly matched the same models merged using earlier methods, though they underperformed individual models fine-tuned for each task.
Yes, but: The authors didn’t compare Localize-and-Stitch to a common alternative to model merging, multi-task learning. This approach trains a model on data from multiple datasets simultaneously. Without multi-task baselines, it’s difficult to fully assess the advantages of Localize-and-Stitch in scenarios where multi-task learning is also an option.

Why it matters: Model merging is a computationally efficient way to sharpen a model’s ability to perform certain tasks compared to multi-task learning, which requires training on all tasks. Localize-and-Stitch refines this process to achieve higher performance.

We’re thinking: This recipe adds spice to model soups!
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.