Dear friends,
An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you’re unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start with one-shot or few-shot prompting using a handful of examples. If that doesn’t work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn’t deliver the performance you want, then try fine-tuning, but be aware that this represents a significantly greater level of complexity and may require hundreds or thousands of additional examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.
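To make the first step concrete, here is a minimal sketch of few-shot prompting, assuming the OpenAI Python client circa 2023; the model name, labels, and example reviews are illustrative placeholders, not material from the course:

```python
# Few-shot prompting sketch: a handful of labeled examples precedes the query.
# Assumes the OpenAI Python client (openai==0.27.x); model name is illustrative.
import openai

messages = [
    {"role": "system", "content": "Classify each review as positive or negative."},
    # Labeled examples show the model the desired format and behavior.
    {"role": "user", "content": "Review: Great battery life."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: The screen cracked within a week."},
    {"role": "assistant", "content": "negative"},
    # The actual input to classify.
    {"role": "user", "content": "Review: Fast shipping, and it works perfectly."},
]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message["content"])  # expected: positive
```

In the RAG variant, retrieved passages would simply be prepended to the final user message before calling the model.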
Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that’s not available for fine-tuning. Is fine-tuning a much smaller model likely to yield better results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM’s output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning, a task in which GPT-4 surpasses current open models, it can be difficult to fine-tune a smaller model to deliver superior results.
Beyond choosing a development approach, it’s also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I’ll talk about how to make this choice in a future letter.
Keep learning!
Andrew
P.S. We just released “Large Language Models with Semantic Search,” a short course built in collaboration with Cohere and taught by Jay Alammar and Luis Serrano. Search is a key part of many applications. Say you need to retrieve documents or products in response to a user query. How can LLMs help? You’ll learn about (i) embeddings to retrieve a collection of documents loosely related to a query and (ii) LLM-assisted re-ranking to rank them precisely according to a query. You’ll also go through code that shows how to build a search system for retrieving relevant Wikipedia articles. Please check it out!
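As a rough sketch of that two-stage pipeline, here is what embedding-based retrieval followed by re-ranking might look like, assuming the Cohere Python SDK’s embed and rerank endpoints; the tiny corpus, model name, and API key are placeholders, and this is not the course’s code:

```python
# Two-stage semantic search sketch: embedding retrieval, then re-ranking.
# Assumes the Cohere Python SDK (v4-era); corpus and key are placeholders.
import numpy as np
import cohere

co = cohere.Client("YOUR_API_KEY")

docs = [
    "The Eiffel Tower is in Paris and opened in 1889.",
    "Python is a popular programming language.",
    "Paris is the capital of France.",
]
query = "Where is the Eiffel Tower?"

# Stage 1: embeddings retrieve documents loosely related to the query.
doc_vecs = np.array(co.embed(texts=docs).embeddings)
q_vec = np.array(co.embed(texts=[query]).embeddings[0])
cosine = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
candidate_ids = cosine.argsort()[::-1][:2]  # keep the coarse top-k candidates

# Stage 2: a re-ranking model orders the candidates precisely.
reranked = co.rerank(
    query=query,
    documents=[docs[i] for i in candidate_ids],
    model="rerank-english-v2.0",
)
for result in reranked.results:
    print(docs[candidate_ids[result.index]], result.relevance_score)
```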
News

GPU Shortage Intensifies

Nvidia’s top-of-the-line chips are in high demand and short supply.

What’s new: There aren’t enough H100 graphics processing units (GPUs) to meet the crush of demand brought on by the vogue for generative AI, VentureBeat reported.

Bottleneck: Cloud providers began having trouble finding GPUs earlier this year, but the shortfall has spread to AI companies large and small. SemiAnalysis, a semiconductor market research firm, estimates that the chip will remain sold out into 2024.
Who’s buying: Demand for H100s is hard to quantify. Large AI companies and cloud providers may need tens of thousands to hundreds of thousands of them, while AI startups may need hundreds to thousands.
Behind the news: Nvidia announced the H100 early last year and began full production in September. Compared to its predecessor, the A100, the H100 performs about 2.3 times faster in training and 3.5 times faster at inference.

Why it matters: Developers need these top-of-the-line chips to train high-performance models and deploy them in cutting-edge products. At a time when AI is white-hot, a dearth of chips could affect the pace of innovation.
China’s LLMs Open Up

The latest wave of large language models trained in Chinese is open source for some users.

What’s new: Internet giant Alibaba released large language models that are freely available to smaller organizations. Alibaba followed Baichuan Intelligent Technology, a startup that contributed its own partly open models, and Beijing Academy of Artificial Intelligence, which announced that its WuDao 3.0 would be open source.
Behind the news: Developers in China are racing to cash in on chatbot fever. But they face unique hurdles.
Why it matters: The March leak of Meta’s LLaMA initiated a groundswell of open models that excel in English and a subsequent explosion of innovation and entrepreneurial activity. Competitive open models trained in Mandarin and other Chinese languages could spark similar developments in one of the world’s biggest countries, as long as developers hew to the law.

We’re thinking: High-profile models like ChatGPT and Bard, having been trained on huge amounts of English-language data, tend to know a lot about the histories, geographies, and societies of English-speaking countries but relatively little about places where other languages are spoken. Models trained on Chinese corpora will serve speakers of China’s languages far better, and open source models fine-tuned for Chinese users likely will play an important role.
A MESSAGE FROM DEEPLEARNING.AI

Join our new course, “Large Language Models with Semantic Search,” and learn how to implement powerful search features using LLMs and use your website’s information to generate responses. Enroll for free
ChatGPT’s Best Friend

The latest robot dog is smarter and less expensive than ever.

What’s new: Unitree Robotics of Hangzhou, China, unleashed Go2, a quadruped robot that trots alongside its owner, stands on two legs, jumps, talks, takes photos, and retails for less than a high-end MacBook.

How it works: Go2 is made of aluminum and plastic, weighs around 15 kilograms, and moves using 12 joints. A robotic arm mounted on the unit’s back is optional. It comes in three versions with a starting price of $1,600.
Why it matters: Boston Dynamics’ industrial-strength robodog Spot is manipulating high-voltage electrical equipment, inspecting nuclear power plants, and helping to monitor urban areas. But its price — from $74,500 to $200,000 — puts it out of reach of many potential users. With its dramatically lower price, Go2 suggests that such mechanical beasts may find a wider range of uses.
LLMs Get a Life

Large language models increasingly reply to prompts with a believably human response. Can they also mimic human behavior?

What’s new: Joon Sung Park and colleagues at Stanford and Google extended GPT-3.5 to build generative agents that went about their business in a small town and interacted with one another in human-like ways. The code is newly available as open source.

Key insight: With the right prompts, a text database, and a server to keep track of things, a large language model (LLM) can simulate human activity.
How it works: The authors designed 25 agents (represented by 2D sprites) who lived in a simulated town (a 2D background depicting the layout and the contents of its buildings) and let them run for two days. Each agent used GPT-3.5; a database of actions, memories, reflections, and plans generated by GPT-3.5; and a server that tracked agent and object behaviors, locations (for instance, in the kitchen of Isabella’s apartment), and statuses (whether a stove was on or off), and relayed this information to agents when they came nearby.
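As a loose sketch (not the authors’ code), the architecture boils down to a loop in which the server feeds observations into each agent’s memory and the LLM chooses the next action. Every name below, including the llm() helper and its canned reply, is a hypothetical stand-in:

```python
# Loose sketch of a generative agent: an LLM, a memory database, and a
# server relaying nearby-world state. All names are hypothetical stand-ins.
import time

def llm(prompt: str) -> str:
    # Placeholder for a GPT-3.5 call; a canned reply keeps the sketch runnable.
    return "make breakfast in the kitchen"

class Agent:
    def __init__(self, name, traits):
        self.name = name
        self.traits = traits
        self.memories = []  # list of (timestamp, text) observations and actions

    def observe(self, event):
        # The server pushes states of nearby agents and objects into memory.
        self.memories.append((time.time(), event))

    def act(self, recent_k=10):
        # Condition the LLM on identity plus recent memories to pick an action.
        recent = "\n".join(text for _, text in self.memories[-recent_k:])
        prompt = (
            f"You are {self.name}, {self.traits}.\n"
            f"Recent memories:\n{recent}\n"
            "What do you do next? Answer in one short sentence."
        )
        action = llm(prompt)
        self.memories.append((time.time(), f"I decided to {action}"))
        return action

isabella = Agent("Isabella", "a friendly cafe owner")
isabella.observe("The stove in my kitchen is off.")
print(isabella.act())
```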
Results: The complete agents exhibited three types of emergent behavior: They spread information initially known only to themselves, formed relationships, and cooperated (specifically to attend a party). The authors gave 100 human evaluators access to all agent actions and memories. The evaluators asked the agents simple questions about their identities, behaviors, and thoughts. Then they ranked the agents’ responses for believability. They also ranked versions of each agent that were missing one or more functions, as well as humans who stood in for each agent (“to identify whether the architecture passes a basic level of behavioral competency,” the authors write). These rankings were turned into a TrueSkill score (a variation on the Elo system used in chess) for each agent type. The complete agent architecture scored highest, while the versions that lacked particular functions scored lower. Surprisingly, the human stand-ins also underperformed the complete agents.

Yes, but: Some complete agents “remembered” details they had not experienced. Others showed erratic behavior, like not recognizing that a one-person bathroom was occupied or that a business was closed. And they used oddly formal language in intimate conversation; one ended exchanges with her husband, “It was good talking to you as always.”

Why it matters: Large language models produce surprisingly human-like output. Combined with a database and server, they can begin to simulate human interactions. While the TrueSkill results don’t fully convey how humanly these agents behaved, they do suggest a role for such agents in fields like game development, social media, robotics, and epidemiology.

We’re thinking: The evaluators found the human stand-ins less believable than the full-fledged agents. Did the agents exceed human-level performance in the task of acting human, or does this result reflect a limitation of the evaluation method?
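For a sense of how pairwise believability judgments become scores, here is an illustrative use of the open source trueskill package; the agent names and comparison outcomes are made-up placeholders, not the paper’s data:

```python
# Converting pairwise "more believable" judgments into TrueSkill ratings
# with the open source `trueskill` package. Names and outcomes are made up.
import trueskill

ratings = {
    name: trueskill.Rating()
    for name in ["complete_agent", "ablated_agent", "human_standin"]
}

# Each tuple is (winner, loser) from one evaluator's believability judgment.
comparisons = [
    ("complete_agent", "ablated_agent"),
    ("complete_agent", "human_standin"),
    ("human_standin", "ablated_agent"),
]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(
        ratings[winner], ratings[loser]
    )

# Higher mu means more believable under this model; sigma is uncertainty.
for name, r in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f}")
```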
A MESSAGE FROM DEEPLEARNING.AI

Join our upcoming workshop with Predibase and learn how to use open source tools to overcome challenges like the “host out of memory” error when fine-tuning models like Llama-2. Register now
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.