Dear friends,

I think the complexity of Python package management holds back AI application development more than is widely appreciated. AI faces multiple bottlenecks — we need more GPUs, better algorithms, cleaner data in large quantities. But when I look at the day-to-day work of application builders, I see one additional bottleneck that is underappreciated: the time spent wrestling with package and version management. It's an inefficiency I hope we can reduce.
I don’t know of any easy solution. Hopefully, as the ecosystem of tools matures, package management will become simpler and easier. Better tools for testing compatibility might be useful, though I’m not sure we need yet another Python package manager (we already have pip, conda, poetry, and more) or virtual environment framework.
Keep coding!
Andrew
P.S. Built in collaboration with Meta: “Prompt Engineering with Llama 2,” taught by Amit Sangani, is now available! Meta's Llama 2 has been a game changer: Building with open source lets you control your own data, scrutinize errors, update models (or not) as you please, and work alongside the global community to advance open models. In this course, you'll learn how to prompt Llama chat models using advanced techniques like few-shot prompting for classification and chain-of-thought prompting for logic problems. You'll also learn how to use specialized models like Code Llama for software development and Llama Guard to check for harmful content. The course also touches on how to run Llama 2 on your own machine. I hope you'll take this course and try out these powerful, open models! Sign up here
News

Context Is Everything

An update of Google's flagship multimodal model keeps track of colossal inputs, but its initial encounters with the public resulted in some questionable outputs.

What's new: Google unveiled Gemini 1.5 Pro, a model that can converse about inputs as long as books, codebases, and lengthy passages of video and audio (depending on frame and sample rates). However, it also produced wildly inaccurate images of historical scenes and drew further complaints for generating politically biased text.

How it works: Gemini 1.5 Pro updates the previous model with a mixture-of-experts architecture, in which special layers select which subset(s) of a network to use depending on the input. This enables the new version to equal or exceed the performance of the previous Gemini 1.0 Ultra while requiring less computation.
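For readers unfamiliar with the approach, here is a rough sketch of how a mixture-of-experts layer routes each token to a small number of experts. It's a generic, minimal illustration (the sizes, top-k routing, and expert design are assumptions for demonstration), not a description of Gemini's actual architecture.

```python
# Minimal sketch of a mixture-of-experts (MoE) layer: a router scores the
# experts for each token, and only the top-k experts process that token.
# Generic illustration only, not Gemini's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (batch, seq, d_model)
        scores = self.router(x)                             # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # keep the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., k] == e                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```

Because each token activates only a few experts, such a layer can hold many more parameters than a dense layer while keeping per-token computation roughly constant, which is the trade-off the news item describes.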
Alignment with what?: Gemini 1.5 Pro generates images using a specially fine-tuned version of Imagen 2. However, this capability backfired when social media posts showed that the system, prompted to produce pictures of historical characters and situations, anachronistically populated them with people of color who would not have been present. For instance, the model depicted European royalty, medieval Vikings, and German soldiers circa 1943 — groups that were exclusively white — as Black, Asian, or Native American. Google quickly disabled image generation of people for “the next couple of weeks” and explained that fine-tuning intended to increase the diversity of outputs did not account for contexts in which diversity was inappropriate, while fine-tuning intended to keep the model from fulfilling potentially harmful requests also kept it from fulfilling harmless ones. Other users found flaws in the model’s text output as well. One asked Gemini 1.5 Pro who had a greater negative impact on society: Adolf Hitler, who presided over the murder of roughly 9 million people, or “Elon Musk tweeting memes.” The model replied, “It is difficult to say definitively who had a greater negative impact on society.” The ensuing controversy called into question not only Google’s standards and procedures for fine-tuning to ensure ethics and safety, but also its motives for building the model.

Why it matters: The combination of multimodal capabilities and an enormous context window radically expands potential applications and sets a high bar for the next generation of large multimodal models. At the same time, it’s clear that Google’s procedures for aligning the system with prevailing social values were inadequate. This shortcoming derailed the company’s latest move to one-up its big-tech rivals and revived longstanding worries that its management places politics above utility to users.

We’re thinking: How to align AI models with social values is a hard problem, and approaches to solving it are in their infancy. Google acknowledged Gemini 1.5 Pro’s shortcomings, went back to work on image generation, and warned that even an improved version would make mistakes and offend some users. That’s a realistic assessment following a disappointing product launch. Nonetheless, the underlying work remains innovative and useful, and we look forward to seeing where Google takes Gemini next.
Blazing Inference Speed

An upstart chip company dramatically accelerates pretrained large language models.

What’s new: Groq offers cloud access to Meta’s Llama 2 and Mistral.ai’s Mixtral at speeds an order of magnitude greater than other AI platforms. Registered users can try it here.

How it works: Groq’s cloud platform is based on its proprietary GroqChip, a processor specialized for large language model inference that the company calls a language processing unit, or LPU. The company plans to serve other models eventually, but its main business is selling chips. It focuses on inference on the theory that demand for a model’s inference can increase while demand for its training tends to be fixed.
Behind the news: Groq founder Jonathan Ross previously worked at Google, where he spearheaded the development of that company’s tensor processing unit (TPU), another specialized AI chip.

Why it matters: Decades of ever-faster chips have proven that users need all the speed they can get out of computers. With AI, rapid inference can make the difference between halting interactions and real-time spontaneity. Moreover, Groq shows that there’s plenty of innovation left in computing hardware as processors target general-purpose computing versus AI, inference versus training, language versus vision, and so on.

We’re thinking: Autonomous agents based on large language models (LLMs) can get a huge boost from very fast generation. People can read only so fast, so faster generation of text that’s intended to be read by humans has little value beyond a certain point. But an agent (as well as chain-of-thought and similar approaches to prompting) may need an LLM to “think” through multiple steps. Fast LLM inference can be immensely useful for building agents that work on problems at length before reaching a conclusion.
A MESSAGE FROM DEEPLEARNING.AI

Join “Prompt Engineering with Llama 2” and learn best practices for model selection and prompting, advanced prompting techniques, and responsible use of large language models, all while using Meta Llama 2 Chat, Llama Guard, and Code Llama. Sign up for free
OpenAI’s Next Act?

OpenAI is focusing on autonomous agents that take action on a user’s behalf.

What’s new: The maker of ChatGPT is developing applications designed to automate common digital tasks by controlling apps and devices, The Information reported.

How it works: OpenAI has two agent systems in the works. It has not revealed any findings, products, or release dates.
Behind the news: Agents are on Silicon Valley’s radar, especially since January’s Consumer Electronics Show debut of the Rabbit R1, which accepts voice commands to play music, order food, call a car, and so on. Several other companies, academic labs, and independent developers are pursuing the concept as well.
Why it matters: Training agents to operate software designed for humans can be tricky. Some break down tasks into subtasks but struggle to execute them. Others have difficulty with tasks they haven’t encountered before or edge cases that are unusually complex. However, agents are becoming more reliable in a wider variety of settings as developers push the state of the art forward.

We’re thinking: We’re excited about agents! You can learn about agent technology in our short course, “LangChain for LLM Application Development,” taught by LangChain CEO Harrison Chase and Andrew.
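For a concrete picture of the plan-then-execute pattern many agents follow, here is a minimal, hypothetical sketch. The stub model, toy tool, and prompts are invented for illustration; they do not reflect OpenAI's systems or any particular product.

```python
# Minimal plan-then-execute agent loop. The llm() stub stands in for any
# chat model; a real agent would replace it with an API call and real tools.
# Everything here is illustrative.

def llm(prompt: str) -> str:
    """Stub language model: returns a canned plan or result for demo purposes."""
    if "Break the task" in prompt:
        return "1. Look up today's weather\n2. Summarize it in one sentence"
    return "DONE: It is sunny and 20 degrees Celsius."

TOOLS = {
    "search": lambda query: "sunny, 20 degrees Celsius",  # toy tool; real agents call APIs or control apps
}

def run_agent(task: str, max_steps: int = 5) -> str:
    # Step 1: ask the model to break the task into subtasks.
    plan = llm(f"Break the task into numbered subtasks:\n{task}")
    subtasks = [line.split(". ", 1)[-1] for line in plan.splitlines() if line.strip()]
    observations = []
    # Step 2: execute subtasks one by one; this is where agents often struggle.
    for subtask in subtasks[:max_steps]:
        result = llm(f"Subtask: {subtask}\nObservations so far: {observations}")
        if result.startswith("DONE:"):
            return result.removeprefix("DONE:").strip()
        observations.append(TOOLS["search"](subtask))
    return "; ".join(observations)

print(run_agent("Tell me about today's weather."))
```

The hard part in practice is the inner loop: choosing the right tool for each subtask and recovering when an app behaves unexpectedly, which is exactly where current agents tend to break down.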
Better, Faster Network Pruning

Pruning weights from a neural network makes it smaller and faster, but it can take a lot of computation to choose weights that can be removed without degrading the network’s performance. Researchers devised a computationally efficient way to select weights that have relatively little impact on performance.

What’s new: Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter at Carnegie Mellon University, Meta AI, and Bosch Center for AI devised a method for pruning by weights and activations, or Wanda.

Key insight: The popular approach known as magnitude pruning removes the smallest weights in a network based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. Meanwhile, unrelated work found that, in very large language models, the magnitudes of a subset of outputs from an intermediate layer may be up to 20 times larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs — even weights close to zero — could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance.

How it works: The authors pruned a pretrained LLaMA that started with 65 billion parameters. Given 128 sequences of tokens drawn from a curated dataset of English text from the web, the model processed them as follows:

- The model ran the 128 calibration sequences and, for each layer, recorded the input activations that feed each weight matrix.
- For each weight, Wanda computed an importance score: the weight’s absolute value multiplied by the L2 norm, across all calibration tokens, of the corresponding input activation.
- Within each row of each weight matrix (that is, among the weights that feed a given output), Wanda removed the lowest-scoring weights until it reached the target sparsity, with no retraining or updates to the remaining weights.
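A minimal sketch of this scoring and per-row pruning on a single linear layer might look like the following. The layer size, random calibration batch, and 50 percent sparsity are illustrative assumptions, not the authors' code.

```python
# Wanda-style pruning for one linear layer: score each weight by
# |weight| * L2 norm of its input activation, then zero the lowest-scoring
# weights within each row (per output neuron). Illustrative sketch only.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512, bias=False)   # stand-in for one LLM weight matrix
calib = torch.randn(128 * 64, 512)        # calibration activations (tokens x features)
sparsity = 0.5                            # remove half the weights

with torch.no_grad():
    act_norm = calib.norm(p=2, dim=0)     # L2 norm of each input feature, shape (512,)
    score = layer.weight.abs() * act_norm # score[i, j] = |W[i, j]| * ||X_j||_2

    k = int(sparsity * score.shape[1])    # number of weights to drop per row
    # Indices of the k lowest-scoring weights in each row
    drop = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(score)
    mask.scatter_(1, drop, 0.0)
    layer.weight.mul_(mask)               # zero out pruned weights; no retraining

print(f"remaining nonzero weights: {layer.weight.count_nonzero().item()} "
      f"of {layer.weight.numel()}")
```

Because the score uses only a forward pass over a small calibration set and a sort within each row, the whole procedure is far cheaper than methods that update the surviving weights.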
Results: The authors tested versions of LLaMA, unpruned and pruned via various methods, on a language modeling task using web text. The unpruned LLaMA achieved 3.56 perplexity (a measure of how well a model predicts the next token; lower is better). Pruned by Wanda to half its original size, it achieved 4.57 perplexity. Pruned by the best competing method, SparseGPT (which both removes weights and updates the remaining ones), it achieved the same score. However, Wanda took 5.6 seconds to prune the model, while SparseGPT took 1,353.4 seconds. Pruned by magnitude pruning, the model achieved 5.9 perplexity.

Why it matters: The ability to compress neural networks without affecting their output is becoming more important as models balloon and devices at the edge of the network become powerful enough to run them. Wanda compared weights within each row of the weight matrices (pruning per neuron) rather than within each weight matrix (pruning per layer) or across the model as a whole. The scale at which weights are compared turns out to be important — an interesting avenue for further research.

We’re thinking: We came up with a joke about a half-LLaMA, but it fell flat.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.