Dear friends,
In the last couple of days, Google announced a doubling of Gemini 1.5 Pro's input context window from 1 million to 2 million tokens, and OpenAI released GPT-4o, which generates tokens 2x faster and at 50% lower cost than GPT-4 Turbo and natively accepts and generates multimodal tokens. I view these developments as the latest in an 18-month trend. Given the improvements we've seen, best practices for developers have changed as well.
Since the launch of ChatGPT in November 2022, with key milestones that include the releases of GPT-4, Gemini 1.5 Pro, Claude 3 Opus, and Llama 3-70B, many model providers have improved their capabilities in two important ways: (i) reasoning, which allows LLMs to think through complex concepts and follow complex instructions; and (ii) longer input context windows.
The reasoning capability of GPT-4 and other advanced models makes them quite good at interpreting complex prompts with detailed instructions. Many people are used to dashing off a quick, 1- to 2-sentence query to an LLM. In contrast, when building applications, I see sophisticated teams frequently writing prompts that run 1 to 2 pages long (my teams call them “mega-prompts”) and provide complex instructions that specify in detail how we’d like an LLM to perform a task. I still see teams not going far enough in terms of writing detailed instructions. For an example of a moderately lengthy prompt, check out Claude 3’s system prompt. It’s detailed and gives clear guidance on how Claude should behave.
This is a very different style of prompting than we typically use with LLMs’ web user interfaces, where we might dash off a quick query and, if the response is unsatisfactory, clarify what we want through repeated conversational turns with the chatbot.
Further, the increasing length of input context windows has added another technique to the developer’s toolkit. GPT-3 kicked off a lot of research on few-shot in-context learning. For example, if you’re using an LLM for text classification, you might give a handful of examples — say, 1 to 5 — of text snippets and their class labels, so that the model can use them to generalize to additional texts. However, with longer input context windows — GPT-4o accepts 128,000 input tokens, Claude 3 Opus 200,000 tokens, and Gemini 1.5 Pro 1 million tokens (2 million just announced in a limited preview) — LLMs aren’t limited to a handful of examples. With many-shot learning, developers can give dozens, even hundreds of examples in the prompt, and this often works better than few-shot learning.
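As a concrete illustration, here is a minimal many-shot classification sketch in Python. It assumes the OpenAI Python client and a hypothetical sentiment-labeling task; the example list, model name, and prompt format are illustrative, and in practice the list might hold dozens or hundreds of labeled snippets.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical labeled examples; a many-shot prompt might include hundreds of these.
examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took less than five minutes.", "positive"),
    ("Customer support never answered my email.", "negative"),
    ("Exactly what I needed for travel.", "positive"),
    # ... dozens or hundreds more (text, label) pairs ...
]

def build_prompt(examples, query):
    """Pack every labeled example into a single classification prompt."""
    lines = ["Classify each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nLabel: {label}\n")
    lines.append(f"Review: {query}\nLabel:")
    return "\n".join(lines)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": build_prompt(examples, "Great value for the price.")}],
)
print(response.choices[0].message.content)
```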
When building complex workflows, I see developers getting good results with this process:
- Write a quick, simple prompt and see how it does.
- Based on where the output falls short, flesh out the prompt iteratively. This often leads to a longer, more detailed prompt, perhaps even a mega-prompt.
- If that’s still insufficient, consider few-shot or many-shot learning (if applicable) or, less frequently, fine-tuning.
- If that still doesn’t yield the results you need, break down the task into subtasks and apply an agentic workflow (see the sketch below).
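For the last step, here is a minimal sketch of breaking a task into subtasks, again assuming the OpenAI Python client; the summarization task, the particular decomposition, and the prompts are hypothetical, meant only to show how the output of one step feeds the next.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def ask(prompt):
    """One LLM call per subtask; the model name is illustrative."""
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

source_text = "..."  # the document to summarize

# Decompose the task: outline, draft, critique, then revise.
outline = ask(f"Outline the key points of the following text:\n{source_text}")
draft = ask(f"Write a short summary based on this outline:\n{outline}")
critique = ask(f"List specific weaknesses of this summary:\n{draft}")
final = ask(
    f"Revise the summary to address these weaknesses.\n"
    f"Weaknesses:\n{critique}\n\nSummary:\n{draft}"
)
print(final)
```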
I hope a process like this will help you build applications more easily. If you’re interested in taking a deeper dive into prompting strategies, I recommend the Medprompt paper, which lays out a complex set of prompting strategies that can lead to very good results.
Keep learning!
Andrew
P.S. Two new short courses:
- “Multi AI Agent Systems with crewAI” taught by crewAI Founder and CEO João Moura: Learn to take a complex task and break it into subtasks for a team of specialized agents. You’ll learn how to design agent roles, goals, and tool sets, and decide how the agents collaborate (such as which agents can delegate to other agents). You'll see how a multi-agent system can carry out research, write an article, perform financial analysis, or plan an event. Architecting multi-agent systems requires a new mode of thinking that's more like managing a team than chatting with LLMs. Sign up here!
- “Building Multimodal Search and RAG” taught by Weaviate's Sebastian Witalec: In this course, you'll create RAG systems that reason over contextual information across text, images and video. You will learn how to train multimodal embedding models to map similar data to nearby vectors, so as to carry out semantic search across multiple modalities, and learn about visual instruction tuning to add image capabilities to large language models. Sign up here!
Why ChatGPT Acts That Way
OpenAI pulled back the curtain on revised rules that will guide its models.
What’s new: OpenAI published its Model Spec, high-level guidelines for use by human labelers to steer model behavior. The company is inviting public comments on the spec until May 22. It has not stated whether or how it will incorporate comments.
How it works: During training, human labelers rate a model’s responses so it can be fine-tuned to conform with human preferences in the process known as reinforcement learning from human feedback (RLHF). The Model Spec outlines the principles — some new, some previously in use — that will drive those ratings. The principles are arranged hierarchically, and each category will override those below it.
- Three top-level objectives describe basic principles for model behavior: (i) “Assist the developer and end user” defines the relationship between humans and the model. (ii) “Benefit humanity” guides the model to consider both benefits and harms that may result from its behavior. (iii) “Reflect well on OpenAI” reinforces the company’s brand identity as well as social norms and laws.
- Six rules govern behavior. In order, models are to prioritize platform rules above requests from developers, users, and tools; follow laws; withhold hazardous information; respect intellectual property; protect privacy; and keep their output “safe for work.” (These rules can lead to contradictions. For instance, the model will comply if a user asks ChatGPT to translate a request for drug-related information because the directive to follow requests from users precedes the one to withhold hazardous information.)
- What OpenAI calls defaults govern the model’s interaction style. These include “ask clarifying questions when necessary,” “express uncertainty,” “assume an objective point of view,” and “don't try to change anyone's mind.” For example, if a user insists the Earth is flat, the model may respond, “Everyone's entitled to their own beliefs, and I'm not here to persuade you!”
- The spec will evolve in response to the AI community’s needs. In the future, developers may be able to customize it. For instance, the company is considering allowing developers to lift prohibitions on “not safe for work” output such as erotica, gore, and some profanity.
Behind the news: OpenAI’s use of the Model Spec and RLHF contrasts with Anthropic’s Constitutional AI. To steer the behavior of Anthropic models, that company’s engineers define a constitution, or list of principles, such as “Please choose the response that is the most helpful, honest, and harmless” and “Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior.” Rather than human feedback, Anthropic relies on AI feedback to interpret behavioral principles and guide reinforcement learning.
Why it matters: AI developers require a degree of confidence that the models they use will behave as they expect and in their users’ best interests. OpenAI’s decision to subject its guidelines to public scrutiny could help to instill such confidence, and its solicitation of public comments might make its models more responsive to social and market forces.
We’re thinking: OpenAI’s openness with respect to its Model Spec is a welcome step toward improving its models’ safety and performance.
AlphaFold 3 Embraces All Biochemistry
The latest update of DeepMind’s AlphaFold model is designed to find the structures of not just proteins but all biologically active molecules as well as interactions between them.
What’s new: Google announced AlphaFold 3, which models the 3D shapes of biomolecules including proteins, DNA, RNA, and ligands (molecules that bind to proteins or DNA, a category that includes antibodies and many drugs) in any combination. AlphaFold Server provides access for noncommercial uses (with some limitations). Unlike earlier versions, AlphaFold 3 is not open source.
Key insight: Given a sequence of amino acids (the building blocks of proteins), the previous version of AlphaFold drew on existing knowledge of amino acid structures, computed their locations and angles, and assembled them like Lego blocks. To adapt the system for molecules that aren’t made of amino acids, AlphaFold 3 represents them as collections of individual atoms and uses a generative model to find their positions in space.
How it works: Given a list of molecules, AlphaFold 3 generates their joint 3D structure, revealing how they fit together. Several transformers hone embeddings of proteins and amino acids, while a diffusion model (also a transformer) processes embeddings of atoms. The team trained the system on five datasets including ground-truth protein, DNA, and RNA structures and interactions in the Protein Data Bank. They also trained it on protein shapes computed by AlphaFold 2; that model’s explicit knowledge of amino acid structures helped overcome AlphaFold 3’s tendency to hallucinate in some instances. Among the key processes:
- Given a protein’s amino acid sequence, a molecule’s set of atoms, or any combination thereof, AlphaFold 3 first represents each common amino acid, nucleotide, and individual atom (that isn’t a part of a common amino acid or nucleotide) with a single token.
- For each token, the system draws on existing databases to compute a variety of features, which fall into five categories: (i) per-token features like position, (ii) features of proteins in the Protein Data Bank, (iii) features of a given molecule, (iv) features derived from a genetic search (for example, whether two amino acid sequences appear to be related evolutionarily) and (v) features that describe chemical bonds between two tokens.
- Given these features, a transformer produces a single embedding that represents all tokens and pairwise embeddings that represent relationships between each pair of tokens. A second transformer refines the pairwise embeddings based on known molecules that share subsequences of amino acids or nucleotides with the input. A third transformer further refines the embeddings.
- Given the features, embeddings, and a noisy point cloud of atoms, the diffusion model removes the noise. (That is, it learned to modify the atoms’ positions to match those in the training data. A generic sketch of this denoising idea appears after this list.)
- AlphaFold 3 learned to optimize seven additional loss terms, including one that minimized the difference between the predicted and actual length of bonds between molecules and another that minimized the difference between predicted and actual distances between pairs of atoms.
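DeepMind has not released AlphaFold 3’s code, so the sketch below is only a generic illustration of the reverse-diffusion idea applied to 3D atom coordinates: a stand-in denoiser takes the place of the trained diffusion transformer, and the noise schedule is deliberately simplified.

```python
import numpy as np

def toy_denoiser(noisy_coords, step, conditioning):
    """Stand-in for the trained diffusion transformer, which would predict the
    noise in the atom coordinates given token and pair embeddings. Here it
    simply returns zeros so the loop runs end to end."""
    return np.zeros_like(noisy_coords)

def reverse_diffusion(num_atoms, conditioning=None, steps=50, seed=0):
    """Start from random 3D positions and iteratively remove predicted noise."""
    rng = np.random.default_rng(seed)
    coords = rng.standard_normal((num_atoms, 3))  # pure noise
    for step in reversed(range(steps)):
        predicted_noise = toy_denoiser(coords, step, conditioning)
        coords = coords - predicted_noise  # denoising update (schedule omitted)
        if step > 0:
            coords += 0.01 * rng.standard_normal(coords.shape)  # small re-noising
    return coords  # one 3D position per atom

print(reverse_diffusion(num_atoms=100).shape)  # (100, 3)
```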
Results: On PoseBusters, a database of protein and protein-molecule shapes, AlphaFold 3 successfully found the shapes of about 77 percent of examples, while AutoDock Vina (a non-learning program that models molecular interactions) achieved about 53 percent. On a Protein Data Bank evaluation set, AlphaFold 3 successfully found about 84 percent of protein shapes, while AlphaFold Multimer 2.3 (an update of AlphaFold 2) found 83 percent. Modeling protein-protein interactions, AlphaFold 3 achieved 77 percent, while AlphaFold Multimer 2.3 achieved 67 percent, according to DockQ (a metric for the quality of such interactions).
Behind the news: The original AlphaFold solved one of the most challenging problems in molecular biology by figuring out how long chains of amino acids would fold, giving scientists clear targets for designing new bioactive molecules. Google spun off Isomorphic Labs to apply AlphaFold 2 to drug discovery. That company will use AlphaFold 3 and control commercial access to it.
Why it matters: AlphaFold 3 is a triumph of machine learning. It extends the utility of the previous version beyond proteins, and it computes with unprecedented accuracy how biological molecules will combine, allowing for a more comprehensive understanding of how drugs interact with the body. Its ability to predict how antibodies will bind to proteins could help stave off future pandemics and other illnesses.
We’re thinking: Although Isomorphic Labs retains control of AlphaFold 3, biologists said the information in the paper is enough for other researchers to develop similar systems. We look forward to open versions!
Learn to develop smarter search, retrieval augmented generation (RAG), and recommender systems for multimodal retrieval and generation in this short course, built in collaboration with Weaviate. Enroll today!
Building an AI Oasis
Saudi Arabia plans to spend billions of dollars to become a global AI hub.
What's new: The desert kingdom has allocated $100 billion to invest in AI and other technologies, The New York Times reported. The massive potential outlay is attracting AI giants and startups alike.
How it works: Saudi Arabia, whose economy is based on large reserves of oil, aims to channel its considerable wealth into more sustainable industries. AI is a major target.
- The state-owned Public Investment Fund (PIF) established a subsidiary, Alat, that plans to invest $100 billion in technology broadly by 2030. Alat has joined with partners to commit as much as $200 million to security and surveillance and $150 million to fully automated manufacturing.
- PIF is negotiating to establish a $40 billion AI fund with Silicon Valley venture capital firm Andreessen Horowitz. The Saudi government also established GAIA, a $1 billion partnership with U.S. venture capital firm NewNative, to provide startups with seed funding and compute resources supplied by Amazon and Google. GAIA-supported companies must register in Saudi Arabia and spend 50 percent of their investment in the country.
- In March, attendees at the third annual LEAP technology conference, held near the Saudi capital of Riyadh, inked more than $10 billion worth of technology deals. For instance, Amazon committed $5.3 billion to Saudi cloud computing infrastructure and AI training.
- The Saudi government spent considerable resources building an AI research hub at King Abdullah University of Science and Technology. The university has hired foreign AI researchers and arranged to buy more than 3,000 Nvidia H100 chips.
Behind the news: Where AI is concerned, Saudi Arabia is competing with the neighboring United Arab Emirates (UAE). In March, UAE member state Abu Dhabi established its own multibillion-dollar investment fund, MGX, which aims to secure deals in AI models, data centers, and semiconductors. One of MGX’s founding partners (and a cornerstone of the UAE’s AI efforts) is G42, a conglomerate with ties to the Emirati government that owns numerous AI research labs and other assets. G42 recently received $1.5 billion from Microsoft. Last year, it paid U.S. chip designer Cerebras an initial $100 million to build up to nine supercomputers.
Yes, but: Saudi investments have not always arrived on the expected schedule. Founders of startups that were promised GAIA funding have complained of delays and nonpayments. Moreover, U.S. partners such as Microsoft have drawn criticism for working with Saudi Arabia, which has been accused of violating human rights. The U.S. government blocked fulfillment of King Abdullah University’s purchase of Nvidia chips on concerns that it could help researchers associated with the Chinese military circumvent U.S. restrictions on exports of advanced semiconductors. Earlier this year, U.S.-based generative AI startup Anthropic rejected a potential investment from PIF, citing national security concerns.
Why it matters: AI is fast becoming a source of national power, and many countries are eager to build their capabilities. Saudi Arabia’s investment could go a long way toward building facilities and talent in a part of the world that has not been known for high tech. For the country itself, it could bring economic growth and geopolitical advantage. For foreign companies and talent, it’s an immense new source of funding to pursue valuable projects and gain practical experience.
We're thinking: We are happy to see AI hubs emerge around the world, especially in places that can provide more opportunities for people who live outside of established AI centers.
Brain-Controlled Robots Get More Versatile
Brain-computer interfaces that enable users to control robots with their thoughts typically execute a single type of task, such as reaching and grasping. Researchers designed a system that responds to a variety of intentions.
What's new: Ruohan Zhang and colleagues at Stanford introduced Neural Signal Operated Intelligent Robots (NOIR). Their method commands a robot to perform practical tasks, such as ironing a cloth or making a sandwich, via signals from an electroencephalogram (EEG), a non-invasive way to measure brain waves through electrodes attached to the scalp.
Key insight: Currently neuroscientists can derive from EEG signals only simple thoughts, such as the intention to move a limb. However, a sequence of simple thoughts can drive an arbitrarily complex action. Specifically, simple thoughts (such as the intention to move a hand) can drive a robot to perform complex actions by repeatedly (i) selecting an object, (ii) selecting an action to apply to the object, and (iii) selecting the part of the object to act upon. For instance, to iron a cloth, the initial sequence would be: (i) select the iron and (ii) grasp it (iii) by the handle. This sequence might be followed by (i) select the cloth and (ii) slide the iron across it (iii) starting at the nearest portion. And so on.
How it works: Users who wore EEG electrodes concentrated on specific sequences of thoughts to execute tasks as they watched a screen that displayed the output of a camera attached to either a robotic arm or wheeled robot with two arms.
- Before users attempted to control a robot, the authors recorded EEG signals to train the system for each individual user. Users spent 10 minutes imagining grasping a ball in their right or left hand, pushing a pedal with both feet, or focusing on a cross displayed on the screen (a resting state). The authors used the resulting data to train two Quadratic Discriminant Analysis (QDA) classifiers for each user (a minimal sketch of this step appears after this list).
- To enable users to select objects, a pretrained OWL-ViT segmented the camera image to mark individual objects on the screen. Objects available to be manipulated flickered at different frequencies between 6 and 10 times per second. When a user concentrated on an object, the resulting brainwaves synchronized with the frequency of its flickering. The system selected the object that corresponded to the most prominent frequency.
- Once the user had selected an object, the system presented up to four possible actions, such as “pick from top,” “pick from side,” and “push.” Each action was accompanied by an image of a right or left hand, feet, or a cross. To select an action, the user imagined using the designated body part or focusing on the cross. Given the EEG signal, one classifier selected the action.
- To select a location on the object, the other classifier helped the user to point at it using a cursor. To move the cursor in one direction, the user imagined using one hand. To move it in the opposite direction, the user focused on a cross. The user repeated this process for each of three axes of motion (horizontal, vertical, and depth).
- In case the system didn’t read a selection correctly, the user could reset the process by clenching their jaw.
- To make the system easier to use, the authors adapted an R3M embedding model to suggest commonly selected objects and actions. R3M was pretrained to generate similar embeddings of paired robot instructions and camera views and dissimilar embeddings of mismatched robot instructions and camera views. The authors added several fully connected layers and trained them on the individual-user data to produce similar embeddings of images from the camera with the same object-action combination and dissimilar embeddings of images with other object-action combinations. Given an image from the camera, the model returned the object-action that corresponded to the most similar image.
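The per-user calibration and action-selection steps above map naturally onto standard tools. Below is a minimal sketch assuming scikit-learn's QuadraticDiscriminantAnalysis, with random arrays standing in for real EEG feature windows; the feature dimensions, class labels, and class-to-action mapping are illustrative rather than the authors' exact setup.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Stand-in calibration data: one feature vector per EEG window, labeled with the
# imagined movement (0 = left hand, 1 = right hand, 2 = feet, 3 = rest/cross).
X_calibration = rng.standard_normal((400, 32))   # 400 windows x 32 features
y_calibration = rng.integers(0, 4, size=400)

# One QDA classifier per user, fit on roughly 10 minutes of calibration data.
clf = QuadraticDiscriminantAnalysis()
clf.fit(X_calibration, y_calibration)

# At run time, map the decoded class to one of the on-screen action choices.
actions = {0: "pick from top", 1: "pick from side", 2: "push", 3: "rest"}
new_window = rng.standard_normal((1, 32))
predicted_class = int(clf.predict(new_window)[0])
print(actions[predicted_class])
```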
Results: Three users controlled the two robots to execute 20 everyday tasks. On average, the system selected objects with 81.2 percent accuracy, actions with 42.2 percent accuracy, and locations with 73.9 percent accuracy. Users took an average of about 20 minutes to complete each task.
Why it matters: Brain signals are enormously complex, yet relatively simple statistical techniques — in this case, QDA — can decode them in useful ways.
We're thinking: Sometimes the simplest solution to a difficult problem is not to train a larger model but to break down the problem into manageable steps.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.