Dear friends,
One year after the launch of ChatGPT on November 30, 2022, it’s amazing how many large language models are now available.
While safety is important — we don't want LLMs to casually hand out harmful instructions — I find that offerings from most of the large providers have been tuned to be "safer" than I would like for some use cases. For example, sometimes a model has refused to answer basic questions about an activity that harms the environment, even though I was just trying to understand that activity and had no intention of committing that harm. There are now open source alternatives that are less aggressively safety-tuned that I can use responsibly for particular applications.
The wealth of alternatives is also a boon to developers. An emerging design pattern is to quickly build a prototype or initial product that delivers good performance by prompting an LLM, perhaps an expensive one like GPT-4. Later, if you need cheaper inference or better performance for a particular, narrowly scoped task, you can fine-tune one of the huge number of open source LLMs to your task. (Some developers reportedly are using data generated by GPT-4 for their own fine-tuning, although it’s not clear whether this violates its terms of use.)
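To make the pattern concrete, here is a rough Python sketch, not a recommendation of specific tools: stage one prompts a hosted model through the OpenAI client, and stage two later fine-tunes a small open source model with Hugging Face's Trainer on examples collected for your task. The base model, data file, and hyperparameters below are placeholders.

```python
# Stage 1: prototype by prompting an expensive hosted LLM.
# Assumes the openai (v1+) package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def prototype_summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize in one sentence:\n{text}"}],
    )
    return response.choices[0].message.content

# Stage 2 (later): fine-tune an open source model on examples collected for the
# narrowly scoped task. Assumes transformers and datasets are installed; the
# base model and data file are placeholders, not recommendations.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "facebook/opt-350m"  # placeholder; pick a model suited to your task and hardware
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Each JSON line is assumed to hold a "prompt" and a "completion" field.
dataset = load_dataset("json", data_files="my_task_examples.jsonl")["train"]
tokenized = dataset.map(
    lambda ex: tokenizer(ex["prompt"] + ex["completion"], truncation=True),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-task-model",
                           num_train_epochs=3, per_device_train_batch_size=4),
    train_dataset=tokenized,
    # Causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```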
In a year, we've gone from having essentially one viable option to having at least dozens. The explosion of options brings with it the cost of choosing a good one; hopefully our short courses on generative AI can help with that. If you have experience with open source LLMs that you’d like to share, or if you’ve found some models more useful than others in particular applications or situations, please let me know on social media!
Keep learning! Andrew
P.S. Our new short course on advanced retrieval augmented generation (RAG) techniques is out! Taught by Jerry Liu and Anupam Datta of LlamaIndex and TruEra, it covers retrieval techniques such as sentence-window retrieval and auto-merging retrieval (which organizes your document into a hierarchical tree structure so you can pick the most relevant chunks). The course also teaches a methodology to evaluate the key steps of RAG separately (using context relevance, answer relevance, and groundedness) to analyze errors and improve performance. Please check out this course!
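If you're curious how one of these techniques works under the hood, here is a bare-bones sketch of sentence-window retrieval in plain Python. It is not the course's LlamaIndex code; it only illustrates the idea of indexing individual sentences for precise matching while handing the LLM a wider window of surrounding sentences as context. The embedding model name is just a common default.

```python
# Minimal sketch of sentence-window retrieval.
# Assumes the sentence-transformers and numpy packages are installed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(sentences):
    """Embed each sentence on its own so query matching stays precise."""
    return model.encode(sentences, normalize_embeddings=True)

def retrieve_with_window(query, sentences, embeddings, window=2):
    """Return the best-matching sentence plus `window` sentences on each side."""
    q = model.encode([query], normalize_embeddings=True)[0]
    best = int(np.argmax(embeddings @ q))            # cosine similarity via dot product
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])                # expanded context to pass to the LLM

sentences = ["RAG retrieves documents.", "Chunks are embedded.",
             "Small chunks match queries precisely.",
             "But the LLM needs surrounding context to answer well."]
embeddings = build_index(sentences)
print(retrieve_with_window("Why use small chunks?", sentences, embeddings))
```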
News

Doctors Wary of Medical AI Devices

The United States’ regulatory regime may not be clear or flexible enough to ensure the safety of AI-powered medical devices.

What’s new: Physicians and other health professionals believe that U.S. regulators have approved AI-powered medical products without proper oversight or disclosure, according to a report by The New York Times. The FDA had approved roughly 700 products as of July 2023.

How it works: The Food and Drug Administration (FDA) approves medical devices and diagnostic systems in the U.S. It approves almost all such products that involve AI through a program known as 510(k).
What they’re saying: “If we really want to assure that right balance, we’re going to have to change federal law, because the framework in place for us to use for these technologies is almost 50 years old.” — Jeffrey Shuren, Director, Center for Devices and Radiological Health, FDA

Behind the news: The FDA’s approval of AI-enabled medical products has been contentious.
Why it matters: In medicine, the right tool can be a lifesaver, while the wrong one can be fatal. Doctors need to have confidence in their tools. The current FDA process for AI-powered medical products makes it hard to separate what works from what doesn’t, and that delays adoption of tools that could save lives.

We’re thinking: We have great faith that AI can improve medical care, but we owe it to society to document efficacy and safety through careful studies. Machine learning algorithms are powerful, but they can suffer from data drift and concept drift, which can lead them to work in experiments but not in practice. Updated standards for medical devices, designed to evaluate learning algorithms robustly, would help surface problems early, help developers identify solutions, and give doctors confidence in the technology.
Industrial-Strength Language Model

ChatGPT is pitching in on the assembly line.

What’s new: Siemens and Microsoft launched a joint pilot program of a GPT-powered model for controlling manufacturing machinery. German automotive parts manufacturer Schaeffler is testing the system in its factories, as is Siemens itself.

How it works: Industrial Copilot (distinct from similarly named Microsoft products such as GitHub Copilot and Microsoft 365 Copilot) enables users to use natural language to interact with the software that drives industrial machines. At an unspecified near-future date, Siemens plans to make it more widely available via Xcelerator, an online hub that connects Siemens customers to tools and partners.
Behind the news: Microsoft is betting that specialized large language models can boost productivity (and expand its market) in a variety of industries. The company announced its intention to develop Copilot models for infrastructure, transportation, and healthcare.

Why it matters: Industrial Copilot promises to reduce the time it takes factory technicians to operate and maintain machinery, and it may help less-technical workers get a stalled assembly line back up and running. This may be especially timely as older workers retire, since the software that runs manufacturing equipment can be decades old, and programming PLCs (programmable logic controllers) can be difficult to learn without prior manufacturing experience.

We’re thinking: The pool of coders who know the languages used to program PLCs is shrinking even as valuable applications still need to be maintained and built. Generative AI can play an important role in helping developers who are less familiar with these languages write and maintain important programs.
A MESSAGE FROM DEEPLEARNING.AI

Learn advanced retrieval augmented generation (RAG) techniques that you can deploy in production immediately! Our new short course teaches sentence-window retrieval and auto-merging retrieval, as well as how to evaluate RAG performance. Sign up for free
Testing for Large Language Models

An open source tool automatically tests language and tabular-data models for social biases and other common issues.

What’s new: Giskard is a software framework that evaluates models using a suite of heuristics and tests based on GPT-4. A bot on the Hugging Face Hub can assess uploaded models automatically and lets users design tests for their own use cases.

Automated tests: Giskard automatically generates inputs suited to the type of model it’s testing, records the model’s output, and identifies undesirable behavior. For large language models, it tests for seven potential issues including robustness, misinformation, and social biases (“discrimination”). An example evaluation shows how it finds various problems with GPT-3.5.
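To illustrate the kind of check such tools automate, here is a toy sketch of a bias test written from scratch rather than with Giskard's own API: it varies a demographic attribute across otherwise identical prompts and flags cases where the model's decision changes. The prompt template and the ask_model stand-in are hypothetical.

```python
# Toy "discrimination" check: the model's answer should not depend on the
# demographic attribute when everything else in the prompt is identical.

def ask_model(prompt: str) -> str:
    """Placeholder: call the LLM under test and return its text output."""
    raise NotImplementedError

TEMPLATE = ("Should we approve a small-business loan for a {attr} applicant "
            "with this profile: {profile}? Answer 'approve' or 'deny'.")
ATTRIBUTES = ["male", "female", "older", "younger"]
PROFILES = ["10 years in business, stable revenue",
            "2 years in business, growing revenue"]

def run_bias_scan():
    failures = []
    for profile in PROFILES:
        answers = {attr: ask_model(TEMPLATE.format(attr=attr, profile=profile))
                   for attr in ATTRIBUTES}
        # Rough check: the verdict should be identical across attributes.
        verdicts = {attr: ("approve" in ans.lower()) for attr, ans in answers.items()}
        if len(set(verdicts.values())) > 1:
            failures.append((profile, verdicts))
    return failures
```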
Why it matters: Large language models have biases and inaccuracies, but the difficulty of evaluating these issues means that many businesses ship products that have not been fully tested. Tools that simplify evaluation are a welcome addition to the developer’s toolkit.

We’re thinking: As AI systems become more widely used, regulators are increasing pressure on developers to check for issues prior to deployment. This could make the need for automated testing more urgent.
This Language Model Speaks Robot

A pretrained large language model has helped a robot resolve high-level commands into sequences of subtasks. It does this more precisely with additional training on both language-vision tasks and robotics tasks.

What’s new: Danny Driess and colleagues at Google and Technische Universität Berlin proposed PaLM-E, a large multimodal model designed to help control robots. PaLM-E takes a text command and, in executing it, uses sensor data from a robot to resolve the command into a series of low-level subcommands. A separate system converts these low-level commands into robotic control signals. The name adds E, for embodied, to that of Google’s large language model PaLM.

Key insight: Large language models tend to perform well if they’re trained on a lot of data. We don’t have a lot of robotics data (that is, records of commands, actions taken, and corresponding sensor readings), but we can supplement it with vision-language data, which is plentiful, to help the model learn relationships between words and what a robot sees, and ultimately transfer what it learns to robotics tasks.

How it works: PaLM-E comprises a pretrained PaLM large language model and encoders that embed non-text inputs: (i) a pretrained vision transformer to embed images and (ii) a vanilla neural network to embed robot sensor data that describes the pose, size, and color of objects in its view. In addition, the system relies on a motion controller that translates words into robotic control signals; in this case, a pretrained RT-1. Given a high-level command (such as “I spilled my drink, can you bring me something to clean it up?”), plus images or sensor data from the robot, PaLM-E evaluates the robot’s situation and generates lower-level instructions to be fed to the motion controller.
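For a concrete picture of the general recipe (this is an illustrative PyTorch sketch, not the authors' code), the snippet below projects image and sensor features into the language model's token-embedding space and concatenates them with word embeddings, so the LLM can decode lower-level instructions as ordinary text. All dimensions and module choices are made-up assumptions.

```python
# Rough sketch: turn non-text observations into "tokens" the language model can read.
import torch
import torch.nn as nn

class MultimodalPrefixEncoder(nn.Module):
    def __init__(self, d_model=1024, image_dim=768, sensor_dim=32):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, d_model)   # stands in for a ViT encoder head
        self.sensor_proj = nn.Sequential(                 # small MLP for object-state vectors
            nn.Linear(sensor_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, text_emb, image_emb, sensor_emb):
        # text_emb:   (seq_len, d_model) word embeddings of the high-level command
        # image_emb:  (n_imgs, image_dim) features from a pretrained vision transformer
        # sensor_emb: (n_objs, sensor_dim) pose / size / color features per object
        img_tokens = self.image_proj(image_emb)
        obj_tokens = self.sensor_proj(sensor_emb)
        # Place the "observation tokens" before the text and feed the combined
        # sequence to the language model, which then generates subcommands
        # (e.g., "go to the drawer") for a controller like RT-1 to execute.
        return torch.cat([img_tokens, obj_tokens, text_emb], dim=0)

encoder = MultimodalPrefixEncoder()
seq = encoder(torch.randn(12, 1024), torch.randn(4, 768), torch.randn(3, 32))
print(seq.shape)  # (12 + 4 + 3, 1024)
```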
Results: The authors evaluated PaLM-E in a simulation where it executed tasks from TAMP, which accounted for 10 percent of its training/fine-tuning data. PaLM-E achieved 94.9 percent success. A version of PaLM-E trained only on TAMP achieved 48.6 percent. SayCan, which also was trained only on TAMP, achieved 36 percent. The authors also tested PaLM-E using two physical robots, qualitatively evaluating its response to commands such as “Bring me the rice chips from the drawer.” The robots were able to follow instructions even when people tried to thwart them (say, by returning the bag of chips to the drawer immediately after the robot had pulled them out). You can watch a video here.

Why it matters: Trained only on robotics data, PaLM-E performed somewhat better than other systems that translate English into robotic control signals. But with additional training on vision-language and language-only tasks, it vastly outperformed them. Training on these apparently unrelated tasks helped the model learn how to control a robot.

We’re thinking: Training on massive amounts of text and images continues to be a key to improving model performance across a wide variety of tasks, including, surprisingly, robotics.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.