Dear friends,
One of the dangers of large language models (LLMs) is that they can confidently make assertions that are blatantly false. This raises worries that they will flood the world with misinformation. If they could moderate their degree of confidence appropriately, they would be less likely to mislead.
People are prone to following authority figures. Because a lot of text on the internet is written in an authoritative style — hopefully because the authors know what they’re talking about — LLMs have learned to mimic this style. Unfortunately, LLMs can speak in this style even when they get the facts completely wrong.
Playing with ChatGPT, the latest language model from OpenAI, I found it to be an impressive advance over its predecessor, GPT-3. Occasionally it says it can’t answer a question. This is a great step! But, like other LLMs, it can be hilariously wrong. Work lies ahead to build systems that can express different degrees of confidence.
Keep learning!
Andrew
News

More Plausible Text, Familiar Failings

Members of the AI community tested the limits of the ChatGPT chatbot, unleashing an avalanche of tweets that made for sometimes-great, sometimes-troubling entertainment.

What’s new: OpenAI launched a public demo of ChatGPT, the latest in the research lab’s line of large language models. Like its predecessors, ChatGPT generates text in a variety of styles, for a variety of purposes. Unlike them, it does so with greater finesse, detail, coherence, and — dare we say it? — personality. (How else to characterize a model that apologizes for its misbehavior?) One million users have signed up since the launch last Wednesday.

How it works: ChatGPT is a next-generation language model (of a class referred to as GPT-3.5) trained in the manner of OpenAI’s earlier InstructGPT, but on conversations. It was fine-tuned to minimize harmful, untruthful, or biased output using a combination of supervised learning and what OpenAI calls reinforcement learning from human feedback, in which humans rank potential outputs and a reinforcement learning algorithm rewards the model for generating outputs similar to those that rank highly.

Strengths and weaknesses: Like other recent language models, ChatGPT’s output veers between stunningly brilliant and mind-numbingly stupid.
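To make the human-feedback step more concrete, here is a minimal sketch of how a reward model can be learned from human rankings. It is an assumed, simplified pairwise formulation for illustration, not OpenAI’s implementation; the reward_model function and its input shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the preference-ranking step in reinforcement learning from
# human feedback. Assumptions throughout; not OpenAI's implementation.
# `reward_model(batch)` is assumed to return one scalar score per example.

def preference_loss(reward_model, preferred, rejected):
    """Pairwise loss that pushes the reward model to score the human-preferred
    output above the rejected one (Bradley-Terry style)."""
    r_preferred = reward_model(preferred)  # shape (batch,), assumed
    r_rejected = reward_model(rejected)    # shape (batch,), assumed
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# The trained reward model then scores candidate responses, and a reinforcement
# learning algorithm (OpenAI's InstructGPT work used PPO) fine-tunes the language
# model to produce responses that earn high scores.
```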
Behind the news: ChatGPT arrived one week after Meta withdrew Galactica, a model designed to generate scientific papers. Galactica was promoted as an aid to researchers aiming to publish their findings, but users of the public demo prompted it to generate sober dissertations on nonsensical topics like land squid and the health benefits of ingesting ground glass.

Why it matters: Speech is among the simplest and most convenient ways for humans to communicate. Programs that grasp what they’re told and respond with meaningful information will enable a wide range of everyday applications. Closer to home, many observers proposed ChatGPT or something like it as a superior alternative to current web search. First, though, researchers face the steep challenge of building a language model that doesn’t make up facts or ignore limits on its output.

We’re thinking: Sometimes technology is overhyped — reinforcement learning, after solving Atari games, may be an example — but large language models are likely to find a place in significant applications. Meanwhile, many details remain to be worked out, and the AI community must strive to minimize potential harm.
Cryptocurrency Unsafe for AI

The collapse of cryptocurrency exchange FTX threatens funding for some teams devoted to AI safety.

What’s new: FTX, the $32 billion exchange that plunged into bankruptcy last month amid allegations of fraud, had given or promised more than $530 million to over 70 AI-related organizations, The New York Times reported. Much of that money may have to be returned.

What happened: FTX founder Sam Bankman-Fried and his associates used the exchange’s holdings to dole out grants or investments to AI-related startups, labs, and think tanks, many of them focused on AI safety. People associated with these groups anonymously expressed concerns that their funding would be clawed back in bankruptcy proceedings.
Behind the news: Bankman-Fried co-founded FTX in 2019 to enable users to trade cryptocurrency for conventional money and other assets. A November report by CoinDesk, a cryptocurrency news outlet, described a potential conflict of interest between FTX and another trading firm also owned by Bankman-Fried. The news prompted users to withdraw their funds, much of which FTX had already spent, invested, given away, or promised to others. The exchange filed for bankruptcy. U.S. prosecutors and regulators are investigating potential wrongdoing.

We’re thinking: We’re grateful for donors who are willing to support AI research of all kinds. At the same time, we’re appalled by the scope and brazenness of FTX’s deceit. Sadly, organizations that seek funding must vet potential donors carefully.
A MESSAGE FROM DEEPLEARNING.AI

Join Sebastián Ramírez, the creator of FastAPI, to build your own AI image-generation web app. “FastAPI for Machine Learning: Live Coding an ML Web Application” takes place on December 15, 2022, at 9:00 a.m. Pacific Time. RSVP
How Alexa Says Goodnight

Too exhausted (or unimaginative) to tell your child a bedtime story? Amazon’s smart displays can spin bespoke tales on demand.

What’s new: A feature called Create with Alexa generates children’s stories complete with illustrations, music, and sound effects on the Amazon Echo Show device.

How it works: The screen presents a series of prompts that provide a setting (such as “space exploration” or “enchanted forest”), main character (such as an astronaut or an alien), principal color, and tone (such as “happy” or “mysterious”).
Behind the news: Amazon is under pressure to revitalize its 10-year-old Echo line. The devices, which have been sold at a loss on the theory that they would spur purchases of other goods, lost $10 billion in 2022 alone, and the division responsible for the Alexa software faces steep layoffs.
Seeing What Comes Next

If a robot can predict what it’s likely to see next, it may have a better basis for choosing an appropriate action — but it has to predict quickly. Transformers, for all their utility in computer vision, aren’t well suited to this because of their steep computational and memory requirements. A new approach could change that.

What’s new: Agrim Gupta and colleagues at Stanford devised Masked Visual Pre-Training for Video Prediction (MaskViT), a transformer model that generates likely future video frames with far less computation than earlier transformer-based approaches. You can see its output here.

Key insight: Transformers typically predict one token per forward pass (processing every layer in the model from first to last). The amount of processing required for this approach is manageable when generating an image, which may be divided among hundreds or thousands of tokens. But it becomes very time-consuming when generating video, which involves many images. Predicting multiple tokens at once reduces the number of forward passes needed to generate video, significantly accelerating the process.

How it works: MaskViT consists of an image tokenizer (VQ-GAN, a discrete variational autoencoder) and a transformer. The authors trained and tested it on three video datasets: RoboNet (15 million frames that depict robotic arms interacting with objects), BAIR (a smaller dataset that shows a robot pushing things on a table top), and KITTI (57 videos recorded from a car driving on roads in Germany). Depending on the dataset, the model generated 10 to 25 video frames following one to five initial frames.
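To illustrate the key insight, here is a minimal sketch of iterative parallel decoding over a grid of discrete video tokens: each forward pass predicts every remaining position, and only the most confident predictions are committed. This is an assumed, simplified scheme rather than the paper’s code; the transformer call, MASK_ID, grid size, and step count are placeholders.

```python
import torch

# Illustrative iterative parallel decoding over masked video tokens (assumptions
# throughout, not the paper's code). `transformer(tokens)` is assumed to return
# logits of shape (num_tokens, vocab_size); MASK_ID marks ungenerated positions.

MASK_ID = 0        # placeholder id for masked positions (assumption)
NUM_TOKENS = 256   # e.g., a 16x16 latent grid of tokens to predict (assumption)
NUM_STEPS = 12     # forward passes used to fill in every token (assumption)

def parallel_decode(transformer, num_tokens=NUM_TOKENS, num_steps=NUM_STEPS):
    tokens = torch.full((num_tokens,), MASK_ID)          # start fully masked
    unknown = torch.ones(num_tokens, dtype=torch.bool)   # positions still masked
    for step in range(num_steps):
        logits = transformer(tokens)                     # one pass predicts every position
        confidence, guesses = logits.softmax(dim=-1).max(dim=-1)
        confidence[~unknown] = -1.0                      # never overwrite committed tokens
        # Commit the most confident slice of the remaining tokens this step.
        num_to_fix = max(1, int(unknown.sum().item() / (num_steps - step)))
        keep = confidence.topk(num_to_fix).indices
        tokens[keep] = guesses[keep]
        unknown[keep] = False
    return tokens  # discrete codes that a VQ-GAN decoder would turn into frames
```

A few dozen forward passes thus replace one pass per token, which is where the speedup described below comes from.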
Results: The authors compared their model’s efficiency at inference with that of earlier transformer-based approaches. On BAIR, for instance, MaskViT required 24 forward passes to generate 15 frames, while the previous state of the art, VT, needed 3,840. As for predictive ability, on BAIR, MaskViT achieved 93.7 Fréchet Video Distance (FVD), a measure of how closely a generated distribution resembles the original distribution (lower is better). That’s better than VT (94.0 FVD) and roughly equal to the best non-transformer approach, FitVid (93.6 FVD). On the more complicated RoboNet dataset, MaskViT achieved 133.5 FVD, while FitVid achieved 62.5 FVD. (VT results on that dataset are not reported.)

Yes, but: The authors compared numbers of forward passes at inference, but they didn’t compare processing time. Different models take different amounts of time to run, so a smaller number of forward passes doesn’t guarantee faster generation. That said, given differences among hardware, machine learning libraries, and programming languages, it would be hard to compare execution speeds directly.

Why it matters: Beyond reducing forward passes, the authors came up with an interesting way to improve output quality. During inference, 100 percent of the tokens to be generated start out missing and are filled in gradually over the course of generation. Typical training practice masks a fixed percentage of tokens, so the model never encounters such a large percentage of missing tokens. Instead, the authors masked a variable portion of tokens during training, up to 100 percent (a rough sketch follows below). This better aligned the training and inference tasks, which yielded better results.

We’re thinking: Giving robots the ability to predict visual changes could make for a generation of much safer and more capable machines. We look forward to future work that integrates this capability with planning algorithms.
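The variable-masking training step mentioned above might look roughly like this. It is a sketch under assumed shapes and helper names (the transformer signature, MASK_ID, and the sampling range for the mask ratio), not the paper’s implementation.

```python
import random
import torch
import torch.nn.functional as F

# Rough sketch of training with a variable masking ratio (assumed helpers, not
# the paper's code). `tokens` are discrete VQ-GAN codes for the future frames,
# shape (batch, num_tokens); `transformer` returns per-position logits.

MASK_ID = 0  # placeholder id for masked positions (assumption)

def masked_training_step(transformer, tokens, optimizer):
    batch, num_tokens = tokens.shape
    # Sample a masking ratio that can reach 100 percent, so training sees the
    # fully masked condition that inference starts from. Lower bound is assumed.
    ratio = random.uniform(0.5, 1.0)
    mask = torch.rand(batch, num_tokens) < ratio

    inputs = tokens.clone()
    inputs[mask] = MASK_ID

    logits = transformer(inputs)  # (batch, num_tokens, vocab_size), assumed
    # Compute the loss only on the masked positions the model must reconstruct.
    loss = F.cross_entropy(logits[mask], tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```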
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter out of your spam folder, add our email address to your contacts list.