Dear friends,
A small number of people are posting text online that’s intended for direct consumption not by humans, but by LLMs (large language models). I find this a fascinating trend, particularly when writers are incentivized to help LLM providers better serve their users!
A human would find such a long document painful to navigate and read, but an LLM would do just fine ingesting it and deciding what functions to use and when!
Because LLMs and people ingest different kinds of text more easily, writing for LLMs differs from writing for humans. Further, when an author has an incentive to help an LLM understand a topic better (so the LLM can explain it better to users), they may write text aimed at the LLM rather than at human readers.
Keep learning!
Andrew
P.S. I like LLMs, but I like humans even more. So please keep writing text for humans as well. 😀
MESSAGES FROM DEEPLEARNING.AI
Learn how to develop applications with large language models by building AI-powered games! Gain essential skills by designing a shareable text-based game and integrating safety features. If you’ve completed our AI Python for Beginners series or want to improve your coding skills in a fun, interactive way, this course is for you! Start today
News
Next-Gen Models Show Limited Gains

Builders of large AI models have relied on the idea that bigger neural networks trained on more data and given more processing power would show steady improvements. Recent developments are challenging that idea.

What’s new: Next-generation large language models from OpenAI, Google, and Anthropic are falling short of expectations, employees at those companies told multiple publications. All three companies are responding by shifting their focus from pretraining to enhancing performance through techniques like fine-tuning and multi-step inference.

Scaling law basics: A classic 2020 paper shows that, assuming a sufficient quantity of data, a transformer network’s performance rises predictably with increases in model size (demonstrated between 768 parameters and 1.5 billion parameters). Likewise, assuming sufficient model size, performance rises predictably with increases in dataset size (demonstrated between 22 million tokens and 23 billion tokens). Furthermore, performance rises predictably with increases in both model and dataset sizes. The 2022 Chinchilla paper shows that, to build a compute-optimal model, every 4x increase in compute requires a 2x increase in the sizes of both the model and the dataset (demonstrated for models between 70 million and 16 billion parameters, trained on between 5 billion and 500 billion tokens). Given their limited experimentation and the lack of a theoretical basis for their findings, the authors didn’t determine whether these relationships would continue to hold at larger scales.

Diminishing returns: Major AI companies have been counting on scaling laws to keep their models growing more capable at a steady pace. However, the next generation of high-profile models has not shown the expected improvements despite larger architectures, more training data, and more processing power.
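To make the Chinchilla rule above concrete, here is a minimal sketch under two assumptions that are not stated in the article: the common approximation that training compute is roughly 6 × parameters × tokens, and a baseline taken from the smallest model and dataset sizes mentioned above, used purely for illustration rather than as the paper's fitted optimum.

```python
# Illustrative sketch of the Chinchilla-style rule described above:
# a 4x increase in training compute is matched by roughly a 2x increase
# in both model size and dataset size, using the common approximation
# C ~ 6 * N * D for transformer training FLOPs (an assumption, not a
# figure from the article).

def compute_optimal(n_params: float, n_tokens: float, compute_multiplier: float):
    """Scale a baseline (n_params, n_tokens) to a larger compute budget."""
    scale = compute_multiplier ** 0.5   # 4x compute -> 2x model and 2x data
    return n_params * scale, n_tokens * scale

baseline_params = 70e6   # 70 million parameters (smallest size mentioned above)
baseline_tokens = 5e9    # 5 billion training tokens (smallest size mentioned above)

for multiplier in (1, 4, 16, 64):
    n, d = compute_optimal(baseline_params, baseline_tokens, multiplier)
    flops = 6 * n * d    # approximate training compute
    print(f"{multiplier:>3}x compute -> {n / 1e6:7.0f}M params, "
          f"{d / 1e9:6.1f}B tokens, ~{flops:.2e} FLOPs")
```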
What they’re saying: AI leaders are divided on the future of scaling laws as they are currently understood.
Why it matters: AI’s phenomenal advance has drawn hundreds of millions of users and sparked a new era of progress and hope. Slower-than-expected improvements in future foundation models may blunt this progress. At the same time, the cost of training large AI models is rising dramatically. The latest models cost as much as $100 million to train, and this number could reach $100 billion within a few years, according to Anthropic’s Dario Amodei. Rising costs could lead companies to reallocate their gargantuan training budgets and researchers to focus on more cost-effective, application-specific approaches.
No Game Engine Required

A real-time video generator lets you explore an open-ended, interactive virtual world: a video game without a game engine.

What’s new: Decart, a startup that’s building a platform for AI applications, and Etched, which designs specialized AI chips, introduced Oasis, which generates a Minecraft-like game in real time. The weights are open and available here. You can play with a demo here.

How it works: The system generates one frame at a time based on a user’s keystrokes, mouse movements, and previously generated frames. The training dataset is undisclosed, but it’s almost certainly based on videos of Minecraft gameplay, given the output’s striking resemblance to that game.
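As a rough illustration of that frame-by-frame loop (not Decart’s actual implementation), here is a minimal Python sketch. The `model.generate_frame`, `read_controls`, and `display` components are hypothetical stand-ins; the point is only that each new frame is conditioned on recent frames plus the player’s current keyboard and mouse input.

```python
# Hedged sketch of an autoregressive, control-conditioned frame loop.
# All component names are illustrative placeholders, not Oasis APIs.
import time
from collections import deque

def run_interactive_session(model, first_frame, read_controls, display,
                            context_len=16, fps=20):
    frames = deque([first_frame], maxlen=context_len)  # rolling history of generated frames
    while True:
        controls = read_controls()                     # current keystroke / mouse state
        next_frame = model.generate_frame(             # one generation step per frame
            past_frames=list(frames),
            controls=controls,
        )
        frames.append(next_frame)
        display(next_frame)
        time.sleep(1.0 / fps)                          # crude pacing toward ~20 frames per second
```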
Results: The Oasis web demo enables users to interact with 360-by-360-pixel frames at 20 frames per second. Users can place blocks, place fences, and move through a Minecraft-like world. The demo starts with an image of a location, but users can upload their own image (turning, say, a photo of your cat into a blocky Minecraft-style level, as reported by Wired).

Yes, but: The game has its fair share of issues. For instance, objects disappear and menu items change unaccountably. The world’s physics are similarly inconsistent: players don’t fall into holes dug directly beneath them, and, after jumping into water, players are likely to find themselves standing on a blue floor.

Behind the news: In February, Google announced Genie, a model that generates two-dimensional platformer games from input images. We weren’t able to find a publicly available demo or model.

Why it matters: Oasis is more a proof of concept than a product. Nonetheless, as an open-world video game entirely generated by AI (albeit based on data produced by a traditional implementation), it sets a bar for future game generators.

We’re thinking: Real-time video generation suggests a wealth of potential applications: say, a virtual workspace for interior decorating that can see and generate your home, or an interactive car repair manual that can create custom clips based on your own vehicle. Oasis is an early step in this direction.
Introducing Data Points

AI news is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points, a sister publication of The Batch, arrives in your inbox twice a week with six brief news stories in each issue. Read the latest issues and subscribe today
Further Chip Restrictions on China

The largest manufacturer of AI chips told its Chinese customers it would stop fabricating their most advanced designs, further limiting China’s access to AI hardware.

What’s new: Taiwan Semiconductor Manufacturing Company (TSMC) notified Alibaba, Baidu, and others that it would halt production of their most advanced chips starting November 13, according to multiple reports. The restriction affects chip designs that are based on manufacturing processes at scales of 7 nanometers and below. TSMC must receive explicit permission from the U.S. government to manufacture advanced chips for a given customer, which likely would require that the government assess each chip to prevent potential military applications.

How it works: The United States Department of Commerce ordered TSMC to halt shipments of advanced AI chips to China after a chip fabricated by TSMC was discovered in an AI system sold by the Chinese telecoms giant Huawei, apparently in violation of earlier U.S. controls, Reuters reported. Taiwan’s economic ministry said it would follow all domestic and international regulations.
Behind the news: The U.S.-China chip standoff began in 2020 and has escalated since. Initial restrictions barred U.S.-based companies like AMD, Intel, and Nvidia from selling advanced chips to Huawei and affiliated Chinese firms. China responded by promoting domestic chip fabrication. In 2022, the U.S. passed the CHIPS and Science Act to boost its own chip industry, seeking to counter China and decrease U.S. reliance on Taiwan.

Why it matters: TSMC finds itself in the middle of an AI arms race in which cutting-edge chips could tip the balance. The company itself, which has been operating at full capacity, is unlikely to suffer business losses.

We’re thinking: AI developers in China have been resourceful in navigating previous restrictions. Chip manufacturing is extraordinarily difficult to master, but China has made strides in this direction. A proliferation of factories that can fabricate advanced chips would reshape AI research and business worldwide.
More-Efficient Training for Transformers

Researchers cut the processing required to train transformers by around 20 percent with only a slight degradation in performance.

What’s new: Xiuying Wei and colleagues at the Swiss Federal Institute of Technology Lausanne replaced a transformer’s linear layers with approximations based on computationally efficient low-rank linear layers.

Key insight: A low-rank approximation replaces a matrix with a product of two smaller matrices. This technique is widely used to streamline fine-tuning via LoRA, which modifies the weights in each of a transformer’s linear layers by adding a learned low-rank approximation. As a direct replacement for the weights in linear layers, low-rank approximation saves processing during training, but it also causes unstable fluctuations in the training loss and slower convergence. The authors mitigated these undesirable effects by training each full-size layer in parallel with a low-rank approximation of the layer while gradually phasing out the full-size layer. This approach costs more memory and computation initially, but it saves those resources in the long run.

How it works: The authors modified a transformer (1.3 billion parameters) to use low-rank approximation, which trimmed the parameter count to 985 million. They trained both the modified and full-size models on 25.5 billion tokens of text scraped from the web, filtered, and deduplicated.
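The phase-out scheme described under “Key insight” might look roughly like the PyTorch sketch below. This is an illustration, not the authors’ code: the class name, layer sizes, rank, and annealing interface are assumptions made for the example, and the real method’s schedule and placement within the transformer may differ.

```python
# Minimal sketch: a full-size linear layer trained in parallel with a
# low-rank factorization (V then U), with a blending weight alpha that a
# training loop would anneal from 1.0 (fully full-size) toward 0.0
# (fully low-rank). Sizes, rank, and schedule are illustrative assumptions.
import torch
import torch.nn as nn

class PhasedLowRankLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.full = nn.Linear(d_in, d_out, bias=False)    # full-size layer, gradually phased out
        self.V = nn.Linear(d_in, rank, bias=False)        # low-rank factor: d_in -> rank
        self.U = nn.Linear(rank, d_out, bias=False)       # low-rank factor: rank -> d_out
        self.register_buffer("alpha", torch.tensor(1.0))  # 1.0 = output comes from the full layer

    def set_alpha(self, alpha: float):
        """Called from the training loop to anneal alpha from 1.0 down to 0.0."""
        self.alpha.fill_(alpha)

    def forward(self, x):
        low_rank = self.U(self.V(x))
        return self.alpha * self.full(x) + (1 - self.alpha) * low_rank

layer = PhasedLowRankLinear(d_in=2048, d_out=2048, rank=256)
layer.set_alpha(0.5)             # e.g., halfway through the phase-out schedule
y = layer(torch.randn(4, 2048))  # shape: (4, 2048)
```

Once alpha reaches zero, the full-size layer no longer contributes to the output and could be dropped, leaving only the cheaper low-rank pair to train.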
Results: The authors tested both the modified and full-size transformers on 500 million tokens from the validation set, measuring perplexity (a measure of the likelihood that a model will predict the next word; lower is better). The modified version achieved 12.86 perplexity, slightly worse than the full-size version’s 12.46. However, training the modified version required more than 20 percent less processing and 14 percent less time: the modified transformer used 1.66×10^20 FLOPs and took 302 hours, while the full-size version used 2.10×10^20 FLOPs and took 352 hours.

Why it matters: Training large transformers requires a lot of computation. Low-rank approximation lightens the processing load. This work approximates a transformer’s linear layers to save memory, while the earlier GaLore approximates the gradient to save optimizer memory.

We’re thinking: The authors note that this approach also works for fine-tuning pretrained models, a potential alternative to LoRA. Simply replace each pretrained linear layer (with weights W) with two linear layers (with weights U and V), and initialize U and V such that W = UV.
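As a hedged sketch of that fine-tuning setup, the snippet below factorizes a pretrained layer with a truncated SVD, which yields the best rank-r approximation of W (exact equality only when the rank matches the layer’s dimensions). The helper name, rank, and sizes are illustrative, not taken from the paper.

```python
# Replace a pretrained nn.Linear (weights W) with two smaller linear layers
# V (d_in -> rank) and U (rank -> d_out), initialized from a truncated SVD
# so that U(V(x)) approximates W x. Illustrative sketch, not the authors' code.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                        # shape (d_out, d_in); layer computes x @ W.T + b
    U_svd, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_w = U_svd[:, :rank] * S[:rank]             # (d_out, rank), singular values folded into U
    V_w = Vh[:rank, :]                           # (rank, d_in)

    V = nn.Linear(layer.in_features, rank, bias=False)
    U = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    V.weight.data.copy_(V_w)
    U.weight.data.copy_(U_w)
    if layer.bias is not None:
        U.bias.data.copy_(layer.bias.data)
    return nn.Sequential(V, U)                   # applies V then U, so UV approximates W

pretrained = nn.Linear(1024, 1024)
factored = factorize_linear(pretrained, rank=128)
x = torch.randn(2, 1024)
err = ((pretrained(x) - factored(x)).norm() / pretrained(x).norm()).item()
print(f"relative error of the rank-128 approximation: {err:.3f}")
```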
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. To keep our newsletter from ending up in your spam folder, add our email address to your contacts list.