Dear friends,
I’m thrilled that former students and postdocs of mine won both of this year’s NeurIPS Test of Time Paper Awards. This award recognizes papers published 10 years ago that have significantly shaped the research field. The recipients included Ian Goodfellow (who, as an undergraduate, built my first GPU server for deep learning in his dorm room) and his collaborators for their work on generative adversarial networks, and my former postdoc Ilya Sutskever and PhD student Quoc Le (with Oriol Vinyals) for their work on sequence-to-sequence learning. Congratulations to all these winners!
A lesson I carry from that time is not to worry about what others think but to follow your convictions, especially if you have data to support your beliefs. Small-scale experiments performed by my Stanford group convinced me that scaling up neural networks would drive significant progress, and that’s why I was willing to ignore the skeptics. The diagram below, generated by Adam Coates and Honglak Lee, is the one that most firmed up my beliefs at that time. It shows that, for a range of models, the larger we scaled them, the better they performed. I remember presenting it at CIFAR in 2010, and if I had to pick a single reason why I pushed through to start Google Brain and set scaling up deep learning algorithms as the team’s #1 goal, it would be this diagram!
I also remember presenting our work on using GPUs to scale up neural network training at NeurIPS in 2008. (By the way, one measure of success in academia is when your work becomes sufficiently widely accepted that no one cites it anymore. I’m quite pleased that the idea that GPUs should be used for AI, which was controversial back then, is now such a widely accepted “fact” that no one bothers to cite the early papers that pushed for it. 😃)
When I started Google Brain, the thesis was simple: I wanted to use the company’s huge computing capability to scale up deep learning. Shortly afterward, I built Stanford’s first supercomputer for deep learning using GPUs, since I could move faster at Stanford than within a large company. A few years later, my team at Baidu showed that as you scale up a model, its performance improves linearly on a log-log scale, which was a precursor to OpenAI’s scaling laws.
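A straight line on a log-log plot corresponds to a power law: if loss L scales with model size N as L = a · N^(-b), then log L is a linear function of log N. The sketch below shows how one might fit such a relationship with NumPy; the data points are invented for illustration, not taken from the Baidu or OpenAI papers.

```python
import numpy as np

# Hypothetical (model size, validation loss) pairs; illustrative only.
params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = np.array([4.2, 3.1, 2.4, 1.9, 1.5])

# A power law L = a * N**(-b) is a straight line in log-log space:
# log L = log a - b * log N, so an ordinary least-squares fit recovers it.
slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"Fitted scaling law: loss ~ {a:.2f} * N^(-{b:.3f})")

# Extrapolate to a larger model -- the core bet behind scaling up.
n_new = 1e11
print(f"Predicted loss at N={n_new:.0e}: {a * n_new**(-b):.2f}")
```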
Keep learning,
Andrew
A MESSAGE FROM DEEPLEARNING.AI
In our latest short course, you’ll learn how to use OpenAI o1 for advanced reasoning in tasks like coding, planning, and image analysis. Explore tradeoffs between intelligence gains and cost, as well as techniques such as meta prompting to optimize performance. Enroll now!
News
Phi-4 Beats Models Five Times Its Size
Microsoft updated its smallest model family with a single, surprisingly high-performance model.
What’s new: Marah Abdin and a team at Microsoft released Phi-4, a 14-billion-parameter large language model that outperforms Llama 3.3 70B and Qwen 2.5 (72 billion parameters) on math and reasoning benchmarks. The model is available on Azure AI Foundry under a license that permits non-commercial use, and the weights will be released via Hugging Face next week.
How it works: Phi-4 is a transformer that processes up to 16,000 tokens of input context. The way the authors constructed the pretraining and fine-tuning datasets accounts for most of its performance advantage over other models.
Results: Of 13 benchmarks, Phi-4 outperforms Llama 3.3 70B (its most recent open weights competitor) on six and Qwen 2.5 on five.
Why it matters: Phi-4 shows that there’s still room to improve the performance of small models by curating training data, following the age-old adage that better data makes a better model.
We’re thinking: Some researchers found that earlier versions of Phi showed signs of overfitting to certain benchmarks. In their paper, the Microsoft team stressed that they had improved the data decontamination process for Phi-4 and added an appendix on their method. We trust that independent tests will show that Phi-4 is as impressive as its benchmark scores suggest.
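For context, decontamination generally means filtering pretraining documents that overlap with benchmark test items. The sketch below is a generic, hypothetical n-gram overlap check, not Microsoft’s actual pipeline; the function names and the 13-gram threshold are assumptions for illustration.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_texts: list[str], n: int = 13) -> bool:
    """Flag a training document that shares any long n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(b, n) for b in benchmark_texts)

# Usage: drop flagged documents before pretraining.
benchmark = ["example benchmark question text goes here ..."]
corpus = ["some web page text ...", "another document ..."]
clean_corpus = [d for d in corpus if not is_contaminated(d, benchmark)]
```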
Open Video Gen Closes the Gap
The gap is narrowing between closed and open models for video generation.
What’s new: Tencent released HunyuanVideo, a video generator that delivers performance competitive with commercial models. The model is available as open code and open weights for developers who have fewer than 100 million monthly users and are located outside the EU, UK, and South Korea.
How it works: HunyuanVideo comprises a convolutional video encoder-decoder, two text encoders, a time-step encoder, and a transformer. The team trained the model in stages (first the encoder-decoder, then the system as a whole) using undisclosed datasets before fine-tuning the system.
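The report describes these components only at a high level. The toy-scale sketch below illustrates how such a pipeline might compose; the module choices, shapes, and names are placeholders invented for illustration, not Tencent’s code.

```python
import torch
import torch.nn as nn

class ToyVideoGen(nn.Module):
    """Speculative composition: a video encoder-decoder, two text encoders,
    a timestep encoder, and a transformer that denoises video latents."""
    def __init__(self, latent_dim=64, text_dim=32, n_heads=4):
        super().__init__()
        # Convolutional video encoder-decoder (unused in the toy denoise step below).
        self.encoder = nn.Conv3d(3, latent_dim, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose3d(latent_dim, 3, kernel_size=4, stride=4)
        # Two text encoders, stand-ins for much larger pretrained models.
        self.text_enc_a = nn.Embedding(1000, text_dim)
        self.text_enc_b = nn.Embedding(1000, text_dim)
        self.time_enc = nn.Linear(1, latent_dim)  # timestep embedding
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.cond_proj = nn.Linear(2 * text_dim, latent_dim)

    def denoise(self, noisy_latents, t, prompt_tokens):
        # Pool both text encodings into a single conditioning vector.
        text = torch.cat([self.text_enc_a(prompt_tokens).mean(1),
                          self.text_enc_b(prompt_tokens).mean(1)], dim=-1)
        cond = (self.cond_proj(text) + self.time_enc(t)).unsqueeze(1)
        # Prepend the conditioning token, run the transformer, return denoised latents.
        return self.transformer(torch.cat([cond, noisy_latents], dim=1))[:, 1:]

model = ToyVideoGen()
latents = torch.randn(2, 16, 64)
out = model.denoise(latents, torch.rand(2, 1), torch.randint(0, 1000, (2, 8)))
print(out.shape)  # torch.Size([2, 16, 64])
```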
Results: 60 people judged responses to 1,533 text prompts by HunyuanVideo, Gen-3, and Luma 1.6. The judges preferred HunyuanVideo’s output overall. Examining the systems’ output in more detail, they preferred HunyuanVideo’s quality of motion but Gen-3’s visual quality.
Behind the news: In February, OpenAI’s announcement of Sora (which was released as this article was in production) marked a new wave of video generators that quickly came to include Google Veo, Meta Movie Gen, Runway Gen-3 Alpha, and Stability AI Stable Video Diffusion. Open source alternatives like Mochi continue to fall short of publicly available commercial video generators.
Why it matters: Research in image generation has advanced at a rapid pace, while progress in video generation has been slower. One reason may be the cost of processing, which is especially intensive when it comes to video. The growing availability of pretrained, open source video generators could accelerate the pace by relieving researchers of the need to pretrain models and enabling them to experiment with fine-tuning and other post-training for specific tasks and applications.
We’re thinking: Tencent’s open source models are great contributions to research and development in video generation. It’s exciting to see labs in China contributing high-performance models to the open source community!
Learn More About AI With Data Points!
AI news is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered updates to ChatGPT’s Projects and Canvas and its new video capabilities in Advanced Voice Mode. Read the latest issues and subscribe today!
Multimodal Modeling on the Double
Google’s Gemini 2.0 Flash, the first member of its updated Gemini family of large multimodal models, combines speed with performance that exceeds that of its earlier flagship model, Gemini 1.5 Pro, on several measures.
What’s new: Gemini 2.0 Flash processes an immense 2 million tokens of input context including text, images, video, and speech, and generates text, images, and speech. Text input/output is available in English, Spanish, Japanese, Chinese, and Hindi, while speech input/output is available in English only for now. The model can use tools, generate function calls, and respond via a real-time API, capabilities that underpin a set of pre-built agents that perform tasks like research and coding. Gemini 2.0 Flash is available for free in an experimental preview version via Google AI Studio, Google Developer API, and Gemini Chat.
How it works: Gemini 2.0 Flash (parameter count undisclosed) matches or outperforms several competing models on key benchmarks, according to Google’s report.
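For developers, tool use typically means registering functions the model can decide to call. Below is a minimal sketch using the google-generativeai Python SDK; the API key, the model ID (“gemini-2.0-flash-exp”), and the toy tool are assumptions that may differ from your environment.

```python
import google.generativeai as genai

def get_weather(city: str) -> str:
    """Toy tool the model can call; returns a canned weather report."""
    return f"Sunny and 22 C in {city}"

# Placeholder key and model ID; adjust to your own setup.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-exp", tools=[get_weather])

# Automatic function calling lets the SDK run get_weather when the model requests it.
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message("Should I bring an umbrella in Tokyo today?")
print(response.text)
```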
Agents at your service: Google also introduced four agents that take advantage of Gemini 2.0 Flash’s ability to use tools, call functions, and respond to the API in real time. Most are available via a waitlist.
Behind the news: OpenAI showed off GPT-4o’s capability for real-time video understanding in May, but Gemini 2.0 Flash beat it to the punch: Google launched the new model and its multimodal API one day ahead of ChatGPT’s Advanced Voice with Vision.
Why it matters: Speed and multimodal input/output are valuable characteristics for any AI model, and they’re especially useful in agentic applications. Google CEO Sundar Pichai said he wants Gemini to be a “universal assistant.” The new Gemini-based applications for coding, research, and video analysis are steps in that direction.
We’re thinking: While other large language models can take advantage of search, Gemini 2.0 Flash generates calls to Google Search and uses that capability in agentic tools, a demonstration of how Google’s dominance in search strengthens its efforts in AI.
When LLMs Propose Research Ideas
How do agents based on large language models compare to human experts when it comes to proposing machine learning research? Pretty well, according to one study.
What’s new: Chenglei Si, Diyi Yang, and Tatsunori Hashimoto at Stanford collected machine learning research ideas from both an agent based on Anthropic’s Claude 3.5 Sonnet and human researchers, then evaluated them using both manual and automated methods. Claude 3.5 Sonnet generated competitive proposals, but its evaluations of proposals were less compelling.
How it works: Each proposal included a problem statement, motivation, step-by-step plan, backup plan, and examples of baseline outcomes versus expected experimental outcomes.
Results: Human judges deemed proposals generated by Claude 3.5 Sonnet as good as or better than those produced by humans. However, large language models proved less effective at judging the proposals’ quality.
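One common way to quantify how well an LLM judge tracks human reviewers is rank correlation between the two sets of scores. The sketch below is generic and uses invented scores, not the study’s data.

```python
from scipy.stats import spearmanr

# Invented scores for ten proposals, rated by human reviewers and an LLM judge.
human_scores = [8, 6, 9, 4, 7, 5, 8, 3, 6, 7]
llm_scores   = [7, 7, 6, 5, 8, 6, 5, 4, 7, 6]

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
# A low correlation suggests the LLM judge ranks proposals differently than humans do.
```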
Why it matters: AI models play a growing role in scientific discovery. This work shows they can set directions for research (in machine learning, at least) that rival those set by humans. However, human evaluation remains the gold standard for comparing performance on complex problems like generating text.
We’re thinking: Coming up with good research ideas is hard! That a large language model can do it with some competency has exciting implications for the future of both AI and science.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.