Dear friends,
Happy sum(i**3 for i in range(10)) !
Despite having worked on AI since I was a teenager, I’m now more excited than ever about what we can do with it, especially in building AI applications. Sparks are flying in our field, and 2025 will be a great year for building!
One aspect of AI that I’m particularly excited about is how easy it is to build software prototypes. AI is lowering the cost of software development and expanding the set of possible applications. While it can help extend or maintain large software systems, it shines particularly in building prototypes and other simple applications quickly.
If you want to build an app that prints out flash cards for your kids (I just did this in a couple of hours with o1’s help), monitors foreign exchange rates to manage international bank accounts (a real example from DeepLearning.AI’s finance team), or analyzes user reviews automatically to quickly flag problems with your products (DeepLearning.AI's content team does this), it is now possible to build these applications quickly through AI-assisted coding.
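To give a sense of how small such a prototype can be, here is a minimal sketch along the lines of the flash-card app. This is not the exact code I generated with o1; the multiplication range, card count, and plain-text layout are illustrative choices.

```python
import random

# Minimal flash-card generator: multiplication practice cards.
# The ranges, card count, and layout below are illustrative assumptions.

def make_cards(n_cards=20, low=2, high=9, seed=None):
    """Return a list of (question, answer) pairs."""
    rng = random.Random(seed)
    cards = []
    for _ in range(n_cards):
        a, b = rng.randint(low, high), rng.randint(low, high)
        cards.append((f"{a} x {b} = ?", str(a * b)))
    return cards

if __name__ == "__main__":
    # Print questions on one "page" and answers on another, ready for printing.
    cards = make_cards(seed=2025)
    print("QUESTIONS")
    for i, (q, _) in enumerate(cards, 1):
        print(f"{i:2d}. {q}")
    print("\nANSWERS")
    for i, (_, a) in enumerate(cards, 1):
        print(f"{i:2d}. {a}")
```

From a starting point like this, an AI assistant can quickly add whatever you want next, such as nicer formatting for printing or division facts as well.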
I find AI-assisted coding especially effective for prototyping because (i) stand-alone prototypes require relatively little context and software integration and (ii) prototypes in alpha testing usually don’t have to be reliable. While generative AI also helps with engineering large, mission-critical software systems, the improvements in productivity there aren't as dramatic, because it’s challenging to give the AI system all the context it needs to navigate a large codebase and also to make sure the generated code is reliable (for example, covering all important corner cases).
Until now, a huge friction point for getting a prototype into users’ hands has been deployment. Platforms like Bolt, Replit Agent, and Vercel V0 use generative AI with agentic workflows to improve code quality, but more importantly, they also help deploy generated applications directly. (While I find these systems useful, my own workflow typically uses an LLM to design the system architecture and then generate code, one module at a time if there are multiple large modules. Then I test each module, edit the code further if needed — sometimes using an AI-enabled IDE like Cursor — and finally assemble the modules.)
Building prototypes quickly is an efficient way to test ideas and get tasks done. It’s also a great way to learn. Perhaps most importantly, it’s really fun! (At least I think it is. 😄)
How can you take advantage of these opportunities in the coming year? As you form new year resolutions, I hope you will:
- Make a learning plan! To be effective builders, we all need to keep up with the exciting changes that continue to unfold. How many short courses a month do you want to take in 2025? If you discuss your learning plan with friends, you can help each other along. For instance, we launched a learning summary page that shows what short courses people have taken. A few DeepLearning.AI team members have agreed to a friendly competition to see who can take more courses in 2025!
- Go build! If you already know how to code, I encourage you to build prototypes whenever inspiration strikes and you have a spare moment. And if you don’t yet code, it would be well worth your while to learn! Even small wins — like the flash cards I printed out, which inspired my daughter to spend an extra 20 minutes practicing her multiplication table last night — make life better. Perhaps you’ll invent something that really takes off. And even if you don’t, you’ll have fun and learn a lot along the way.
Happy New Year! Andrew
P.S. I develop mostly in Python. But if you prefer JavaScript: Happy Array.from({ length: 10 }, (_, i) => i ** 3).reduce((a, b) => a + b, 0) !
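In case the arithmetic behind those greetings isn’t obvious: both expressions sum the cubes of 0 through 9, which comes to 2025. A quick check in Python:

```python
# The cubes of 0 through 9 sum to 2025: 0 + 1 + 8 + 27 + ... + 729 = 2025.
# Equivalently, the sum of the first n cubes is (n * (n + 1) // 2) ** 2,
# and with n = 9 that gives 45 ** 2 = 2025.
print(sum(i**3 for i in range(10)))  # 2025
```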
We stand at the threshold of a new era: One in which AI systems possess striking abilities to reason about the world, grasp our wishes, and take actions to fulfill them. What will we do with these powers? We asked leaders of the field to share their hopes for the coming year. As in our previous New Year special issues, their answers offer inspiring views of what we may build and the good we can bring.
Hanno Basse: Generative AI for Artists
Stability AI’s aim is to liberate artists of all trades from the repetitive, mechanical aspects of their work and help them spend the majority of their time on the creative side. So our highest hope for next year is that generative AI will help people to be more creative and productive. In addition, I hope the AI community will focus on:
- Safety and integrity: Building safe products by embedding integrity from the earliest stages of development, ensuring the technology is used responsibly and makes a meaningful contribution to the art of storytelling.
- Accessibility: Generative AI products and tools must be accessible and usable for the broadest possible audience. Currently, much of generative AI remains accessible primarily to individuals who have advanced technical expertise, such as engineers. To address this, we need to develop much better tooling on top of foundational models, so they provide value to a diverse audience.
- Customization: Looking ahead, we expect generative AI to become increasingly specialized. Alongside large foundational models, we expect a significant rise in smaller, fine-tuned models tailored for specific and often quite narrow use cases and applications, even down to the level of a single task. This is where the true potential of generative AI will come to bear. Moreover, it is the safest and most responsible way to deploy generative AI in the real world.
Hanno Basse is Chief Technology Officer of Stability AI. Previously he served as CTO of Digital Domain, Microsoft Azure Media and Entertainment, and 20th Century Fox Film Corp.
David Ding: Generated Video With Music, Sound Effects, and Dialogue
Last year, we saw an explosion of models that generate either high-quality video or high-quality audio. In the coming year, I look forward to models that produce video clips complete with audio soundtracks including speech, music, and sound effects. I hope these models will bring a new era of cinematic creativity.
The technologies required for such cinematic video generators are in place. Several companies provide very competitive video models, and Udio and others create music models. All that’s left is to model video and audio simultaneously, including dialogue and voiceovers. (In fact, we’ve already seen something like this: Meta’s Movie Gen. Users describe a scene, and Movie Gen produces a video clip complete with a music score and sound effects.)
Of course, training such models will require extensive datasets. But I suspect that the videos used to train existing video generators had soundtracks that included these elements, so data may not be a barrier to developing these models.
Initially, these models won’t produce output that competes with the best work of professional video editors. But they will advance quickly. Before long, they’ll generate videos and soundtracks that approach Hollywood productions in raw quality, just as current image models can produce images that are indistinguishable from high-end photographs.
At the same time, the amount of control users have over the video and audio outputs will continue to increase. For instance, when we first released Udio, users couldn’t control the harmony it generated. A few months later, we launched an update that enables users to specify the key, or tonal center, so users can take an existing song and remix it in a different key. We are continuing to research ways to give users additional levers of control, such as voice, melody, and beats, and I’m sure video modeling teams are doing similar research on controllability.
Some people may find the prospect of models that generate fully produced cinematic videos unsettling. I understand this feeling. I enjoy photography and playing music, but I’ve found that image and audio generators are helpful starting points for my creative work. If I choose, AI can give me a base image that I can work on in Photoshop, or a musical composition to sample from or build on.
Or consider AI coding assistants that generate the files for an entire website. You no longer need to rely on web developers, but if you talk to them, you’ll learn that they don’t always enjoy writing the boilerplate code for a website. Having a tool that builds a site’s scaffold lets them spend their time on development tasks they find more stimulating and fun. In a similar way, you’ll be able to write a screenplay and quickly produce a rough draft of what the movie might look like. You might generate 1,000 takes, decide which one you like, and draw inspiration from that to guide a videographer and actors.
Art is all about the creative choices that go into it. Both you and I can use Midjourney to make a picture of a landscape, but if you’re an artist and you have a clear idea of the landscape you want to see, your Midjourney output will be more compelling than mine. Similarly, anyone can use Udio to make music with high production quality, but if you have good musical taste, your music will be better than mine. Video will remain an art form, because individuals will choose what their movie is about, how it looks, and how it feels — and they’ll be able to make those choices more fluidly, quickly, and interactively.
David Ding is a lifelong musician and co-founder of Udio, maker of a music-creation web app that empowers users to make original music. Previously, he was a Senior Research Engineer at Google DeepMind.
Joseph Gonzalez: General Intelligence
In 2025, I expect progress in training foundation models to slow down as we hit scaling limits and inference costs continue to rise. Instead, I hope for an explosion of innovation on top of AI, such as the rapidly developing agent stack. I hope we will see innovation in how we combine AI with tools and existing systems to deliver exciting new capabilities and create new product categories. Perhaps most of all, I am excited to see how people change in response to this new world.
We have achieved AGI. Now what? Let’s start with — and hopefully end — the longstanding debate around artificial general intelligence (AGI). I know this is controversial, but I think we have achieved AGI, at least definitionally: Our AI is now general. I will leave the longer debate about sentience and superintelligence to the philosophers and instead focus on the key innovation: generality. The artificial intelligence or machine learning of previous decades was intelligent but highly specialized. It could often surpass human ability on a narrowly defined task (such as image recognition or content recommendation). Today’s models, and perhaps more importantly the systems around them, can accomplish a very wide range of tasks, often as well as humans and in some cases better. It is this generality that will allow engineers, scientists, and artists to use these models to innovate in ways that the model developers never imagined. It is also this generality, combined with market forces, that will make 2025 so exciting.
Becoming AI-native: The generality of these models and their natural-language interfaces means that everyone can use and explore AI. And we are! We are learning to explain our situations to machines, give context and guidance, and expect personalized answers and solutions. At RunLLM, where I’m a co-founder, we’re building high-quality technical support agents. We find that users increasingly use our agents not just to solve problems but to personalize solutions to their specific tasks. We’ve also found — to our surprise — that users share much more with an AI than they would share with another person. Meanwhile, at UC Berkeley, I am impressed by students who use AI to re-explain my lectures or study from an AI-generated practice exam. They have found ways to use AI to help personalize and improve their learning experiences. In 2025, maybe we will begin to prefer AIs over humans when we need help or are trying to learn.
Across all these use cases, we’re clearly getting better at working around the limitations of large language models and using AI in ways I would not have imagined 12 months ago.
Return on AI: The focus in 2025 will turn to showing real value from past investments. Investors and enterprises will expect startups and enterprise AI teams to transition from exploring to solving real problems — reducing cost, generating revenue, improving customer experience, and so on. This is bad news for academics who need to raise research funds (DM me if you have any leftover funds from fiscal year 2024) but great news for everyone else, who will ride the wave of new AI-powered features. There will be a race to find innovative ways to incorporate AI into every aspect of a product and business. In many cases, we will see hastily executed chatbots and auto-summarization features — the first step on the AI journey. I hope these will be quickly replaced by contextual agents that adapt to users’ needs and learn from their interactions. The pandemic paved the way for remote (digital) assistants and exposed a virtually accessible workplace with the tools needed for tomorrow’s agents. These agents likely will specialize in filling roles once held by people, or maybe filling new roles created by other agents. Perhaps we will know that AI has delivered on its promise when everyone manages their own team of custom agents.
Chat is only the beginning: My hope for 2025 is that we move beyond chatting and discover how to use AI to do great things! I hope we will see AI agents that work in the background, invisibly helping us with our daily tasks. They will surface the right context as we make decisions and help us learn as the world changes. Through context and tools, they will let us know what we are missing and catch the balls we drop. We will chat less, and our AI-powered agents will accomplish more on our behalf. I look forward to the day when I can confidently step away from a keyboard and focus on the human interactions that matter.
Joseph Gonzalez is a professor at UC Berkeley, a co-founder of RunLLM, and an advisor to Genmo and Letta.
Albert Gu: More Learning, Less Data
Building a foundation model takes tremendous amounts of data. In the coming year, I hope we’ll enable models to learn more from less data.
The AI community has achieved remarkable success by scaling up transformers and datasets. But this approach may be reaching a point of diminishing returns — an increasingly widespread belief among the pretraining community as they try to train next-generation models. In any case, the current approach poses practical problems. Training huge models on huge datasets consumes huge amounts of time and energy, and we’re running out of new sources of data for training large models.
The fact is, current models consume much more data than humans require for learning. We’ve known this for a while, but we’ve ignored it due to the amazing effectiveness of scaling. It takes trillions of tokens to train a model but orders of magnitude less for a human to become a reasonably intelligent being. So there’s a difference in sample efficiency between our best models and humans. Human learning shows that there’s a learning algorithm, objective function, architecture, or a combination thereof that can learn more sample-efficiently than current models.
One of the keys to solving this problem is enabling models to produce higher-level abstractions and filter out noise. I believe this concept, and thus the general problem of data efficiency, is related to several other current problems in AI:
- Data curation: We know that the specific data we use to train our models is extremely important. It’s an open secret that most of the work that goes into training foundation models these days is about the data, not the architecture. Why is this? I think it’s related to the fact that our models don’t learn efficiently. We have to do the work ahead of time to prepare the data for a model, which may hinder the core potential of AI as an automatic process for learning from data.
- Feature engineering: In deep learning, we have always moved toward more general approaches. From the beginning of the deep learning revolution, we’ve progressively removed handcrafted features such as edge detectors in computer vision and n-grams in natural language processing. But that engineering has simply moved to other parts of the pipeline. Tokenization, for instance, involves engineering implicit features (see the toy sketch after this list). This suggests that there’s still a lot of room to make model architectures that are more data-efficient and more generally able to handle raw modalities and data streams.
- Multimodality: The key to training a model to understand a variety of data types together is figuring out the core abstractions in common and relating them to each other. This should enable models to learn from less data by leveraging all the modalities jointly, which is a core goal of multimodal learning.
- Interpretability and robustness: To determine why a model produced the output it did, the model needs to be able to form higher-level abstractions, and we need to be able to track how it captures them. The better a model is at doing this, the more interpretable it should be, the more robust it should be to noise, and likely the less data it should need for learning.
- Reasoning: Extracting higher-level patterns and abstractions should allow models to reason better over them. Similarly, better reasoning should mean less training data.
- Democratization: State-of-the-art models are expensive to build, and that includes the cost of collecting and preparing enormous amounts of data. Few players can afford to do it. This makes developments in the field less applicable to domains that lack sufficient data or wealth. Thus more data-efficient models would be more accessible and useful.
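To make the tokenization point concrete, here is a toy sketch. The subword vocabulary below is invented for illustration; real tokenizers such as BPE learn their vocabularies from data, but those units are still fixed before the model is trained, which is an implicit form of feature engineering.

```python
# Toy sketch: the same word under two hand-chosen "tokenization" schemes.
# The subword vocabulary is invented for illustration; real tokenizers
# (e.g., BPE) learn their vocabularies from data, but the resulting units
# are still fixed before the model ever trains on them.

def greedy_subword(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    vocab = sorted(vocab, key=len, reverse=True)  # prefer longer matches
    while i < len(text):
        match = next((v for v in vocab if text.startswith(v, i)), None)
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

word = "unhappiness"
print(list(word))                                     # character-level: no engineered units
print(greedy_subword(word, ["un", "happi", "ness"]))  # ['un', 'happi', 'ness']
```

Whichever scheme is chosen, the model cannot revisit that choice during training; a more data-efficient learner would ideally discover such abstractions on its own from rawer inputs.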
Considering data efficiency in light of these other problems, I believe they’re all related. It’s not clear which is the cause and which are the effects. If we solve interpretability, the mechanisms we engineer may lead to models that can extract better features and lead to more data-efficient models. Or we may find that greater data efficiency leads to more interpretable models.
Either way, data efficiency is fundamentally important, and progress in that area will be an indicator of broader progress in AI. I hope to see major strides in the coming year.
Albert Gu is an Assistant Professor of Machine Learning at Carnegie Mellon University and Chief Scientist of Cartesia AI. He appears on Time’s 2024 list of the most influential people in AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send them to thebatch@deeplearning.ai. To keep our newsletter from ending up in your spam folder, please add our email address to your contacts list.