Dear friends,
AI’s usefulness in a wide variety of applications creates many opportunities for entrepreneurship. In this letter, I’d like to share what might be a counter-intuitive best practice that I’ve learned from leading AI Fund, a venture studio that has built dozens of startups with extraordinary entrepreneurs. When it comes to building AI applications, we strongly prefer to work on a concrete idea, meaning a specific product envisioned in enough detail that we can build it for a specific target user.
Articulating a concrete idea — which is more likely than a vague idea to be wrong — takes more courage. The more specific an idea, the more likely it is to be a bit off, especially in the details. The general area of AI for livestock farming seems promising, and surely there will be good ways to apply AI for livestock. In contrast, specifying a concrete idea, such as a particular product for a particular type of farm, is scary because it is much easier to invalidate. The benefit is that the clarity of a specific product vision lets a team execute much faster. One strong predictor of how likely a startup is to succeed is the speed with which it can get stuff done. This is why founders with a clear vision are highly sought after; clarity helps drive a team in a specific direction. Of course, the vision has to be a good one, and there’s always a risk of efficiently building something that no one wants to buy! But a startup is unlikely to succeed if it meanders for too long without forming a clear, concrete vision.
Building toward something concrete — if you can do so in a responsible way that doesn’t harm others — lets you get critical feedback more efficiently and, if necessary, switch directions sooner. (See my letter on when it’s better to go with a “Ready, Fire, Aim” approach to projects.) One factor that favors this approach is the low cost of experimenting and iterating. This is increasingly the case for many AI applications, but perhaps less so for deep-tech AI projects.
I realize that this advice runs counter to common practice in design thinking, which warns against leaping to a solution too quickly, and instead advocates spending time understanding end-users, deeply understanding their problems, and brainstorming a wide range of solutions. If you’re starting without any ideas, then such an extended process can be a good way to develop good ideas. Further, keeping ideas open-ended can be good for curiosity-driven research, where investing to pursue deep tech with only a vague direction in mind can pay huge dividends over the long term.
Keep learning!
Andrew
A MESSAGE FROM DEEPLEARNING.AI

Learn how to build secure, privacy-focused federated learning systems using the Flower framework in a new two-part short course. Start with the basics in “Intro to Federated Learning,” and explore advanced techniques in “Federated Fine-tuning of LLMs with Private Data.” Enroll for free
News

Mini but Mighty

A slimmed-down version of OpenAI’s multimodal flagship packs a low-price punch.

What’s new: OpenAI released GPT-4o mini, a smaller text-image-video-audio generative model that, according to the company, generally outperforms similarly sized models from Google and Anthropic at a lower price for API access. It now underpins the free version of ChatGPT.
Behind the news: GPT-4o mini is part of a July wave of smaller large language models.
Why it matters: Powerful multimodal models are becoming ever more widely available at lower prices, creating opportunities for developers and researchers alike. GPT-4o mini sets a new standard for others to beat. Its price may be especially appealing to developers who aim to build agentic workflows that require models to process large numbers of tokens on their way to producing output.

We’re thinking: Not long ago, pushing the edge of large language models meant making them larger, with higher computing costs to drive rising parameter counts. But building bigger models has made it easier to develop smaller models that are more cost-effective and nearly as capable. It’s a virtuous circle: Costs fall and productivity rises to everyone’s benefit.
Meta Withholds Models From Europe

European users won’t have access to Meta’s multimodal models.

What’s new: Meta said it would withhold future multimodal models from the European Union (EU) to avoid being charged, banned, or fined for running afoul of the region’s privacy laws, according to Axios. (The newly released Llama 3.1 family, which processes text only, will be available to EU users.)
Apple and OpenAI in Europe: Meta is not the only global AI company that’s wary of EU technology regulations.
Why it matters: Different regions are taking different paths toward regulating AI. The EU is more restrictive than others, creating barriers for AI companies that develop new technology and products. Meta and Apple are taking proactive steps to reduce their risks even if it means forgoing portions of the European market.

We’re thinking: We hope regulators everywhere will think hard about how to strike a balance between protecting innovation and other interests. In this instance, the EU’s regulations have prompted Meta to make a decision that will likely set back European AI while delivering little benefit to citizens.
AI Investors Hoard GPU Power

Investors have been gathering AI chips to attract AI startups.

What’s new: Venture-capital firms are stockpiling high-end graphics processing units (GPUs), according to a report by The Information. They’re using the hardware to provide processing power to their portfolio companies at reduced or no cost.
Behind the news: High-end GPUs were in short supply early last year. The shortage has eased significantly, but getting access to enough processing power to train and run large models still isn’t easy. Andreessen Horowitz (A16Z) follows several other investors that have sought to fill the gap for startups.
Yes, but: David Cahn, a partner at A16Z rival Sequoia Capital, argues that stockpiling GPUs is a mistake that could leave venture funds holding large quantities of expensive, rapidly depreciating hardware. Cahn believes startups and small developers soon may have an easier time getting their hands on the processing power they need. Nvidia recently announced its new B100 and B200 GPUs, whose arrival should dampen demand for older units like the H100.

Why it matters: AI startups are hot, and venture-capital firms compete for early equity in the most promising ones. In addition to funding, they frequently offer advice, contacts, office support — and now processing power to empower a startup to realize its vision.

We’re thinking: Venture investors who use GPUs to sweeten a deal give new meaning to the phrase “bargaining chips.”
Expressive Synthetic Talking Heads

Previous systems that produce a talking-head video from a photo and a spoken-word audio clip animate the lips and other parts of the face separately. An alternative approach achieves more expressive results by animating the head as a whole.

What’s new: Sicheng Xu and colleagues at Microsoft developed VASA-1, a generative system that uses a facial portrait and a spoken-word recording to produce a talking-head video with appropriately expressive motion. You can see its output here.

Key insight: When a person speaks, the facial expression and head position change over time, while the overall shapes of the face and head don’t. By learning to represent an image via separate embeddings for facial expression and head position — which change — as well as for facial structure in its 2D and 3D aspects — which don’t — a latent diffusion model can focus on the parts of the image that matter most. (Latent diffusion is a variant of diffusion that saves computation by processing a small, learned vector of an image instead of the image itself.)

How it works: VASA-1 comprises four image encoders (three 2D CNNs and one 3D CNN), one image decoder (another 2D CNN), Wav2Vec 2.0, and a latent diffusion image generator. The authors trained the system, given an image of a face and a recorded voice, to generate a series of video frames that conform to the voice. The training set was VoxCeleb2, which includes over 1 million short videos of celebrities talking. The authors added labels for gaze direction, head-to-camera distance, and an emotional intensity score computed by separate systems.
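To make the idea of separate static and dynamic latents more tangible, here is a minimal, illustrative PyTorch sketch of a VASA-1-style pipeline. It is not the authors’ implementation: the module names, layer sizes, latent widths, 16-frame clip length, and the simplistic denoising loop are all assumptions for illustration, and random tensors stand in for Wav2Vec 2.0 audio features.

```python
# Minimal sketch of a VASA-1-style pipeline (illustrative assumptions throughout;
# not the released system). Static face latents stay fixed while a diffusion-style
# denoiser generates per-frame expression/pose latents conditioned on audio.
import torch
import torch.nn as nn

LATENT = 128  # assumed width of each latent factor

class FaceEncoder(nn.Module):
    """Encodes a portrait into separate latents: static structure vs. dynamic expression/pose."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Four heads, mirroring separate embeddings for 2D appearance,
        # 3D structure, facial expression, and head pose.
        self.heads = nn.ModuleDict({
            k: nn.Linear(64, LATENT)
            for k in ["appearance", "structure3d", "expression", "pose"]
        })

    def forward(self, image):                          # image: (B, 3, H, W)
        h = self.backbone(image)
        return {k: head(h) for k, head in self.heads.items()}

class MotionDenoiser(nn.Module):
    """Predicts expression+pose latents for T frames, conditioned on audio features."""
    def __init__(self, audio_dim=256, frames=16):
        super().__init__()
        self.frames = frames
        self.net = nn.Sequential(
            nn.Linear(2 * LATENT + audio_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 2 * LATENT),
        )

    def denoise(self, noisy_motion, audio_feat, t):    # (B, T, 2*LATENT), (B, T, audio_dim)
        t_embed = t.view(-1, 1, 1).expand(-1, self.frames, 1)
        return self.net(torch.cat([noisy_motion, audio_feat, t_embed], dim=-1))

class FrameDecoder(nn.Module):
    """Renders a frame from the static face latents plus generated motion latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4 * LATENT, 3 * 64 * 64), nn.Tanh())

    def forward(self, static, motion):
        x = torch.cat([static["appearance"], static["structure3d"], motion], dim=-1)
        return self.net(x).view(-1, 3, 64, 64)

# Toy end-to-end pass: one portrait, 16 frames of audio features (stand-in for Wav2Vec 2.0).
encoder, denoiser, decoder = FaceEncoder(), MotionDenoiser(), FrameDecoder()
portrait = torch.randn(1, 3, 128, 128)
audio_feat = torch.randn(1, 16, 256)

latents = encoder(portrait)
motion = torch.randn(1, 16, 2 * LATENT)                # start from noise
for step in reversed(range(10)):                       # crude denoising loop (illustrative)
    t = torch.full((1,), step / 10.0)
    motion = denoiser.denoise(motion, audio_feat, t)

frames = [decoder(latents, motion[:, i]) for i in range(16)]
print(len(frames), frames[0].shape)                    # 16 frames of shape (1, 3, 64, 64)
```

The point of the sketch is the separation of concerns: the portrait is encoded once into latents that don’t change over the clip, and only the compact motion latents are generated frame by frame before decoding.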
Results: The authors measured their results by training a model similar to CLIP that produces a similarity score on how well spoken audio matches a video of a person speaking (higher is better). On the VoxCeleb2 test set, their approach produced a similarity score of 0.468 compared to 0.588 for real video. The nearest contender, SadTalker, which generates lip, eye, and head motions separately, achieved a similarity score of 0.441.

Why it matters: By learning to embed different aspects of a face separately, the system maintained the face’s distinctive, unchanging features while generating appropriate motions. This also made the system more flexible at inference: The authors demonstrated its ability to extract a video’s facial expressions and head movements and apply them to different faces.

We’re thinking: Never again will we take talking-head videos at face value!
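For readers unfamiliar with CLIP-style scoring, the metric boils down to a cosine similarity between an audio embedding and a video embedding produced by contrastively trained encoders. The snippet below is only a sketch of how such a score might be computed; the encoders are assumed to exist and the embedding dimension is arbitrary, not taken from the paper.

```python
# Illustrative audio-video similarity scoring in the spirit of a CLIP-like metric
# (hypothetical embeddings and dimensions; not the paper's evaluation code).
import torch
import torch.nn.functional as F

def similarity_score(audio_embed: torch.Tensor, video_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized audio and video embeddings (higher = better match)."""
    a = F.normalize(audio_embed, dim=-1)
    v = F.normalize(video_embed, dim=-1)
    return (a * v).sum(dim=-1)

# Stand-ins for embeddings from pretrained audio / video encoders.
audio_embed = torch.randn(4, 512)   # batch of 4 clips
video_embed = torch.randn(4, 512)
print(similarity_score(audio_embed, video_embed))   # one score per audio-video pair
```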
A MESSAGE FROM WORKERA

Test, benchmark, and grow your skills with new assessments from Workera! Available domains include AI Foundations, Machine Learning, GenAI, and MLOps. Try Workera today for $0!
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.