Dear friends,
A good way to get started in AI is to begin with coursework, which offers a systematic path to knowledge, and then to work on projects. For many who hear this advice, “projects” may evoke a significant undertaking that delivers value to users. But I encourage you to set a lower bar and relish small, weekend tinkering projects that let you learn, even if they don’t result in a meaningful deliverable.
As a developer, I try to celebrate small, unique creations, too. Yes, it is nice to have beautiful software, and the impact of the output does matter. But good software is often written by people who spend many hours tinkering and building things. By building unique projects, you master key software building blocks. Then, using those blocks, you can go on to build bigger projects.
Andrew
P.S. On the heels of Microsoft’s announcement of the Copilot+ PC, which uses on-device AI optimized for a Qualcomm chip, we have a short course on on-device AI deployment, created with Qualcomm! In “Introduction to On-Device AI,” taught by Qualcomm’s Senior Director of Engineering Krishna Sridhar, you’ll deploy a real-time image segmentation model on-device and learn the key steps of on-device deployment: neural network graph capture, on-device compilation, hardware acceleration, and validating on-device numerical correctness. Please sign up here!
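As a rough preview of two of those steps, here’s a minimal PyTorch sketch of graph capture and numerical validation. The tiny model, input size, and tolerances are hypothetical stand-ins rather than the course’s actual code, and the compilation and acceleration steps are left to vendor tooling:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a real-time image segmentation model.
class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, num_classes, kernel_size=1),
        )

    def forward(self, x):
        return self.layers(x)

model = TinySegNet().eval()
example = torch.rand(1, 3, 224, 224)

# Step 1: graph capture -- trace the model into a static graph that an
# on-device compiler can consume.
traced = torch.jit.trace(model, example)

# Steps 2-3, on-device compilation and hardware acceleration, happen in
# vendor tooling that consumes the traced graph.

# Step 4: validate numerical correctness -- check that the captured
# graph's outputs match the reference model within a tolerance.
with torch.no_grad():
    torch.testing.assert_close(traced(example), model(example),
                               rtol=1e-3, atol=1e-3)
print("Captured graph matches the reference model within tolerance.")
```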
News
Faster, Cheaper Multimodality

OpenAI’s latest model raises the bar for models that can work with common media types in any combination.

How it works: GPT-4o is a single model trained on multiple media types, which enables it to process different media types, and the relationships between them, faster and more accurately than earlier GPT-4 versions, which used separate models for different media types. The context length is 128,000 tokens, equal to GPT-4 Turbo’s but well below the 2-million-token limit newly set by Google’s Gemini 1.5 Pro.
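For developers, calling the model looks much like calling earlier GPT-4 versions, except that a single request can mix modalities. A minimal sketch using the OpenAI Python SDK (the image URL is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single request mixes text and image inputs; one end-to-end model
# processes both modalities together.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe what is happening in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```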
GPT-4o significantly outperforms Gemini 1.5 Pro on several benchmarks for understanding text, code, and images, including MMLU, HumanEval, MMMU, and DocVQA. It also outperformed OpenAI’s own Whisper-large-v3 speech recognition model at speech-to-text conversion and CoVoST 2 speech translation.

Aftershocks: As OpenAI launched the new model, troubles resurfaced that had led to November’s rapid-fire ouster and reinstatement of CEO Sam Altman. Co-founder and chief scientist Ilya Sutskever, who co-led a team that focused on mitigating long-term risks, resigned. He did not give a reason for his departure; previously he had argued that Altman didn’t prioritize safety sufficiently. The team’s other co-leader, Jan Leike, followed, alleging that the company had a weak commitment to safety. The company promptly dissolved the team altogether and redistributed its responsibilities. Potential legal issues also flared when actress Scarlett Johansson, who had declined an invitation to supply her voice for a new OpenAI model, issued a statement saying that one of GPT-4o’s voices sounded “eerily” like her own and demanding to know how the artificial voice was built. OpenAI denied that it had used or tried to imitate Johansson’s voice and withdrew that voice option.

Why it matters: Competition between the major AI companies is putting more powerful models in the hands of developers and users at a dizzying pace. GPT-4o shows the value of end-to-end modeling of multimodal inputs and outputs, leading to significant steps forward in performance, speed, and cost. Faster, cheaper processing of tokens makes the model more responsive and lowers the barrier to powerful agentic workflows, while tighter integration among text, image, and audio processing makes multimodal applications more practical.

We’re thinking: Between GPT-4o, Google’s Gemini 1.5, and Meta’s newly announced Chameleon, the latest models are media omnivores. We’re excited to see what creative applications developers build as the set of tasks such models can perform continues to expand!
2 Million Tokens of Context & More

Google’s annual I/O developers’ conference brought a plethora of updates and new models.

How it works: Google launched a variety of new capabilities.
Precautionary measures: Amid the flurry of new developments, Google published protocols for evaluating safety risks. The “Frontier Safety Framework” establishes risk thresholds such as a model’s ability to extend its own capabilities, enable a non-expert to develop a potent biothreat, or automate a cyberattack. While models are in development, researchers will evaluate them continually to determine whether they are approaching any of these thresholds. If so, developers will make a plan to mitigate the risk. Google aims to implement the framework by early 2025.

Why it matters: Gemini 1.5 Pro’s expanded context window enables developers to apply generative AI to multimedia files and archives that are beyond the capacity of other models currently available — corporate archives, legal testimony, feature films, shelves of books — and supports prompting strategies such as many-shot learning (sketched below). Beyond that, the new releases address a variety of developer needs and preferences: Gemini 1.5 Flash offers a lightweight alternative where speed or cost is at a premium, Veo appears to be a worthy competitor for OpenAI’s Sora, and the new open models give developers powerful options.

We’re thinking: Google’s quick iteration on its Gemini models is impressive. Gemini 1.0 was announced less than six months ago. White-hot competition among AI companies is giving developers more choices, faster speeds, and lower prices.
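Many-shot learning extends few-shot prompting by packing hundreds or thousands of worked examples into the prompt, which only becomes practical with a context window this large. A minimal, hypothetical sketch of building such a prompt (the task and examples are placeholders):

```python
# Hypothetical many-shot prompt builder for a sentiment-labeling task.
# With a 2-million-token window, the examples list could hold thousands
# of entries rather than the few-shot handful.
labeled_examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds. Flawless.", "positive"),
    # ... hundreds or thousands more examples fit within the window.
]

def build_many_shot_prompt(examples, query):
    """Concatenate every worked example, then append the new query."""
    shots = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{shots}\nReview: {query}\nSentiment:"

prompt = build_many_shot_prompt(labeled_examples,
                                "Crashes every time I open it.")
# The resulting prompt would then be sent to a long-context model such
# as Gemini 1.5 Pro through its API.
print(prompt)
```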
NEW FROM DEEPLEARNING.AI

In our new short course “Introduction to On-Device AI,” made in collaboration with Qualcomm, you’ll learn to deploy AI models on edge devices, using local compute for faster inference and greater privacy. Join the next wave of AI as models go beyond the cloud! Enroll for free
Music Titan Targets AI

The world’s second-largest music publisher accused AI developers of potential copyright violations.

How it works: In a statement posted on the company’s website and in letters to developers, Sony forbade the use of its music and other media, such as lyrics, music videos, and album art, for “training, developing, or commercializing any AI systems.”
Behind the news: In April, more than 200 music artists called for streaming services and AI developers to stop using their work for training and to stop generating music in the styles of specific musicians without compensation. Universal Music Group (UMG), Sony Music’s top competitor, has also opposed unrestricted AI-generated music. Last year, UMG ordered Apple Music and Spotify to block AI developers from downloading its recordings and issued takedown notices to YouTube and Spotify uploaders who generated music that sounded like artists under contract to Universal.

Why it matters: Sony Music Group’s warning comes as generated audio approaches a level of quality that might attract a mainstream audience, and it could chill further progress. Although it is not yet clear whether training AI systems on music recordings without permission violates copyrights, Sony Music Group has demonstrated its willingness to pursue both individuals and companies for alleged copyright violations. The company accounted for 22 percent of the global music market in 2023. (UMG accounted for 32 percent.) Its catalog includes many of the world’s most popular artists, including AC/DC, Adele, Celine Dion, and Harry Styles.

We’re thinking: We believe that AI developers should be allowed to let their software learn from data that’s freely available on the internet, but uncertainty over the limits of copyright protection isn’t good for anyone. It’s high time to update intellectual property laws for the era of generative AI.
Interpreting Image Edit Instructions

The latest text-to-image generators can alter images in response to a text prompt, but their outputs often don’t accurately reflect the text. They do better if, in addition to a prompt, they’re told the general type of alteration they’re expected to make.
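As a loose illustration of that idea, a model might condition its output on a learned embedding of the task type alongside the instruction text. This sketch is hypothetical, not the authors’ actual architecture; the dimensions and task labels are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: condition an image-editing model on the instruction
# text *and* a learned task-type embedding (e.g., add object, remove
# object, change style), rather than on the instruction alone.
NUM_TASK_TYPES = 8   # assumed number of alteration types
EMBED_DIM = 512      # assumed feature dimension

task_embeddings = nn.Embedding(NUM_TASK_TYPES, EMBED_DIM)

def build_condition(text_features: torch.Tensor, task_id: int) -> torch.Tensor:
    """Concatenate instruction features with the task-type embedding."""
    task_vec = task_embeddings(torch.tensor([task_id]))
    return torch.cat([text_features, task_vec], dim=-1)

# Example: one encoded instruction, task 2 = "remove object" (assumed label).
condition = build_condition(torch.rand(1, EMBED_DIM), task_id=2)
print(condition.shape)  # torch.Size([1, 1024]) -- fed to the generator
```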
Results: Judges compared altered images produced by the authors’ method (Emu Edit), InstructPix2Pix, and MagicBrush using the MagicBrush test set. Evaluating how well the generated images aligned with the instructions, the judges preferred Emu Edit over InstructPix2Pix 71.8 percent of the time and over MagicBrush 59.5 percent of the time. Evaluating how well the generated images preserved elements of the input images, they preferred Emu Edit over InstructPix2Pix 71.6 percent of the time and over MagicBrush 60.4 percent of the time.
A MESSAGE FROM FOURTHBRAIN

Join FourthBrain’s two live workshops next week! In these interactive sessions, you’ll build useful applications with large language models and walk away with practical skills. Enroll as an individual or register as a team for a group discount. Learn more
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.