Dear friends,
Contrary to standard prompting advice that you should give LLMs the context they need to succeed, I find it’s sometimes faster to be lazy and dash off a quick, imprecise prompt and see what happens. The key to whether this is a good idea is whether you can quickly assess the output quality, so you can decide whether to provide more context. In this post, I’d like to share when and how I use “lazy prompting.”
I don’t try a lazy prompt if (i) I feel confident there’s no chance the LLM will provide a good solution without additional context. For example, given only a partial program spec, would even a skilled human developer have a chance of understanding what you want? If I absolutely want to use a particular piece of PDF-to-text conversion software (like my team LandingAI’s Agentic Doc Extraction!), I should say so in the prompt, since otherwise it’s very hard for the LLM to guess my preference. I also wouldn’t use a lazy prompt if (ii) a buggy implementation would take a long time to detect. For example, if the only way for me to figure out whether the output is incorrect is to laboriously run the code to check its functionality, it’s better to spend the time up front to provide context that increases the odds of the LLM generating what I want.
By the way, lazy prompting is an advanced technique. On average, I see more people giving too little context to LLMs than too much. Laziness is a good technique only when you’ve learned how to provide enough context and then deliberately step back to see how little context you can get away with and still have it work. Also, lazy prompting applies only when you can iterate quickly using an LLM’s web or app interface. It doesn’t apply to prompts written in code for the purpose of repeatedly calling an API, since presumably you won’t be examining every output to clarify and iterate if the output is poor.
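To make the distinction concrete, here’s a minimal sketch in Python. The `call_llm` helper and the prompt wording are hypothetical placeholders, not any particular provider’s API.

```python
# Minimal sketch contrasting interactive "lazy" prompting with prompts that
# run unattended in code. `call_llm` is a hypothetical placeholder, not any
# particular provider's API.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., via a provider's SDK)."""
    raise NotImplementedError

# Interactive use: a lazy prompt is fine because a human inspects the output
# and can follow up with more context if the first attempt misses the mark.
lazy_prompt = "Convert this PDF to markdown."

# Programmatic use: the same call runs repeatedly with no human in the loop,
# so the prompt should spell out the context up front.
detailed_prompt_template = (
    "Convert the following document text to markdown.\n"
    "- Preserve headings and tables.\n"
    "- Return only markdown, no commentary.\n\n"
    "Document:\n{document_text}"
)

def batch_convert(documents: list[str]) -> list[str]:
    # Every output is consumed downstream without review, so there is no
    # chance to clarify and iterate on a vague prompt.
    return [
        call_llm(detailed_prompt_template.format(document_text=doc))
        for doc in documents
    ]
```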
Keep building!
Andrew
A MESSAGE FROM DEEPLEARNING.AI

In our latest course, “Getting Structured LLM Output,” you’ll learn to generate consistent, machine-readable outputs from LLMs using structured output APIs, re-prompting libraries, and token-level constraints. You’ll build a social media analysis agent that extracts sentiment and creates structured JSON ready for downstream use. Enroll for free
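As a rough illustration of the re-prompting idea (not the course’s actual code), here’s a sketch that validates an LLM’s JSON output and retries on failure; `call_llm` and the schema are hypothetical placeholders.

```python
# Sketch of structured output via validation and re-prompting.
# `call_llm` is a placeholder for a real LLM call; the schema is illustrative.

import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def extract_sentiment(post: str, max_retries: int = 3) -> dict:
    prompt = (
        "Analyze the sentiment of this social media post and return JSON "
        'with keys "sentiment" ("positive", "negative", or "neutral") and '
        '"confidence" (a number between 0 and 1). Return only the JSON.\n\n'
        f"Post: {post}"
    )
    for _ in range(max_retries):
        reply = call_llm(prompt)
        try:
            data = json.loads(reply)
            if data.get("sentiment") in {"positive", "negative", "neutral"}:
                return data
        except json.JSONDecodeError:
            pass
        # Re-prompt with a reminder when the output isn't valid JSON.
        prompt += "\n\nYour previous reply was not valid. Return only the JSON object."
    raise ValueError("Could not obtain structured output.")
```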
News
Interactive Voice-to-Voice With Vision

Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.

What’s new: Amélie Royer, Moritz Böhle, and colleagues at Kyutai proposed MoshiVis. The weights are free to download under the CC-BY 4.0 license, which permits commercial and noncommercial uses. You can hear examples of its output and chat with a demo.

Key insight: The original Moshi, which manages overlapping voice-to-voice conversations, comprises two transformers: the first outputs a text transcription of its speech, and the second outputs speech. Since Moshi generates text as well as speech, the authors of that work fine-tuned it to predict the next token of text. In MoshiVis, adding a vision encoder enabled the authors to fine-tune not only on image-speech datasets, which are relatively scarce, but also on more plentiful image-text datasets. Fine-tuning on this wider variety of images enabled the system to understand images better than fine-tuning solely on image-speech datasets would have.

How it works: To Moshi, the authors added a model based on a pretrained SigLIP vision encoder to encode images, a cross-attention adapter to fuse image information with speech tokens, and vanilla neural networks trained to act as gates that determine how much image information to fuse. Specifically, the authors added the adapter and a gate between Moshi’s existing self-attention and fully connected layers.
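As a rough sketch of the kind of gated cross-attention adapter described above (dimensions and details are illustrative, not Kyutai’s implementation), image information might enter the speech stream like this:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Sketch of a cross-attention adapter with a learned gate, inserted
    between a transformer block's self-attention and feed-forward layers.
    Illustrative only, not Kyutai's implementation."""

    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(d_image, d_model)  # project vision-encoder features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Small network acting as a gate that decides how much image information to mix in.
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, speech_tokens: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # speech_tokens: (batch, seq_len, d_model) hidden states after self-attention
        # image_feats:   (batch, n_patches, d_image) from a frozen vision encoder
        img = self.img_proj(image_feats)
        fused, _ = self.cross_attn(query=speech_tokens, key=img, value=img)
        g = self.gate(speech_tokens)        # per-token mixing weight in [0, 1]
        return speech_tokens + g * fused    # gated residual fusion
```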
Results: MoshiVis is highly responsive in conversation with latency of roughly 50 milliseconds on a Mac Mini.
Behind the news: MoshiVis complements a small but growing roster of systems that combine vision with speech-to-speech. ChatGPT accepts and generates speech in response to camera views or a user’s phone screen. AnyGPT (open weights, training, and inference code) accepts or generates speech, text, images, and music. Similarly, Mini-Omni2 (open weights and inference code) accepts and generates text, speech, and images. The authors didn’t compare MoshiVis to these alternatives.

Why it matters: MoshiVis easily adapts a speech-to-speech model to work with a new type of media input. It requires training only the adapters, while the earlier AnyGPT and Mini-Omni2, which can also discuss images via voice input and output, require training both adapters and the main model.

We’re thinking: Text-chat models respond appropriately whether a user refers to a previous topic or raises something new, and MoshiVis does, too, in spoken interactions. Evaluations of this capability will become increasingly important as voice-to-voice becomes more widespread.
Scraping the Web? Beware the Maze

Bots that scrape websites for AI training data often ignore do-not-crawl requests. Now web publishers can enforce such requests by luring scrapers to AI-generated decoy pages.

What’s new: Cloudflare launched AI Labyrinth, a bot-management tool that serves fake pages to unwanted bots, wasting their computational resources and making them easier to detect. It’s currently free to Cloudflare users.

How it works: AI Labyrinth protects webpages by embedding hidden links to AI-generated pages that appear legitimate to bots but are irrelevant to the protected site.
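The core idea can be sketched in a few lines of Python. The handlers and helpers below are hypothetical, not Cloudflare’s implementation:

```python
# Illustrative sketch of the decoy-link idea, not Cloudflare's implementation.
# A page embeds links that are invisible to human visitors; clients that
# follow them are likely bots ignoring robots.txt and can be flagged.

import secrets

DECOY_PATHS = set()

def decoy_link_html() -> str:
    """Generate a hidden link to a decoy page and remember its path."""
    path = f"/maze/{secrets.token_hex(8)}"
    DECOY_PATHS.add(path)
    # Hidden from human visitors and disallowed in robots.txt in this sketch,
    # so only non-compliant crawlers tend to follow it.
    return f'<a href="{path}" style="display:none" rel="nofollow">related reading</a>'

def handle_request(path: str, client_ip: str) -> str:
    if path in DECOY_PATHS:
        flag_as_bot(client_ip)                   # record the scraper for later blocking
        return generate_plausible_filler_page()  # filler page with more decoy links
    return serve_real_page(path)

def flag_as_bot(client_ip: str) -> None:
    print(f"likely scraper: {client_ip}")

def generate_plausible_filler_page() -> str:
    return "<html><body>" + decoy_link_html() + "</body></html>"

def serve_real_page(path: str) -> str:
    return f"<html><body>Real content for {path}</body></html>"
```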
Behind the news: The robots.txt instructions that tell web crawlers which pages they can access aren’t legally binding, and web crawlers can disregard them. However, online publishers are trying to stop AI developers from training models on their content. Cloudflare, as the proxy server and content delivery network for nearly 20 percent of websites, plays a potentially large role in this movement. AI crawlers account for nearly 1 percent of web requests on Cloudflare’s network, the company says.

Why it matters: The latest AI models are trained on huge quantities of data gleaned from the web, which enables them to perform well enough to be widely useful. However, publishers increasingly aim to limit access to this data. AI Labyrinth gives them a new tool that raises the cost for bots that disregard instructions not to scrape web content.

We’re thinking: If AI Labyrinth gains traction, no doubt some teams that build crawlers will respond with their own AI models to sniff out its decoy pages. As long as the interests of crawlers and publishers are misaligned, and clear, enforceable rules for crawling are lacking, this cat-and-mouse competition could go on for a long time.
Learn More About AI With Data Points!

AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six very brief news stories. This week, we covered Gemini 2.5 Pro taking the top spot on key benchmarks and adapting R1-like techniques to video reasoning. Subscribe today!
Chatbot Use Creates Emotional Bonds

A pair of papers investigates how increasingly human-like chatbots affect users’ emotions.

What’s new: Jason Phang at OpenAI, Cathy Mengying Fang at MIT Media Lab, and colleagues at those organizations published complementary studies that examine ChatGPT’s influence on loneliness, social interactions, emotional dependence, and potentially problematic use.

How it works: One study was a large-scale analysis of real-world conversations, and the other was a randomized controlled trial that tracked conversations of a selected cohort. Both evaluated conversations according to EmoClassifiersV1, a set of classifiers based on large language models that evaluate five top-level emotional classes (loneliness, dependence, and the like) and 20 sub-classes of emotional indicators (seeking support, use of pet names, and so on).
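A minimal sketch of an LLM-based indicator classifier in this spirit might look like the following; the class names, prompts, and `call_llm` helper are placeholders, not the papers’ actual EmoClassifiersV1:

```python
# Rough sketch of LLM-based emotional-indicator classifiers, in the spirit of
# the hierarchy described above. Names, prompts, and `call_llm` are placeholders.

SUB_CLASSIFIERS = {
    "seeking_support": "Does the user seek emotional support from the assistant?",
    "pet_names": "Does either party use pet names or terms of endearment?",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def classify_conversation(excerpt: str) -> dict[str, bool]:
    """Run each sub-classifier prompt over a conversation excerpt."""
    results = {}
    for name, question in SUB_CLASSIFIERS.items():
        prompt = (
            f"{question}\n\nConversation excerpt:\n{excerpt}\n\n"
            "Answer with exactly 'yes' or 'no'."
        )
        results[name] = call_llm(prompt).strip().lower().startswith("yes")
    return results
```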
Results: Both studies found that using ChatGPT was associated with reduced loneliness and increased emotional chat. However, it was also associated with decreased interpersonal social interaction and greater dependence on the chatbot, especially among users who spent more time chatting.

Yes, but: The authors of the randomized controlled trial acknowledged significant limitations. For instance, the study lacked a non-ChatGPT control group to differentiate AI-specific effects from influences such as seasonal emotional shifts, and the trial’s time frame and assignments may not mirror real-world behavior.

Why it matters: As AI chatbot behavior becomes more human-like, people may lean on large language models to satisfy emotional needs such as easing loneliness or grief. Yet we know little about their effects. These studies offer a starting point for AI developers who want to both foster emotional support and protect against over-reliance, and for social scientists who want to better understand the impact of chatbots.

We’re thinking: Social media turned out to cause emotional harm to some people in ways that were not obvious when the technology was new. As chatbots evolve, research like this can help us steer them toward protecting and enhancing mental health.
Human Action in 3D

AI systems designed to generate animated 3D scenes that include active human characters have been limited by a shortage of training data, such as matched 3D scenes and human motion-capture examples. Generated video clips can get the job done without motion capture.

What’s new: A team led by Hongjie Li, Hong-Xing Yu, and Jiaman Li at Stanford University developed Zero-Shot 4D Human-Scene Interaction (ZeroHSI), a method that animates a 3D human figure interacting with a particular 3D object in a selected 3D scene. You can see its output here.

Key insight: Earlier work attempted to build a generalized approach: given a 3D scene, a text prompt, and motion-capture data, a diffusion model learned to alter the positions and rotations of human joints and objects over time. But if the system is designed to learn a 3D animation for a specific example motion, videos can stand in for motion capture. Current video generation models can take an image of a scene and generate a clip of realistic human motion and interactions with a wide variety of objects within it. From there, we can minimize the difference between the generated video frames and rendered images of the action within the scene.

How it works: ZeroHSI takes a pre-built 3D scene that includes a 3D human mesh and 3D object. It uses a rendered image of the scene to generate a video. Then it uses the video to help compute the motions of a human figure and object within the scene.
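A conceptual sketch of that fitting step, assuming a differentiable `render` function and per-frame pose parameters (both placeholders, not the authors’ code):

```python
import torch

# Conceptual sketch of fitting pose parameters to a generated video, in the
# spirit described above. `render` and `poses` are placeholders for a
# differentiable renderer and human/object pose parameters; not the authors' code.

def fit_to_video(video_frames: torch.Tensor, render, poses: torch.Tensor,
                 steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """video_frames: (T, H, W, 3) frames from the video generation model.
    render(poses) -> (T, H, W, 3) differentiable renders of the 3D scene
    with the human figure and object posed per frame."""
    poses = poses.clone().requires_grad_(True)
    opt = torch.optim.Adam([poses], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render(poses)
        # Minimize the difference between rendered frames and the video frames.
        loss = torch.nn.functional.mse_loss(rendered, video_frames)
        loss.backward()
        opt.step()
    return poses.detach()
```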
Results: The authors evaluated ZeroHSI using a proprietary dataset of 12 3D scenes, each of which included a human figure and an object, along with one to three text prompts that described interactions between the human and the object and/or the scene. In 100 evaluations, ZeroHSI outperformed LINGO, a diffusion model trained on matched 3D scene, 3D object, and human motion-capture data that had achieved the previous state of the art.
Why it matters: Learning from motion-capture data is problematic in a couple of ways: (i) it’s expensive to produce, and (ii) relatively little of it is available, which limits how much a learning algorithm can generalize from it. Video data, on the other hand, is available in endless variety, enabling video generation models to generalize across a wide variety of scenes, objects, and motions. ZeroHSI takes advantage of generated video to guide a 3D animation cheaply and effectively.

We’re thinking: There’s a lot of progress to be made in AI simply by finding clever ways to use synthetic data.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.