Dear friends,
Fine-tuning small language models has been gaining traction over the past half year. I’d like to share my sense of when to use this technique, and also when not to, based on what I’m seeing in multiple companies.
Improving accuracy of critical applications. Prompting can get you really far for many applications. But sometimes, fine-tuning helps eke out that last bit of accuracy. For example, if you are building a customer service chatbot and need it to call the right API reliably (say, to carry out transactions, issue refunds, and the like), perhaps prompting can get it to make the right API call 95% of the time. But if you struggle to raise the accuracy even with revisions to the prompt and you really need 99% accuracy, fine-tuning on a dataset of conversations and API calls might be a good way to get you there. This is particularly true for tasks where it's hard to specify, using only language, an unambiguous rule to decide what to do. For example, when a customer is frustrated, should the chatbot escalate to a manager or just issue a refund? Teams often write Standard Operating Procedures (SOPs) for human workers to follow, and these SOPs can go into the prompts of models. But if it is hard to specify an unambiguous SOP, such that even humans need to see numerous examples before they can learn what to do, fine-tuning can be a good approach. Fine-tuning also works well for many text-classification applications, such as classifying medical records into diagnosis and procedure codes for health insurance claims.
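To make the chatbot scenario concrete, here is a minimal sketch of supervised fine-tuning using Hugging Face's TRL library. The model ID, data file, and chat format below are illustrative assumptions, not a specific recommendation:

```python
# Minimal sketch of supervised fine-tuning for reliable API calling.
# The model ID, data file, and conversation format are hypothetical.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each line of the JSONL file pairs a customer conversation with the
# correct API call, e.g. {"messages": [{"role": "user", "content": "..."},
# {"role": "assistant", "content": "refund(order_id=...)"}]}.
dataset = load_dataset("json", data_files="support_conversations.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-1b-it",  # any small open-weights model
    train_dataset=dataset,
    args=SFTConfig(output_dir="api-call-model", num_train_epochs=3),
)
trainer.train()
```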
Reducing latency or cost during scale-ups. I’ve seen applications where developers have successfully prompted a large model to perform a complex task. But as usage scales up, if the large model is too slow (which often happens) or too expensive (which also happens but less frequently), the team might want to use a smaller model. If, however, the performance of the smaller model isn't good enough, then fine-tuning it can help bring it up to the performance of the larger one for that narrow application. Further, the larger model (or perhaps an agentic workflow) can also be used to generate data to help with fine-tuning the small model for that task.
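Here is a rough sketch of that data-generation step, in which a large model writes training examples for the small one. The client usage follows the OpenAI SDK; the model name, scenarios, and prompt are illustrative:

```python
# Sketch: use a large model to synthesize training data for a smaller one.
# The model name, scenario list, and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()
examples = []
for scenario in ["late delivery", "duplicate charge", "damaged item"]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a short customer-service dialog about a {scenario}, "
                       "ending with the correct API call in JSON.",
        }],
    )
    examples.append({"text": response.choices[0].message.content})

# Save as JSONL; TRL's SFTTrainer accepts a plain "text" column.
with open("synthetic_training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

In practice, you'd generate many more examples and filter them for correctness before fine-tuning.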
Keep learning!
Andrew
A MESSAGE FROM DEEPLEARNING.AI

In “Vibe Coding 101 with Replit,” you’ll learn to plan, prompt, and debug alongside a coding agent. Build, host, and share two real web apps in Replit’s cloud environment while honing development skills like writing product requirements, structuring tasks, and refining AI-generated code. Start today
News
Vision-Language, Compact and Open

Google updated its open-weights family of large language models to include versions that handle image and video inputs.

What’s new: Google released its Gemma 3 multilingual large language models with parameter counts of 1 billion, 4 billion, 12 billion, and 27 billion. While the smallest processes text only, the other three are vision-language models that are small enough to run on consumer hardware.
How it works: Gemma 3 rearchitects and refines earlier Gemma models for higher performance at lower parameter counts.
Performance: Gemma 3 models outperform Gemma 2 models of equal or larger size by several measures, and all sizes show a strong ability to solve mathematics word problems as measured by the MATH benchmark.
Hot on Gemma 3’s heels: Shortly after Gemma 3 became available, Mistral released Small 3.1 (24 billion parameters), a vision-language model with open weights, under a more permissive Apache 2.0 license.
Why it matters: Gemma 3 takes advantage of a variety of techniques to raise the bar for vision-language performance in relatively small models. Knowledge distillation, multiple rounds of reinforcement learning, and fine-tuning on many languages are a powerful combination.

We’re thinking: A vision-language model small enough to run on a smartphone feels increasingly close!
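For readers who want to try one of the vision-language checkpoints, here is a sketch of image-plus-text inference via Hugging Face Transformers. The model ID and message format follow the published model cards, but check them for current details:

```python
# Sketch of vision-language inference with a Gemma 3 checkpoint.
# Requires a recent version of Transformers; the image URL is a placeholder.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
print(pipe(text=messages, max_new_tokens=50))
```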
Better Images in Fewer Steps

Diffusion models usually take many noise-removal steps to produce an image, which takes time at inference. There are ways to reduce the number of steps, but the resulting systems are less effective. Researchers devised a streamlined approach that doesn’t sacrifice output quality.

Key insight: At inference, a scheduler like Euler can enable a model to take larger steps than those it learned during training, but this approach degrades performance. Alternatively, distillation, in which a student model learns to remove in one step as much noise as a teacher model removes in several, improves performance at the cost of more cumbersome development. Training the model directly to take bigger steps that are equivalent to multiple smaller steps enables it to maintain high performance while taking fewer steps.

How it works: The authors trained DiT-B, a diffusion transformer, to generate images like those in CelebA-HQ (celebrity faces) and ImageNet-256 (various subjects, 256x256 pixels).
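The core training idea can be sketched as a self-consistency loss: the model, conditioned on a step size, learns to make one big step match the composition of two of its own smaller steps. The sketch below assumes a velocity-prediction (flow-matching style) parameterization; the authors’ exact formulation and scheduling may differ:

```python
# Sketch of training a denoiser to take one step of size 2d that matches
# two of its own steps of size d. Assumes the model predicts a velocity
# conditioned on time t and step size; details are illustrative.
import torch

def big_step_loss(model, x_t, t, d):
    # Direct prediction for one large step of size 2d.
    v_big = model(x_t, t, 2 * d)
    # Target: compose two small steps of size d (no gradient through target).
    with torch.no_grad():
        v1 = model(x_t, t, d)
        x_mid = x_t + d * v1      # advance along the first small step
        v2 = model(x_mid, t + d, d)
        v_target = (v1 + v2) / 2  # average velocity over the combined step
    return ((v_big - v_target) ** 2).mean()
```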
Results: The authors compared their model using 1, 4, or 128 steps to alternatives that were trained via various methods including many variants of distillation. They measured the results using Fréchet inception distance (FID), which assesses how closely generated images resemble real-world images (lower is better).
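For reference, FID fits Gaussians to Inception-v3 features of real and generated images, with means μ_r, μ_g and covariances Σ_r, Σ_g, and computes the distance between them:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)
```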
Why it matters: Generating images by diffusion is typically costly, and previous approaches to cutting the cost have compromised performance, incurred additional development expense, or both. This method achieves high performance at relatively low cost.
Learn More About AI With Data Points!

AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered an AI-powered weather predictor and a model that combines vision and speech. Subscribe today!
LLM Support for Tutors

Students benefit from tutoring, but training tutors is expensive. A study shows that large language models can boost tutors’ effectiveness in real time.
Results: The authors partnered with a virtual tutoring company and a school district in the United States for a two-month study of 874 tutors and 1,787 students in grades 3 through 8. They divided the participants into two groups. In one group, tutors conducted sessions with students as usual. In the other, tutors had access to Tutor CoPilot, an LLM-based assistant that offers suggestions during tutoring sessions. The authors measured success by the percentage of students who passed a test at the end of a lesson.
Yes, but: The authors found statistically significant improvements as measured by test results per lesson, but not in end-of-year exam results. The study’s two-month duration may account for the lack of evidence for longer-term effects.
Faster Learning for Diffusion Models

Diffusion transformers learn faster when they can look at embeddings generated by a pretrained model like DINOv2.

What’s new: Sihyun Yu and colleagues at Korea Advanced Institute of Science and Technology, Korea University, New York University, and Scaled Foundations (a startup that builds AI for robotics) proposed Representation Alignment (REPA), a loss term for transformer-based diffusion models.

Key insight: Diffusion models learn to remove noise from images to which noise was added (and, at inference, they start with pure noise to generate a fresh image). This process can be divided into two parts: learning to (i) embed the noisy image and (ii) estimate the noise from the embedding. One way to accelerate learning is to add a loss term that encourages the diffusion model to produce embeddings similar to those produced by a pretrained embedding model. The diffusion model can learn to estimate the noise faster if it doesn’t need to learn how to embed an image from scratch.

How it works: The authors modified DiT-XL/2 and SiT-XL/2, transformer-based latent diffusion models (a class of diffusion models that subtract noise from embeddings rather than images). They trained the models to produce images similar to those in ImageNet. In the process, the modified models learned to produce embeddings similar to those produced by a pretrained DINOv2.
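A rough sketch of such an alignment term, assuming a learned projection from the diffusion transformer’s hidden states into DINOv2’s embedding space and a cosine-similarity objective (the layer choice and weighting here are illustrative, not the authors’ exact configuration):

```python
# Sketch of a REPA-style alignment loss. The projection head, similarity
# measure, and weighting below are illustrative assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(diffusion_hidden, dino_embed, proj):
    # Map the diffusion model's intermediate tokens into DINOv2's space.
    h = F.normalize(proj(diffusion_hidden), dim=-1)  # (batch, tokens, dino_dim)
    z = F.normalize(dino_embed, dim=-1)              # frozen DINOv2 patch embeddings
    # Maximize cosine similarity between corresponding tokens.
    return -(h * z).sum(dim=-1).mean()

# Combined objective: total = denoising_loss + lambda_align * alignment_loss(...)
```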
Results: The modified DiT-XL/2 learned significantly faster than the unmodified version.
Why it matters: Diffusion models and contrastive self-supervised models like DINOv2 have fundamentally different training objectives: One produces embeddings for the purpose of image generation, while the other’s embeddings are used for tasks like classification and semantic segmentation. Consequently, they learn different aspects of data. This work proposes a novel way to combine these approaches to produce more generally useful embeddings.

We’re thinking: It turns out that the REPA modification enabled diffusion models to produce embeddings better suited not only to diffusion but also to image classification and segmentation. A similar approach could lead to a more holistic framework for learning image representations.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.