Dear friends,
Happy 2026! Will this be the year we finally achieve AGI? I’d like to propose a new version of the Turing Test, which I’ll call the Turing-AGI Test, to see if we’ve achieved this. I’ll explain in a moment why having a new test is important.
The original Turing Test, which required a computer to fool a human judge, via text chat, into being unable to distinguish it from a human, has proven insufficient as an indicator of human-level intelligence. The Loebner Prize competition, which actually ran the Turing Test, found that being able to simulate human typing errors — perhaps even more than demonstrating intelligence — was what it took to fool judges. A main goal of AI development today is to build systems that can do economically useful work, not fool judges. Thus a modified test that measures the ability to do work would be more useful than a test that measures the ability to fool humans.
For almost all AI benchmarks today (such as GPQA, AIME, SWE-bench, etc.), a test set is determined in advance. This means AI teams end up at least indirectly tuning their models to the published test sets. Further, any fixed test set measures only one narrow sliver of intelligence. In contrast, in the Turing Test, judges are free to ask any question to probe the model as they please. This lets a judge test how “general” the knowledge of the computer or human really is. Similarly, in the Turing-AGI Test, the judge can design any experience — which is not revealed in advance to the AI (or human subject) being tested. This is a better way to measure generality of AI than a predetermined test set.
Happy New Year, and have a great year building!
Andrew
Agents of 2026
The pieces are in place: AI models have gained the ability to generate coherent text, images, videos, and other data; draw upon proprietary databases; and navigate the web and take actions online. Get ready for a Cambrian Explosion of intelligent applications that help us live better lives and steward our organizations and communities. In this special issue of The Batch, as in previous New Year issues, some of the brightest minds in AI share their hopes for what comes next.
Open Source Wins
by David Cox
My hope is that open AI continues to flourish and ultimately wins.
We know IBM has a reputation for being boring. But boring can actually be good. Boring is stable; it’s a foundation you can build on. IBM is also a little weird. That stable foundation actually lets you do weird things without them falling apart. Let’s make AI more open, more weird, and maybe a little more boring in 2026.
AI for Scientific Discovery
by Adji Bousso Dieng
In 2026, I hope AI will transition from being a tool for efficiency to a catalyst for scientific discovery.
For the last decade, the dominant paradigm in deep learning has been interpolation. We have built incredibly powerful models that excel at mimicking the distribution of their training data. This is perfect for the applications where AI shines right now, such as conversational agents and coding assistants, where a query can be answered by identifying statistical patterns in existing data. This paradigm has even led to successful applications that meet scientific challenges that can be formulated as supervised learning problems, such as AlphaFold.
However, within that paradigm, models struggle with the rarest examples, the tails of the data distribution. For instance, in our work with the Vendiscope, a tool we developed to audit data collections, we found that even AlphaFold struggles to predict the 3D structures of rare proteins. Furthermore, many grand challenges in the physical sciences, from designing de novo proteins to discovering novel metal-organic frameworks (MOFs) that capture CO2 from the atmosphere, cannot be framed as supervised learning problems. Rather, they can be framed as discovery problems in which what is sought is rare.
In these settings, the dominant modes of the distribution are often scientifically uninteresting because they represent more of what we already know. In 2026, I hope we finally crack the code on discovery, moving to techniques that can tame the tail of the distribution and even discover meaningful things that are out of distribution. The goal is to find things that nature allows but we haven’t yet seen.
To make this leap from interpolation to discovery, the AI community must prioritize a fundamental shift in the objective functions that drive machine learning. We need to move beyond maximizing accuracy and probabilistic likelihoods, objectives that inherently drive models toward interpolation and collapse to the dominant modes of the data distribution. Instead, we need to elevate diversity to a first-class objective rather than treating it solely as a vague sociotechnical concept tied to fairness.
At my lab, Vertaix, we have led this thread of research by developing the Vendi Score. In our research on materials discovery, we found that optimizing the Vendi Score allowed us to identify stable, energy-efficient MOFs that standard search methods missed because they could not effectively explore a search space that spans trillions of materials.
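For readers unfamiliar with the metric, here is a minimal sketch of how a Vendi Score can be computed. The similarity kernel and toy data below are illustrative assumptions, not code from Vertaix; the score is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix, so it ranges from 1 (all samples identical) to n (all samples fully dissimilar).

```python
import numpy as np

def vendi_score(X, kernel=lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues of K / n,
    where K[i, j] = kernel(X[i], X[j]) and kernel(x, x) = 1."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros; 0 * log 0 = 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# A diverse set scores near n; a redundant set scores near 1.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(10, 3))
redundant = np.tile(diverse[:1], (10, 1))
print(vendi_score(diverse), vendi_score(redundant))  # roughly 10 vs. 1
```

In a search or generation loop, this score (or a differentiable estimate of it) can serve as the kind of diversity objective described here, rewarding candidate sets that cover more of the space rather than more of the same mode.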
In 2026, we should stop treating diversity merely as a secondary evaluation metric and start treating it as the primary mathematical engine for discovery. If we make this shift, AI will cease to be just an imitator of human knowledge and become a true partner in expanding it.
Adji Bousso Dieng is founder of the Vertaix research lab at Princeton University and co-principal investigator of the National Science Foundation Institute for Data-Driven Dynamical Design. She is founder of The Africa I Know, a nonprofit that supports STEM education for young Africans.
Education That Works With — Not Against — AI
by Juan M. Lavista Ferres
A little more than three years ago, OpenAI released ChatGPT, and education changed forever. For students, the ability to generate fluent, credible text on demand in seconds is an incredible new tool. For educators, it is a new kind of challenge. In the coming year, I hope the education community will make peace with AI as an educational tool and focus on developing reliable ways to evaluate student performance in the era of generative media.
In the months that followed ChatGPT’s arrival, a comforting story was widely shared: If generative AI could write essays, then we could build AI detectors to identify them. Some early studies reported near-perfect accuracy in controlled settings. The implicit promise was appealing: teachers would not need to rethink assessment. We could keep the same workflows, the same assignments, the same enforcement model.
That hope was an illusion. In a lab, these systems can perform very well. But their performance assumes that students will submit the raw model output. They won’t. The moment there is a detector, students have an incentive to evade it. And evasion is not difficult. Rewrite a paragraph. Add a few typos. Change sentence lengths. Reorder sections. Insert personal anecdotes. Translate and re-translate. Or use any of the growing set of tools that exist to rewrite AI output to look “human.”
This is the structural problem: If you can build a system that detects AI-generated text, then you can use that system to train a system that defeats it. The moment a detector is deployed, entrepreneurs will build products to break it, and students will learn to use them.
But the biggest problem is not designing effective detectors. It is maintaining trust. If educators rely on detector scores and students rely on programs designed to defeat detectors, educators are pushed into suspicion and adjudication. You end up confronting students, navigating appeals, and making high-stakes judgments without reliable evidence. You risk harming students, especially non-native English speakers, and students who have learned to follow certain academic conventions. Meanwhile, the students most committed to misuse will adapt fastest. So in practice, detection can penalize the wrong people while failing to deter the most sophisticated evasion.
Generative AI can improve learning. It can help students practice, give feedback, and deliver tutoring. It can translate material into a student’s own language and help personalize learning at scale.
But we need to be realistic. The traditional take-home essay, used as a universal proof of independent authorship, is broken. Verifying independent authorship through text alone no longer works at scale. Universities and schools should assume students will use generative AI, and they need assessment models that still work in that reality.
A few practical moves:
The genie is out of the bottle. There is no way to put it back. Our job now is to build the rules and practices that make education more effective, and more trustworthy, in the world we actually live in.
Juan M. Lavista Ferres is chief data scientist at Microsoft and a corporate vice president. He directs the Microsoft AI for Good Lab and the Microsoft AI Economy Institute.
From Prediction to Action
by Tanmay Gupta
AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.
Over the last decade, we have become extraordinarily good at passive prediction and generative modeling — producing bounding boxes and segmentation masks for objects in images, transcribing audio into text, or generating fluent paragraphs and images on command. These are impressive achievements, yet they remain proxy tasks: tasks that are often assumed to represent real-world economic utility. This is a fallacy. The world’s economically meaningful tasks do not end at a single prediction or generation from a single input. They require taking a sequence of actions (each of which may be a function of predictions or generations from one or more models) in complex, dynamic environments where each action shapes the state of the environment and hence subsequent actions.
In 2026, AI research must move decisively from solving these proxy tasks to the corresponding long-horizon realistic tasks that these proxy tasks loosely approximate. Consider how coding has evolved: Models once autocompleted lines, but modern coding agents increasingly take a high-level specification, search through a codebase, run tests, and return a working solution with minimal human intervention.
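As a concrete illustration of that shift from single predictions to goal-directed loops, here is a minimal, hypothetical sketch of such a coding agent. The propose_fix function stands in for a call to a code-generation model and is not a real API, and pytest is assumed as the project's test runner.

```python
import pathlib
import subprocess

def propose_fix(spec: str, feedback: str) -> dict:
    """Hypothetical placeholder for a code-generation model call.
    Returns a mapping of file paths to proposed new file contents."""
    raise NotImplementedError

def coding_agent(spec: str, max_iters: int = 5) -> bool:
    """Minimal agent loop: propose an edit, run the tests, feed failures back."""
    feedback = ""
    for _ in range(max_iters):
        for path, content in propose_fix(spec, feedback).items():
            pathlib.Path(path).write_text(content)       # apply the proposed edit
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True                                   # tests pass: goal reached
        feedback = result.stdout + result.stderr          # retry with failure output
    return False                                          # give up after max_iters
```

The point is not the specifics but the structure: each action (an edit, a test run) changes the state of the environment, and the next action depends on that new state.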
I hope we can bring this evolution — from generating proxies to accomplishing goals — to other domains. For example, vision models should be studied as parts of larger systems that use visual input streams to drive digital (web/computer use) and physical (embodied) workflows, monitor processes, or extract insights. Speech systems need to be studied as part of intelligent conversational assistant architectures that understand objectives conveyed through conversation and interface with digital or physical tools to fulfill them. Image- and video-generation models should be studied as parts of systems that generate, say, long-form visual educational content from existing documents or marketing material for products or research artifacts.
Shifting focus to these long-horizon tasks and goal-oriented AI systems has two major benefits. First, it exposes the limitations and pain points of current AI models when we use them to construct these larger systems and pipelines. These goal-oriented AI systems need more than predictive or generative capability. They require persistent memory, the ability to focus on a goal over a long time horizon, responsiveness to real-time human feedback, and the ability to cope with uncertainty in an evolving environment. They also require effective interfacing with a wide variety of multimodal information sources, tool calling, the ability to hypothesize and reason, continual learning, self-improvement, and more. Many of these gaps in capability are invisible on short-horizon or single-step predictive tasks but reveal themselves in more complex and realistic long-horizon scenarios. We need better ways to evaluate these aspects of intelligence and methods to improve them.
Second, this goal-centric reframing aligns AI research with end-task utility. By directly trying to solve real end tasks, researchers are less likely to be led astray by the siren song of seemingly useful proxy tasks that ultimately prove incapable of solving real tasks. For instance, for years NLP researchers assumed semantic parsing to be an important component of natural language understanding systems. Today's LLMs are capable of sophisticated language understanding and manipulation without ever explicitly performing semantic parsing. In hindsight, those research hours might have been better spent trying to solve the end task directly rather than chasing the proxy metric of semantic parsing accuracy.
Real digital or physical tasks unfold over minutes, hours, months, and sometimes years. Humans have the extraordinary capability of consolidating diverse information collected over extended periods of time into a consistent world-view that drives the execution of complex goals in evolving environments. The technological advancements in deep learning over the last decade, particularly in LLMs and VLMs, have set the stage for the AI research community to take a serious shot at replicating this ability in silicon over the next decade. In the last couple of years alone, we have seen the rise of LLM-powered agentic systems that automate well-defined workflows. Tackling the underspecified, ill-defined, undiscovered, and unimagined is the next frontier.
Tanmay Gupta is a senior research scientist at the Allen Institute for Artificial Intelligence. He is a co-author of “Visual Programming: Compositional Visual Reasoning Without Training,” which won best paper at CVPR 2023. His work spans multimodal agents, coding agents, VLMs, and VLAs.
Multimodal Models for Biomedicine
by Pengtao Xie
Over the past few years, we have seen rapid progress in models that jointly reason over text, images, sequences, graphs, and time series. Yet in biomedical settings, these capabilities often remain fragmented, brittle, or difficult to interpret. In 2026, I hope the community moves decisively toward building multimodal models that are not only powerful but also scientifically grounded, transparent, and genuinely useful to biomedical discovery and clinical decision-making.
Chatbots That Build Community
by Sharon Zhou
Next year, I’m excited to see AI break out of 1:1 relationships with each of us. In 2026, AI has the potential to bring people together and unite us with human connection, rather than polarize and isolate us. It’s about time for ChatGPT to enter your group chats.
The internet today feels like it's getting pushed toward two extremes. On one end, heavy AI slopification paints a strictly worse, noisier version of our former internet, with bots participating in forums and scraping data (getting DDoS'ed by AI scrapers roughly a million times a day is no longer unusual). On the other end, heavy human curation tries to keep the LLMs out as much as possible.
But this tension doesn't have to be adversarial. It can be integrative instead. AI can be designed to connect people and strengthen human connections. The bot in the chat becomes a positive, uniting force rather than a neutral assistant or a deceptive agent. To accomplish this, researchers will need to change some things, like post-training on longer contexts and building different reinforcement learning environments to handle multi-human contexts and objectives. But it can be done, and I believe it will introduce new heights of intelligence, human and artificial.
As you talk to your LLM at 3:00 A.M. about solving a relationship problem and how it’s like debugging your code, your LLM asks you whether you want to talk to someone else who feels the same way. You think, “well, I thought my problem was niche at this hour, but why not.” What’s more, the LLM isn’t just there to make the intro. It joins your chat, making jokes with funny memes and asking interesting questions to make the conversation lively and full of curiosity — until you realize you’ve made a couple of friends, fixed your bug, and have a new lens for approaching your relationship. You’ve learned something helpful for your job and your personal life. And it’s only 3:15 A.M.
Curiosity accelerates when it's shared. It's infectious. It's easier to learn when you're motivated by a group and by where it's trying to go, reach, and explore. As a collective tool, AI can further our curiosity and creativity together. And there's a chance that some of those enlightening conversations will be the new data needed to lift AI's intelligence.
It would be quite the win-win if we designed a future where AI is incentivized to bring people together and give them a sense of belonging with each other, and in so doing got people inventing more things and growing our collective intelligence, generating data that pushes models forward in ways that benchmarks on isolated chats don't. This might even motivate new model architectures, like an extreme MoE (mixture of experts) with lightweight, partially shared weights for each person and their multi-dimensional self, a more evolved version of today's scratchpad memory.
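As a purely speculative sketch of that last idea, assuming PyTorch, one could imagine a layer whose weights are mostly shared across everyone, with routing by user id to a tiny per-person low-rank "expert." The class name, shapes, and routing scheme below are all hypothetical, not a description of any existing system.

```python
import torch
import torch.nn as nn

class PerUserSharedLayer(nn.Module):
    """Speculative sketch: one weight matrix shared by all users plus a
    lightweight low-rank delta per user, selected by user id (the 'router')."""
    def __init__(self, dim: int, num_users: int, rank: int = 4):
        super().__init__()
        self.shared = nn.Linear(dim, dim)                           # shared by everyone
        self.user_A = nn.Parameter(torch.zeros(num_users, dim, rank))
        self.user_B = nn.Parameter(torch.randn(num_users, rank, dim) * 0.01)

    def forward(self, x: torch.Tensor, user_id: int) -> torch.Tensor:
        personal = x @ self.user_A[user_id] @ self.user_B[user_id]  # per-user path
        return self.shared(x) + personal

layer = PerUserSharedLayer(dim=64, num_users=3)
x = torch.randn(2, 64)
print(layer(x, user_id=1).shape)  # torch.Size([2, 64])
```

The design choice here is that almost all capacity stays shared, so the per-person state stays cheap to store and update, much like a learned, structured version of scratchpad memory.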
Today, these advances are within reach and this future is entirely viable, which is why it excites me. I hope this year we take a step toward making AI a more positive force for humanity at large and for our individual humanity. This is one path we can take in that direction.
Sharon Zhou is the Corporate Vice President of AI at AMD. Formerly, she was founder and CEO at Lamini and an adjunct instructor at Stanford University.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.