Dear friends,
I’ve been hearing from people at all levels of seniority about feelings of job insecurity. High-school students wonder if there will be a job for them, engineers worry about keeping up, and C-level executives wonder if they’ll manage to help their businesses transform with AI. Amid the frenetic pace of AI advances and multiple sources of geopolitical uncertainty, the future feels less certain at this moment than at any other I can recall. In moments like this, I think about what we can build to take advantage of the exciting possibilities ahead. But I also think about what is stable that I can count on, such as community and skills. To those of you who are navigating an uncertain environment, I hope this letter brings some comfort.
Jeff Bezos famously said that knowing what’s not going to change in the next 10 years creates a stable foundation on which to build businesses. Many things in the world will still be the same in 10 years as they are now. For individuals who are worried about job security, I would put forward two things that I think will be stable in that timeframe: community and skills.
First, 10 years from now, I know that my friends and family will still be there for me. Just as I know that no matter what, I will still be there for them. Relationships can be incredibly durable. During Covid, many communities bonded and supported each other. In moments of uncertainty, having communities — networks of relationships — helps everyone. That’s why opportunities to build relationships are so valuable and help us both get more done and protect ourselves against downside risks.
I find in-person gatherings, where we can make new friends and refresh existing relationships, especially helpful. If you would like to attend a high-quality AI event, please come to AI Dev, which will be held on April 28-29 in San Francisco. I’ve really enjoyed meeting people there — in fact, AI Aspire, an advisory firm I co-lead, started from a meeting at a previous AI Dev! — and I’m confident that you will, too.
Keep building!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
In our latest course, made in collaboration with Oracle, you’ll build a complete agent memory system that lets an LLM store, retrieve, and refine knowledge across sessions, turning a stateless agent into one that learns and improves over time. Sign up here
News
Drones Hit Persian Gulf Data Centers
Iran hit at least three Amazon data centers in the Middle East, an indicator of AI’s critical role in the United States’ war against Iran and possibly the first time such facilities have been targeted during warfare.
What happened: Iranian drones damaged an Amazon Web Services (AWS) facility in Bahrain and two in the United Arab Emirates (UAE), disrupting online services including banking, payments, ride sharing, food delivery, and business software. The U.S. military uses AWS to run the unclassified version of Anthropic Claude and possibly other computing systems, but it didn’t disclose whether the attacks affected its operations.
Drone attacks: Early on March 1, drones struck two AWS data centers in the UAE, and the Bahrain data center suffered damage shortly afterward. Amazon said the Bahrain attack was “a drone strike in close proximity to one of our facilities,” while Iran said it had targeted the facility “to identify the role of these centers in supporting the enemy’s military and intelligence activities,” according to Iran’s state-controlled Fars News Agency via the messaging service Telegram.
Behind the news: The attacks on data centers reflect AI’s growing role in warfare. Despite the U.S. military’s recent decision to ban defense uses of Anthropic’s Claude large language model, U.S. forces continue to use Claude and other systems for a variety of purposes in Iran and elsewhere. For its part, Iran uses weaponized drones that have some degree of autonomy.
Yes, but: As AI raises the pace of military decision-making, it also raises the risk of deadly mistakes. For example, during the initial wave of air strikes on Iran, a bomb destroyed a school, killing more than 170 people, mostly children. Preliminary findings from a subsequent investigation indicate that U.S. forces likely dropped the bomb. Out-of-date targeting data may have played a role: the building had been part of a nearby naval base until roughly 15 years ago.
Why it matters: The sharp rise of AI-enabled warfare signals a shift in the pace of combat from human speed to machine speed. AI makes it practical to plan missions by running vast numbers of simulations to identify the actions most likely to succeed. It accelerates battlefield decisions and actions while potentially reducing the so-called fog of war that can obscure realities on the ground. Missions that once were impractical, because no human could analyze the flood of battlefield communications, imagery, and other information, become viable. The acceleration could shorten some phases of conflict, yet it adds pressure to make snap decisions that could have catastrophic consequences.
We’re thinking: AI-generated recommendations don’t remove the need to verify intelligence, question assumptions, and weigh the moral and strategic consequences of using force.
Qwen3.5 Outperforms Bigger Models, Leads Vision Benchmarks
The Qwen3.5 family of open-weights vision-language models includes impressive larger models as well as a smaller one that outperforms an OpenAI open-weights model 10 times its size.
What’s new: Alibaba released the Qwen3.5 family of eight open-weights vision-language models.
Qwen3.5 specs: The new models offer tool use, web search, and chain-of-thought reasoning in more than 200 natural languages.
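If Alibaba publishes the weights on Hugging Face, as it has for previous Qwen releases, running one of the models locally might look like the sketch below. The repo id is our guess, and the snippet sticks to plain text generation via the standard transformers API, skipping the vision and tool-use features.

```python
# A minimal sketch of local inference with Hugging Face transformers.
# The repo id is hypothetical; check Alibaba's official model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-9B"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain chain-of-thought reasoning in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response and decode only the newly produced tokens.
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```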
How it works: Alibaba shared little information about how it built the Qwen3.5 family.
Results: In Alibaba’s tests, all Qwen3.5 models excelled at vision tasks, outperforming much larger models, and some turned in competitive results on language tasks as well. Qwen3.5-9B and Qwen3.5-4B were the standouts, shining on both vision and language tasks. Alibaba did not publish comparative results for the two smallest variants.
Behind the news: Shortly after rolling out the Qwen3 family, Lin Junyang, the team’s technical lead and a key architect of the models, abruptly resigned with a post on the X social network that read, “Bye my beloved qwen.” The Chinese tech-news outlet 36kr.com subsequently reported that four other members of the team resigned in his wake. In a January public appearance, Lin had said, “We are stretched thin — just meeting delivery demands consumes most of our resources,” Bloomberg reported. Alibaba responded by putting the Qwen project under tighter supervision by senior leadership and promising to invest further in AI development.
Why it matters: All Qwen3.5 models deliver stellar vision performance for their sizes, but the smaller models — especially Qwen3.5-9B — are small enough to run on consumer laptops while delivering performance that previously required an 80 gigabyte GPU like the Nvidia H100.
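Some back-of-the-envelope arithmetic (ours, not Alibaba’s) shows why a 9-billion-parameter model fits on a laptop:

```python
# Rough memory math for a 9B-parameter model (illustrative estimates only).
params = 9e9

fp16_gb = params * 2 / 1e9    # 2 bytes per weight at 16-bit precision
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight at 4-bit quantization

print(f"FP16 weights:  {fp16_gb:.1f} GB")   # ~18 GB
print(f"4-bit weights: {int4_gb:.1f} GB")   # ~4.5 GB
# With 4-bit weights plus headroom for activations and the KV cache, the
# model fits within the 16 GB of RAM common on consumer laptops.
```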
We’re thinking: Vision-language models with reasoning capability that are small enough to run locally mean reduced cost, better privacy, and new vistas for vision-language applications.
Learn More About AI With Data Points!
AI is moving faster than ever. Data Points helps you make sense of it just as fast. Data Points arrives in your inbox twice a week with six brief news stories. This week, we covered Perplexity’s expansion into autonomous AI agents across desktop and enterprise tools and Nvidia’s secure infrastructure for deploying AI agents in production. Subscribe today!
DeepSeek Snubs Nvidia for Huawei
DeepSeek, the Chinese developer of outstanding open-weights models, has withheld an upcoming update of its flagship model from U.S. chip makers, a move that intensifies the AI rivalry between the U.S. and China.
What’s new: DeepSeek has not yet given Nvidia and AMD an opportunity to make sure its upcoming DeepSeek-V4, which is in the final stages of development, will run smoothly on their chips, a departure from typical practice prior to major model updates. However, it did share a prerelease version of the model with Huawei, giving the Chinese chip maker several weeks to optimize the software for its hardware, Reuters reported. Reuters did not report DeepSeek’s reasoning for the decision.
How it works: Chip makers typically examine new models ahead of their release to make sure inference runs efficiently on their hardware. In the past, DeepSeek has worked closely with Nvidia to train its models.
Behind the news: For years, the U.S. has tried to slow China’s AI effort by restricting exports of advanced chips and the equipment needed to produce them. But that effort has largely backfired by spurring China to build its domestic chip industry, and China’s government has taken steps to encourage or require companies in that country to use domestic chips.
Why it matters: While DeepSeek’s decision to withhold prerelease access to DeepSeek-V4 from U.S. chip makers may be more symbolic than significant, it deepens the divide between the portions of the AI community that are based in the U.S. and in China. The decision aligns with China’s long-standing goals of technological self-sufficiency, so critical AI capabilities remain available regardless of adversaries’ efforts to block them.
We’re thinking: The possibility that DeepSeek trained its latest model using Nvidia chips is one among several indicators that export restrictions alone are not stopping international rivals from gaining access to U.S. chips. The world would benefit more from negotiated limits, mutual cooperation, and free exchange of ideas, technology, and trade.
A Single Tokenizer for Visual Media
Multimodal models typically use different tokenizers to embed different media types, and different encoders when training to generate media rather than classify it. A team at Apple created a unified tokenizer that maps images, videos, and 3D objects into a shared token space, along with a shared encoder that performs well at both identifying such objects and generating them.
What’s new: Jiasen Lu, Liangchen Song, and colleagues at Apple trained AToken, a transformer model with an all-purpose visual tokenizer. The new model can both generate and classify images, videos, and 3D objects, approaching the performance of models specialized for each of these input and output types.
Key insight: Image generation models use encoders (like VAEs or VQ-VAEs) that preserve visual details (is the cat’s or ball’s surface orange?) but discard semantics (is it a cat or a ball?), and therefore don’t recognize objects as well as classification models do. Image classification models, on the other hand, use encoders (like CLIP or SigLIP) that capture types of objects (say, “cat” or “ball”) but miss visual details, so they are worse at generation. Moving from still images to video and 3D complicates matters further: before they’re encoded, video and 3D media typically require separate tokenizers, each with its own architecture and embedding space, to break the input into data an encoder can process. If the three media types are tokenized into a single format, one transformer can learn to work with all of them. Further, training the model both to reconstruct these media types and to align them with matching text descriptions forces embeddings to retain fine visual details as well as semantics, eliminating the need for separate generation and classification models.
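As a conceptual sketch (our illustration, not Apple’s code), the dual objective might combine a pixel-level reconstruction loss with a CLIP-style contrastive alignment loss; `encoder`, `decoder`, and `text_encoder` below are stand-ins for AToken’s components:

```python
import torch
import torch.nn.functional as F

def dual_objective(encoder, decoder, text_encoder, media, text, alpha=1.0):
    """Sketch of a joint loss: reconstruction preserves visual detail,
    text alignment preserves semantics. Not Apple's released code."""
    tokens = encoder(media)                           # shared token space
    recon_loss = F.mse_loss(decoder(tokens), media)   # keep fine visual detail

    visual = F.normalize(tokens.mean(dim=1), dim=-1)  # pooled visual embedding
    textual = F.normalize(text_encoder(text), dim=-1) # caption embedding
    logits = visual @ textual.T / 0.07                # temperature-scaled similarity
    labels = torch.arange(visual.shape[0], device=visual.device)
    align_loss = F.cross_entropy(logits, labels)      # match media to its caption

    return recon_loss + alpha * align_loss
```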
How it works: AToken consists of a pretrained SigLIP2 vision encoder (400 million parameters), extended here from two dimensions to four, and an untrained decoder of the same size. The authors trained AToken to reconstruct inputs and align their embeddings to text using three image sets (two public and one private), three public sets of videos, and two public sets of 3D objects, all paired with matching text. They trained on this data in three stages: first images, then videos, and finally 3D objects.
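The staged curriculum might then look like the toy loop below, with synthetic batches and stand-in modules (all names hypothetical), reusing `dual_objective` from the previous sketch:

```python
import torch

# Toy stand-ins for AToken's components (the real encoder is SigLIP2-based).
enc = torch.nn.Linear(16, 8)   # "encoder": maps patch features to tokens
dec = torch.nn.Linear(8, 16)   # "decoder": reconstructs patch features
txt = torch.nn.Linear(4, 8)    # "text encoder": embeds caption features
opt = torch.optim.AdamW([*enc.parameters(), *dec.parameters(), *txt.parameters()])

def toy_batches(n):  # synthetic (media, text) pairs standing in for real data
    return [(torch.randn(2, 3, 16), torch.randn(2, 4)) for _ in range(n)]

# Stage order follows the description above: images, then videos, then 3D.
for stage, loader in [("images", toy_batches(5)),
                      ("videos", toy_batches(5)),
                      ("3d", toy_batches(5))]:
    for media, text in loader:
        loss = dual_objective(enc, dec, txt, media, text)
        opt.zero_grad()
        loss.backward()
        opt.step()
```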
Results: AToken matched or closely approached state-of-the-art models that process images, videos, or 3D.
Why it matters: A major innovation of large language models is their use of a single tokenizer for all language inputs, whether code, dialogue, tables, or books. This generality helps a model transfer knowledge from one data source to another during training: when models get better at understanding or generating text, they get better at code, too. AToken offers similar generality for vision models, particularly when it comes to 2D and 3D media. AToken’s strong performance at generating and reconstructing multiple visual media types suggests that here, too, a shared tokenizer and encoder could allow improvements in one modality to carry over to the others.
We’re thinking: A model like AToken may prove helpful for generating synthetic 3D and video data. Models that generalize from one media type to another typically reduce the total amount of data needed to train for each task. For example, high-quality, well-labeled, two-dimensional image data is abundant compared to video and 3D data, both of which are essential to robotics applications.
Work With Andrew Ng
Join the teams that are bringing AI to the world! Check out job openings at DeepLearning.AI, AI Fund, and Landing AI.
Subscribe and view previous issues here.
Thoughts, suggestions, feedback? Please send to thebatch@deeplearning.ai. Avoid our newsletter ending up in your spam folder by adding our email address to your contacts list.