Andrew Ng’s Post

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI

How would you define Data-centric AI development? It'd be useful for all of us collectively to write a sharable definition. Here's my attempt: Data-centric AI is the practice of systematically engineering the data used to build AI systems. What do you think/any suggestions? Please also 👍 anyone else's suggestions that you see and like!

250 Comments

Ian🤗 Yu

Pragmatic ML Engineer in the Age of Generative AI

Rather than only writing a definition, I propose we also show what's possible, especially by connecting with industry. Many data scientists I talk to don't see the value of a data-centric approach because it reads like data cleaning and data augmentation, which is just standard, quality data science practice. But it's not! There's a lot of conversation that we could have around:
- Taxonomy of data
- Statistical data quality control
- Data quality control at scale
- Various augmentation considerations...
We have collected some resources that explain why there's much more to data-centric AI here: Data-centric Natural Language Processing https://ai.science/l/89be6ae3-db92-4a4e-a93e-2aac0551a9a9

19 Reactions

Line Clemmensen

Professor ML/AI/Data driven innovation | Co-founder

First, I am very excited to see you, and others, focus on DCAI, and I am certain it will spin off extremely valuable contributions to the AI field. I like your original definition, but would like to add something on evaluation, mostly because I think there is also a need to develop new ways of evaluating and measuring performance as we embark on this journey, focusing on representative and high quality data. Something along the lines of: Data-centric AI is the practice of systematically engineering and evaluating the data used to build AI systems.

6 Reactions

David R. Winer

Co-founder of Cerbrec | PhD in Computer Science | Techstars Boston '24 | AI + ML Engineer | Co-creator of Graphbook

Data-centric AI development is a stance that data is part of the configuration of the model, just like the model architecture. What do we do when customers are in charge of their own data e.g. as part of a self service dialogue system? Can the system be data-centric without enforcing a systematic practice by the user?

7 Reactions

🌏 Gregorio Ferreira 🚀

Principal MLOps Engineer / Head of AI/ML CoE Unit | AI for Society

Data-centric AI is the ability to design data acquisition systems that directly feed AI-based processes that enrich (or support) automation/information systems. A company whose management focuses its investments on data acquisition instead of new/more models would become data-centric and data-driven. What do you think, Victor Jaramillo, PhD?

27 Reactions

Peter Cotton

Chief Data Scientist at ExodusPoint Capital Management, LP

I fear that the data science terminology industry is becoming like the fund-of-funds industry. Pretty soon there will be more phrases describing combinations of existing concepts, and combinations of combinations, than there are underlying concepts themselves.

30 Reactions

Andrew Elmsley

Head of Engineering @ Invert | PhD, AI, Software Engineering

"systematically engineering the data" could sound like cherry-picking examples. How about: Data-centric AI is the practice of systematically interrogating the data used to build AI systems. It aims to ensure that the data is clean, correct, fair and unbiased.

12 Reactions

Adam Richardson

Assistant Professor of Cybersecurity at Lansing Community College

Here's my attempt: Data-centric AI - the prioritization of data engineering principles throughout the process of building AI systems.

9 Reactions

Gino Tesei

AI relies on the agile practice of iterating on models and data to build AI systems. We have two extremes. At one extreme, model-centric AI is the practice of engineering better models as the focus of an iteration. At the other extreme, we have data-centric AI, which is the practice of engineering better data as the focus of an iteration. They are not mutually exclusive but, on the contrary, complementary. The question is not which one is right for a given organization but rather what the right mix is.

12 Reactions

Dr. Madalasa Venkataraman

Snr Director, Data Science/Data Engineering Unity, CX Marketing, Oracle

Interesting question. The way I see it, there are three steps. The first is the constant scan for new data that adds value to the model. Any data science model, however well built, needs to constantly innovate with new data because this brings innovative insights. This ‘new data source’ may not be perfect. The second is strengthening the data collection for existing data sources that are utilised in the model. A lot of answers here address this. This gives more consistent model outcomes for the business. The third is controlling the data that goes into productionizing the model. Here we have data quality control, data drift monitoring, and lots of help from MLOps. This gives fewer headaches in production pipelines but does not keep up with data trends (which are essentially user trends captured in data): noise in data is not ‘just noise in data’ but changes in user behaviour and trends. While (1), new data source identification, is done through academics, (2) is the purview of our software systems and (3) is the MLOps and data scientist domain. Data science needs to encompass and influence all three.

7 Reactions

Boris Korolkov

EMEA OEM Sales Leader, Data&AI, IBM Technology

I can hardly imagine AI which is not Data-centric. And agree with Mateus, data quality (and overall DataOps approach) should be considered as one of the key items to Data-centric AI, in addition to systematically engineering data.

20 Reactions


More Relevant Posts

Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
20h
Meta released Llama 3 on my birthday! 🎂 Best present ever, thanks Meta! 😀

528 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
21h
Multi-agent collaboration has emerged as a key AI agentic design pattern. Given a complex task like writing software, a multi-agent approach would break down the task into subtasks to be executed by different roles -- such as a software engineer, product manager, designer, quality assurance engineer, and so on -- and have different agents accomplish different subtasks.
Different agents might be built by prompting an LLM to carry out different tasks. For example, to build a software engineer agent, we might prompt: "You are an expert in writing clear, efficient code. Write code to perform the task …".
It might seem counterintuitive that, although we are making multiple calls to the same LLM, we apply the programming abstraction of using multiple agents. I'd like to offer a few reasons:
- It works! Many teams are getting good results with this method, and there's nothing like results! Ablation studies (for example, in the AutoGen paper cited below) show that multiple agents give superior performance to a single agent.
- Even though some LLMs today can accept very long input contexts, their ability to truly understand long, complex inputs is mixed. An agentic workflow in which the LLM is prompted to focus on one thing at a time can give better performance.
- It gives us a framework for breaking down complex tasks. When writing code to run on a single CPU, we often break our program up into different processes or threads. This lets us decompose a task -- like implementing a web browser -- into subtasks that are easier to code. Multi-agent roles are, similarly, a useful abstraction.
In companies, managers decide what roles to hire and then how to split complex projects into smaller tasks to assign to employees with different specialties. Using multiple agents is analogous.
Each agent implements its own workflow, has its own memory (itself a rapidly evolving area in agentic technologies -- how can an agent remember enough of its past interactions to perform better on upcoming ones?), and may ask other agents for help. Agents themselves can also engage in Planning and Tool Use.
While managing people is hard, it's a sufficiently familiar idea that it gives us a mental framework for how to "hire" and assign tasks to our AI agents.
Frameworks like AutoGen, Crew AI, and LangGraph provide rich ways to build multi-agent solutions. If you're interested in playing with a fun multi-agent system, check out ChatDev, an open source implementation of a set of agents that run a virtual software company. While it may not always produce what you want, you might be amazed at how well it does!
To learn more, I recommend:
- Communicative Agents for Software Development, Qian et al. (2023) (the ChatDev paper)
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, Wu et al. (2023)
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, Hong et al. (2023)
[Original text: https://lnkd.in/g7gShRnf ]
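The idea of building several agents by prompting the same LLM with different role prompts can be sketched roughly as follows. This is only an illustration under stated assumptions, not how AutoGen, Crew AI, LangGraph, or ChatDev work internally: `call_llm` is a hypothetical stand-in for a real model call (stubbed here so the sketch runs offline), and the `Agent` class, role names, and prompts are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for a real LLM call; it just echoes its inputs."""
    role = system_prompt.split(".")[0]
    return f"[{role}] response to: {user_message}"


@dataclass
class Agent:
    """One role: the same underlying LLM, a different system prompt, its own memory."""
    name: str
    system_prompt: str
    llm: Callable[[str, str], str] = call_llm
    memory: List[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # Prepend this agent's own past interactions as context (a very naive memory).
        context = "\n".join(self.memory)
        message = f"{context}\n{task}".strip()
        result = self.llm(self.system_prompt, message)
        self.memory.append(f"{task} -> {result}")
        return result


def run_pipeline(agents: List[Agent], task: str) -> List[Tuple[str, str]]:
    """Pass the task through the agents in order; each one sees the previous output."""
    outputs = []
    current = task
    for agent in agents:
        current = agent.run(current)
        outputs.append((agent.name, current))
    return outputs


# Assumed example roles; a real system would choose roles to fit the task.
team = [
    Agent("product_manager", "You are a product manager. Turn the request into a short spec."),
    Agent("engineer", "You are an expert in writing clear, efficient code. Write code to perform the task."),
    Agent("qa", "You are a quality assurance engineer. Review the code and list defects."),
]
```

Calling `run_pipeline(team, "Build a to-do app")` returns one `(name, output)` pair per role. A fixed hand-off order is the simplest possible design; the frameworks mentioned above support richer patterns such as agents asking each other for help.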

AI Agents With Low/No Code, Hallucinations Create Security Holes, and more

deeplearning.ai

66 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
3d
LLMs can take gigabytes of memory to store, which limits what can be run on consumer hardware. But quantization can dramatically compress models, making a wider selection of models available to developers. You can often reduce model size by 4x or more while maintaining reasonable performance. In our new short course Quantization Fundamentals taught by Hugging Face's Younes Belkada and Marc Sun, you'll:
- Learn how to quantize nearly any open source model
- Use int8 and bfloat16 (Brain float 16) data types to load and run LLMs using PyTorch and the Hugging Face Transformers library
- Dive into the technical details of linear quantization to map 32-bit floats to 8-bit integers
As models get bigger and bigger, quantization becomes more important for making models practical and accessible. Please check out the course here: https://lnkd.in/g66yNW8W
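To make "linear quantization to map 32-bit floats to 8-bit integers" concrete, here is a minimal pure-Python sketch of one common variant (asymmetric, per-tensor, with a scale and a zero point). It is an illustration only and is not claimed to be the scheme taught in the course or used by any particular library; the function names are made up for this example.

```python
from typing import List, Tuple


def quantize_linear(values: List[float], qmin: int = -128, qmax: int = 127
                    ) -> Tuple[List[int], float, int]:
    """Map floats to integers in [qmin, qmax] using q = round(x / scale) + zero_point."""
    lo, hi = min(values), max(values)
    # The scale is the float width of one integer step; guard against a constant tensor.
    scale = (hi - lo) / (qmax - qmin) if hi != lo else 1.0
    # The zero point shifts the range so that `lo` lands near qmin.
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    quantized = [
        max(qmin, min(qmax, int(round(v / scale)) + zero_point))
        for v in values
    ]
    return quantized, scale, zero_point


def dequantize_linear(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Approximately recover the floats: x is roughly scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q_weights, scale, zero_point = quantize_linear(weights)
restored = dequantize_linear(q_weights, scale, zero_point)
```

Each weight is stored in 8 bits instead of 32, which is where the roughly 4x size reduction comes from (ignoring the small overhead of storing `scale` and `zero_point`). The restored values differ from the originals by at most about one quantization step.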

112 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
4d
Planning is a key agentic AI design pattern in which we use a large language model (LLM) to autonomously decide on what sequence of steps to execute to accomplish a larger task. For example, if we ask an agent to do online research on a given topic, we might use an LLM to break down the objective into smaller subtasks, such as researching specific subtopics, synthesizing findings, and compiling a report. Many people had a “ChatGPT moment” shortly after ChatGPT was released, when they were surprised that it significantly exceeded their expectation of what AI can do. If you have not yet had a similar “AI Agentic moment,” I hope you will soon! I had one several months ago, when I presented a live demo of a research agent I had implemented that had access to various online search tools. I had tested this agent multiple times privately, during which it consistently used a web search tool to gather information and wrote up a summary. During the live demo, the web search API unexpectedly returned with a rate limiting error. I thought my demo was about to fail publicly. To my surprise, the agent pivoted deftly to a Wikipedia search tool — which I had forgotten I’d given it — and completed the task using Wikipedia instead of web search. This was an AI Agentic moment of surprise for me. It’s a beautiful thing when you see an agent autonomously do things in ways that you had not anticipated, and succeed as a result! Many tasks can’t be done in a single step. For example, to simplify an example from the HuggingGPT paper (cited below), if you want an agent to examine a boy's picture and draw a picture of a girl in the same pose, the task might be decomposed into two steps: (i) detect the boy's pose and (ii) render a picture of a girl in the detected pose. An LLM might be fine-tuned or prompted (with few-shot prompting) to specify a plan by outputting a string like "{tool: pose-detection, input: image.jpg, output: temp1 } {tool: pose-to-image, input: temp1, output: final.jpg}". 
This structured output triggers software to invoke a pose detection tool followed by a pose-to-image tool to complete the task. (This example is for illustrative purposes only; HuggingGPT uses a different format.)
Admittedly, many agentic workflows do not need planning. For example, you might have an agent reflect on, and improve, its output a fixed number of times, resulting in a set of fixed, deterministic steps. But for complex tasks in which you can't specify a task decomposition ahead of time, Planning allows the agent to decide dynamically what steps to take.
To learn more, I recommend:
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al. (2022)
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, Shen et al. (2023)
- Understanding the planning of LLM agents: A survey, by Huang et al. (2024)
[Original text: https://lnkd.in/gM2ZWNsW ]
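As a rough sketch of how software might act on a plan string like the illustrative one in the post, the code below parses each `{tool: ..., input: ..., output: ...}` step and runs the steps in order, passing intermediate results by name. Everything here is an assumption for illustration: the parser only handles this simple brace-and-comma format, the two tool functions are stubs that return strings rather than real models, and, as the post notes, HuggingGPT itself uses a different format.

```python
import re
from typing import Callable, Dict, List


def parse_plan(plan: str) -> List[Dict[str, str]]:
    """Split a plan string into one {key: value} dict per {...} step."""
    steps = []
    for body in re.findall(r"\{([^{}]*)\}", plan):
        step = {}
        for part in body.split(","):
            key, _, value = part.partition(":")
            step[key.strip()] = value.strip()
        steps.append(step)
    return steps


def detect_pose(image: str) -> str:
    """Stub tool (hypothetical): a real one would run a pose-detection model."""
    return f"pose({image})"


def pose_to_image(pose: str) -> str:
    """Stub tool (hypothetical): a real one would render an image from a pose."""
    return f"image_from({pose})"


TOOLS: Dict[str, Callable[[str], str]] = {
    "pose-detection": detect_pose,
    "pose-to-image": pose_to_image,
}


def execute_plan(steps: List[Dict[str, str]],
                 tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run the steps in order; named outputs become available as later inputs."""
    workspace: Dict[str, str] = {}
    for step in steps:
        tool = tools[step["tool"]]
        # If the input names an earlier output, use that value; otherwise pass it through.
        argument = workspace.get(step["input"], step["input"])
        workspace[step["output"]] = tool(argument)
    return workspace


PLAN = ("{tool: pose-detection, input: image.jpg, output: temp1 } "
        "{tool: pose-to-image, input: temp1, output: final.jpg}")
results = execute_plan(parse_plan(PLAN), TOOLS)
```

With these stubs, `results["final.jpg"]` holds the value produced by feeding the first tool's output into the second. A production system would also need to validate the plan, for example by rejecting unknown tool names or inputs that no earlier step produced.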

Autonomous Coding Agents, Instability at Stability AI, Mamba Mania, and more

deeplearning.ai

104 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
1w
Data preprocessing is critical for building effective RAG systems. Our new short course, Preprocessing Unstructured Data for LLM Applications, taught by Matt Robinson of unstructured.io, demonstrates important but sometimes overlooked aspects of RAG systems:
- Extracting and normalizing content from diverse formats like PDF, PowerPoint, and HTML to expand your LLM's knowledge
- Enriching data with metadata to enable more powerful retrieval and reasoning
- Applying document layout analysis and vision transforms to process embedded images and tables
Then you’ll apply all these skills and build a RAG bot that draws from a corpus that includes PDF, PowerPoint, and Markdown documents. Please sign up here: https://lnkd.in/g84D4dJg

125 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
1w
The Financial Times has a great article on Renate Nyborg's work on Meeno, written by Madhumita Murgia. The article is paywalled, but I appreciate the leadership of Renate (as well as Harvard's Ron Ivey) in speaking about the dangers of the AI fake girlfriend/boyfriend industry and the risk of this leading to greater loneliness. Renate says "Men didn’t want to meet girls because they had virtual girlfriends who said exactly what they wanted to hear." To regulators wondering which applications of AI are risky, I would urge taking a look at the fake gf/bf industry! In contrast, Meeno gives advice for human relationships, and is working to bring people together. Working to reduce human loneliness is a wonderful goal! https://lnkd.in/gsg2JCJP

The loneliness cure

ft.com

61 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
1w
The task-based analysis of how AI affects jobs is a powerful technique for creating business value. It was pioneered by Workhelix's Erik Brynjolfsson et al. Now, Workhelix has developed technology to apply this at scale, by automatically examining a company’s job descriptions, professional social data, and other information, to give CEOs and Boards a roadmap to creating value. AI Fund is thrilled to support Workhelix’s launch, coming Tuesday April 9th. To learn more, please join the conversation with Erik Brynjolfsson, Andrew McAfee, Daniel Rock, James Milin and me at the webinar below! https://lnkd.in/g8V7sJh5

46 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
2w
Tool use, in which an LLM is given functions it can request to call for gathering information, taking action, or manipulating data, is a key design pattern of AI agentic workflows. You may be familiar with LLM-based systems that can perform web search or execute code; some large consumer-facing LLMs already incorporate these features. But tool use goes well beyond this. If you prompt an online LLM-based chat system, “What is the best coffee maker according to reviewers?”, it might decide to use web search to gain context. Early on, LLM developers realized that relying only on a transformer to generate output tokens is limiting, and that giving an LLM a web search tool lets it do much more. With such a tool, an LLM is either fine-tuned or prompted (perhaps with few-shot prompting) to generate a special string like {tool: web-search, query: "coffee maker reviews"} to request calling a search engine. (The exact format of the string depends on the implementation.) A post-processing step then looks for strings like these, calls the web search function with the relevant parameters when it finds one, and passes the result back to the LLM as additional input context. Similarly, if you ask, “If I invest $100 at compound 7% interest for 12 years, what do I have at the end?”, the LLM might use a Python execution tool to compute 100 * (1+0.07)**12. The LLM might generate a string like this: {tool: python-interpreter, code: "100 * (1+0.07)**12"}. But tool use now goes much further. Developers are using functions to perform search (web, Wikipedia, arXiv, etc.), interface with productivity tools (email, calendar, etc.), generate or interpret images, and more. We can prompt an LLM using context that gives detailed descriptions of many functions plus information about their arguments. And we’d expect the LLM to automatically choose the right function to call to do a job. 
Before widespread availability of large multimodal models (LMMs) like LLaVA, GPT-4V, and Gemini, LLMs could not process images directly, so a lot of early work on tool use was carried out by the computer vision community. At that time, the only way for an LLM-based system to manipulate images was by calling a function to, say, carry out object recognition or some other function. Since then, tool use has exploded. GPT-4’s function calling, released last year, was a significant step toward general-purpose tool use. Now, more and more LLMs are being developed to similarly be facile with tool use.
To learn more, I recommend:
- Gorilla: Large Language Model Connected with Massive APIs, Patil et al. (2023)
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al. (2023)
- Efficient Tool Use with Chain-of-Abstraction Reasoning, Gao et al. (2024)
Both Tool Use and Reflection, which I posted about last week, are design patterns that I can get to work fairly reliably, and are well worth learning about. [Original: https://lnkd.in/gw-EwcTB ]
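The post-processing step described in this post (scan the LLM's output for a special tool-request string, call the matching function, and hand the result back) might look roughly like the sketch below. It assumes the illustrative `{tool: name, key: "value"}` format from the post, which, as noted there, depends on the implementation. The `web_search` function is a stub, and the "python-interpreter" tool is deliberately restricted to plain arithmetic via Python's `ast` module instead of executing arbitrary code; both are choices made for this example.

```python
import ast
import operator
import re
from typing import Callable, Dict, Optional

# Matches strings such as {tool: web-search, query: "coffee maker reviews"}.
TOOL_CALL = re.compile(r'\{tool:\s*([\w-]+),\s*(\w+):\s*"([^"]*)"\}')

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg,
}


def eval_arithmetic(code: str) -> float:
    """Evaluate a plain arithmetic expression without running arbitrary code."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(code, mode="eval"))


def web_search(query: str) -> str:
    """Stub (hypothetical): a real tool would call a search API."""
    return f"search results for '{query}'"


TOOLS: Dict[str, Callable[[str], object]] = {
    "web-search": web_search,
    "python-interpreter": eval_arithmetic,
}


def handle_llm_output(text: str, tools: Dict[str, Callable[[str], object]]) -> Optional[object]:
    """Return the tool result if the text contains a tool request, else None."""
    match = TOOL_CALL.search(text)
    if match is None:
        return None  # ordinary text: nothing to call
    tool_name, _arg_name, arg_value = match.groups()
    return tools[tool_name](arg_value)


llm_output = '{tool: python-interpreter, code: "100 * (1+0.07)**12"}'
balance = handle_llm_output(llm_output, TOOLS)
```

For the compound-interest question in the post, `balance` comes out to about 225.22, that is, $100 growing at 7% for 12 years. In a full loop, this result would be appended to the LLM's input context so it can write the final answer.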

Microsoft Absorbs Inflection, Nvidia's New GPUs, Managing AI Bio Risk, and more

deeplearning.ai

65 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
2w
Learn to carry out red teaming attacks against your own LLM-based applications to spot and patch vulnerabilities! In our new short course, “Red Teaming LLM Applications,” Matteo Dora & Luca Martial of LLM testing company Giskard teach how to simulate malicious actions to discover vulnerabilities, and improve security. We start with prompt injection, where you can trick an LLM into bypassing safeguards to reveal private information, or say something inappropriate. There is no one-size-fits-all approach to security, but this course will help you identify some scenarios to protect against. We believe having red teaming capabilities widely known will result in greater transparency and safer LLM-based systems. However, we ask you to use the skills you gain from this course ethically. Sign up here: https://lnkd.in/gtM-JRqt

68 Comments
Andrew Ng

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI
2w
I hope everyone in Taiwan 🇹🇼 is okay after the earthquake. My thoughts are with everyone affected. ❤️

25 Comments

1,550,609 followers
