Andrew Ng’s Post

Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI

How would you define Data-centric AI development? It'd be useful for all of us collectively to write a sharable definition. Here's my attempt: Data-centric AI is the practice of systematically engineering the data used to build AI systems. What do you think/any suggestions? Please also 👍 anyone else's suggestions that you see and like!

Ian🤗 Yu

Pragmatic ML Engineer in the Age of Generative AI

2y

Rather than only a definition, I propose we also lay out what's possible, especially in connection with industry. Many data scientists I talk to don't see the value of a data-centric approach because it reads like data cleaning and data augmentation, which is just standard, good-quality data science practice. But it's not! There's a lot of conversation that we could have around:
- Taxonomy of data
- Statistical data quality control
- Data quality control at scale
- Various augmentation considerations...
We have collected some resources that show why there's much more to data-centric AI here: Data-centric Natural Language Processing https://ai.science/l/89be6ae3-db92-4a4e-a93e-2aac0551a9a9

Line Clemmensen

Professor ML/AI/Data driven innovation | Co-founder

2y

First, I am very excited to see you, and others, focus on DCAI, and I am certain it will spin off extremely valuable contributions to the AI field. I like your original definition, but would like to add something on evaluation, mostly because I think there is also a need to develop new ways of evaluating and measuring performance as we embark on this journey, focusing on representative and high-quality data. Something along the lines of: Data-centric AI is the practice of systematically engineering and evaluating the data used to build AI systems.

David R. Winer

Co-founder of Cerbrec | PhD in Computer Science | Techstars Boston '24 | AI + ML Engineer | Co-creator of Graphbook

2y

Data-centric AI development is a stance that data is part of the configuration of the model, just like the model architecture. What do we do when customers are in charge of their own data, e.g., as part of a self-service dialogue system? Can the system be data-centric without enforcing a systematic practice by the user?

🌏 Gregorio Ferreira 🚀

Principal MLOps Engineer / Head of AI/ML CoE Unit | AI for Society

2y

Data-centric AI is the ability to design data acquisition systems that directly feed AI-based processes that enrich (or support) automation and information systems. A company whose management focuses its investments on data acquisition instead of new or more models would become data-centric and data-driven. What do you think, Victor Jaramillo, PhD?

Peter Cotton

Chief Data Scientist at ExodusPoint Capital Management, LP

2y

I fear that the data science terminology industry is becoming like the fund-of-funds industry. Pretty soon there will be more phrases describing combinations of existing concepts, and combinations of combinations, than there are underlying concepts themselves.

Andrew Elmsley

Head of Engineering @ Invert | PhD, AI, Software Engineering

2y

"systematically engineering the data" could sound like cherry-picking examples. How about: Data-centric AI is the practice of systematically interrogating the data used to build AI systems. It aims to ensure that the data is clean, correct, fair and unbiased.

Adam Richardson

Assistant Professor of Cybersecurity at Lansing Community College

2y

Here's my attempt: Data-centric AI - the prioritization of data engineering principles throughout the process of building AI systems.

Gino Tesei

Leadership | Generative AI | Healthcare | Pharma | Big Tech | Consulting | Search | Precision Medicine | Cloud Computing

2y

AI relies on the agile practice of iterating on models and data to build AI systems. We have two extremes. At one extreme, model-centric AI is the practice of engineering better models as the focus of an iteration. At the other extreme, we have data-centric AI, which is the practice of engineering better data as the focus of an iteration. They are not mutually exclusive but, on the contrary, complementary. The problem is not which one is right for a given organization but rather what the right mix is.

Dr. Madalasa Venkataraman

Snr Director, Data Science/Data Engineering Unity, CX Marketing, Oracle

2y

Interesting question. The way I see it, there are three steps.
The first is the constant scan for new data that adds value to the model. Any data science model, however well built, needs to constantly innovate with new data because this brings innovative insights. This ‘new data source’ may not be perfect.
The second is strengthening the data collection for existing data sources that are utilised in the model. A lot of answers here address this. This gives more consistent model outcomes for the business.
The third is controlling the data that goes into productionizing the model. This is where data quality control and data drift come in, with lots of help from MLOps. This gives fewer headaches in production pipelines but does not keep up with the data trends (which are essentially user trends captured in data): noise in data is not ‘just noise in data’ but changes in user behaviour and trends.
While (1), new data source identification, is done through academics, (2) is the purview of our software systems and (3) is the MLOps and data scientist domain. Data science needs to encompass and influence all three.

Boris Korolkov

EMEA OEM Sales Leader, Data&AI, IBM Technology

2y

I can hardly imagine AI that is not data-centric. And I agree with Mateus: data quality (and the overall DataOps approach) should be considered one of the key elements of data-centric AI, in addition to systematically engineering data.
