How would you define Data-centric AI development? It'd be useful for all of us collectively to write a sharable definition. Here's my attempt: Data-centric AI is the practice of systematically engineering the data used to build AI systems. What do you think/any suggestions? Please also 👍 anyone else's suggestions that you see and like!
First, I am very excited to see you, and others, focus on DCAI, and I am certain it will spin off extremely valuable contributions to the AI field. I like your original definition, but would like to add something on evaluation, mostly because I think there is also a need to develop new ways of evaluating and measuring performance as we embark on this journey, focusing on representative and high-quality data. Something along the lines of: Data-centric AI is the practice of systematically engineering and evaluating the data used to build AI systems.
Data-centric AI development is a stance that data is part of the configuration of the model, just like the model architecture. What do we do when customers are in charge of their own data, e.g. as part of a self-service dialogue system? Can the system be data-centric without enforcing a systematic practice by the user?
Data-centric AI is the ability to design data acquisition systems that directly feed AI-based processes that enrich (or support) automation and information systems. A company whose management focuses its investments on data acquisition instead of new or more models would become data-centric and data-driven. What do you think, Victor Jaramillo, PhD?
I fear that the data science terminology industry is becoming like the fund-of-funds industry. Pretty soon there will be more phrases describing combinations of existing concepts, and combinations of combinations, than there are underlying concepts themselves.
"systematically engineering the data" could sound like cherry-picking examples. How about: Data-centric AI is the practice of systematically interrogating the data used to build AI systems. It aims to ensure that the data is clean, correct, fair and unbiased.
Here's my attempt: Data-centric AI - the prioritization of data engineering principles throughout the process of building AI systems.
AI relies on the agile practice of iterating on models and data to build AI systems. There are two extremes. At one extreme, model-centric AI is the practice of engineering better models as the focus of an iteration. At the other, data-centric AI is the practice of engineering better data as the focus of an iteration. They are not mutually exclusive; on the contrary, they are complementary. The question is not which one is right for a given organization but rather what the right mix is.
Interesting question. The way I see it, there are three steps. The first is the constant scan for new data that adds value to the model. Any data science model, however well built, needs to keep innovating with new data, because new data brings new insights. This "new data source" may not be perfect. The second is strengthening the data collection for existing data sources that are used in the model. A lot of answers here address this. It gives more consistent model outcomes for the business. The third is controlling the data that goes into productionizing the model. This is where data quality control, data drift monitoring, and a lot of help from MLOps come in. It means fewer headaches in production pipelines, but it does not keep up with the data trends (which are essentially user trends captured in data): noise in data is not "just noise in data" but changes in user behaviour and trends. While (1), new data source identification, is done through academia, (2) is the purview of our software systems, and (3) is the MLOps and data scientist domain. Data science needs to encompass and influence all three.
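To make the third step a bit more concrete, here is a minimal sketch of one way a data drift check could look. It assumes a single numeric feature and uses the population stability index (PSI) as the drift measure. The function name, the bin count, and the sample data are illustrative assumptions, not something proposed in the comment above.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,      # assumed default; tune per feature
    eps: float = 1e-6,   # floor so empty bins do not cause log(0)
) -> float:
    """Compare two non-empty samples of one numeric feature.

    Returns 0.0 for identical binned distributions; larger means more drift.
    """
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to width 1.0
    # when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Hypothetical data: training-time values vs. values seen in production.
training_values = [i / 100 for i in range(100)]
production_values = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(training_values, training_values)
psi_shifted = population_stability_index(training_values, production_values)
```

Any alert threshold on the PSI value would be a per-project choice. As the comment notes, a high score may reflect a real change in user behaviour rather than bad data, so a check like this is a prompt to look at the data, not an automatic filter.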
I can hardly imagine AI which is not data-centric. And I agree with Mateus: data quality (and the overall DataOps approach) should be considered one of the key elements of data-centric AI, in addition to systematically engineering data.
Rather than only a definition, I propose we also establish what's possible, especially in connection with industry. Many data scientists I talk to don't see the value of a data-centric approach because it reads like cleaning data and data augmentation, which is just standard, quality data science practice. But it's not! There's a lot of conversation that we could have around: taxonomy of data; statistical data quality control; data quality control at scale; various augmentation considerations... We have collected some resources that explain why there's much more to data-centric AI here: Data-centric Natural Language Processing https://ai.science/l/89be6ae3-db92-4a4e-a93e-2aac0551a9a9