New AI Readiness Report Reveals Insights into ML Lifecycle

Data quality problems must be addressed since they have a "significant downstream impact" on ML efforts, the report finds.

AI Business, Helen Hwang

July 6, 2022

Data quality is the biggest challenge faced by machine learning (ML) teams when acquiring training data, according to a recent survey of more than 1,300 practitioners in the field.

A third of respondents said they encounter data quality problems, followed by issues with collection, analysis, storage and versioning, according to the Zeitgeist: AI Readiness Report by Scale AI.

These problems must be addressed since they have a "significant downstream impact" on ML efforts, and teams often cannot model effectively without quality data, the survey said.

In the report, ML teams said they find it difficult to sort through data volume, complexity and scarcity. Unstructured data poses a particular challenge. Practitioners find that curating data for their models affects how quickly they can deploy their ML projects. Without high-quality data, teams cannot create robust models.

Variety, volume and noise

Factors contributing to data quality include variety, volume and noise.

In the survey, 37% of respondents said they find it difficult to obtain the data variety they need to improve model performance. Those working with unstructured data face the biggest challenge in getting that variety.

Since most data today is unstructured, ML teams must have a strategy for managing this data to enhance data quality.

ML teams working with unstructured data are more likely than those working with semi-structured or structured data to have too little data.

Most respondents report problems with their training data, with data noise as the largest headache (67%), followed by data bias (47%) and domain gaps (47%). Only 9% did not have such issues.

The report offers these five tips for data-centric AI development from Andrew Ng, co-founder of Google Brain:

Make labels consistent
Use consensus labeling to spot inconsistencies
Clarify labeling instructions
Toss out noisy examples (because more data is not always better)
Use error analysis to focus on a subset of data to improve
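
As an illustration of the consensus labeling tip, the sketch below takes labels from several annotators per item, picks the majority label, and flags items where agreement is low. It is a minimal example and not a method taken from the report: the function name `consensus_labels`, the `min_agreement` threshold of 0.75 and the sample data are all hypothetical.

```python
from collections import Counter


def consensus_labels(annotations, min_agreement=0.75):
    """Pick a majority label per item and flag low-agreement items.

    annotations: dict mapping an item id to the list of labels
        assigned by different annotators.
    min_agreement: hypothetical threshold; items whose majority
        share falls below it are flagged for review.
    """
    consensus = {}
    flagged = []
    for item_id, labels in annotations.items():
        if not labels:
            continue  # nothing to vote on
        # most_common(1) returns the label with the highest count
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        consensus[item_id] = top_label
        if agreement < min_agreement:
            flagged.append((item_id, agreement))
    return consensus, flagged


# Made-up example data: three annotators per image.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

consensus, flagged = consensus_labels(annotations)
for item_id, agreement in flagged:
    print(f"{item_id}: agreement {agreement:.2f}, needs review")
```

Flagged items could then feed the other tips: they may point to labeling instructions that need clarifying, or to noisy examples that are candidates for removal.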

Read the rest of this article on AI Business.

About the Author(s)

AI Business

AI Business, an ITPro Today sister site, is the leading content portal for artificial intelligence and its real-world applications. With its exclusive access to the global c-suite and the trendsetters of the technology world, it brings readers up-to-the-minute insights into how AI technologies are transforming the global economy - and societies - today.

Helen Hwang
