Software plays an instrumental role in both the production and the correction of quality data. In the first instance, software is used to automate the profiling, cleansing, conditioning and validation of derived data sets. In the second instance, and within limits, software is used to correct for deficiencies in the quality of a derived data set. This contributes to the recognition that software tools can be used to “fix” many types of data quality problems. It likewise contributes to a prejudice: namely, that data quality does not matter--or, more precisely, that it matters less than it actually does.
This is wishful thinking.
This article explores the problem of data quality from the perspective of the expert user: the data scientists, self-service analytic discoverers, ML/AI engineers, software engineers, etc. who rely on data to do their jobs. It considers the uses and abuses of software tools to correct for the data quality issues that these users are likely to encounter as they acquire, condition and use data in their work.
What Do We Mean by the Quality of Data?
The data quality requirements of data scientists and other experts differ from those of conventional consumers. In creating a derived data set, experts take pains to ensure that their data is statistically and scientifically relevant--that it comprises a sufficiently large (and representative) sample.
In his recent book Data: A Guide to Humans, my friend Phil Harvey describes five core aspects of what he calls “modern” data quality. Harvey distinguishes between the priorities of traditional data quality--which tend to focus on sales, product, customer, etc. data generated by core OLTP systems--and the more rigorous data quality regime that he believes is appropriate to experimental or scientific use.
These aspects are:
- Taxonomy: The “what” of the data set, its constitutive entities and relations. Quality problems include missing values, misspellings, inconsistent spellings and missing relations.
- Availability: The “when” of the data set; agreement and consistency in its temporal dimension. Common quality problems include inconsistent granularity (for example, data points are recorded at per-second and per-minute intervals for the same entity) and incomparable granularity (for example, data points are recorded at per-month and per-minute intervals for the same entity).
- Deviation: The “how” of quality. Do the data points in a data set make sense? Does something about them seem off? Are the individual values consistent with prior samples? With the expected range of values? Does the scale of the change recorded in the data set make sense?
- Ethics: The “why” of quality. This “why” has to do not only with the uses to which the data set will be put, but with its status as an accurate representation of the relevant domain. If the data is representative, is it likely to embody certain known biases? The ethical aspect also has to do with how or under what conditions the data set was created. It has to do with the sensitivity of the data (that is, with its “taxonomy,” to use our term). It asks the questions: Should this data be used in this way for this work? And, if so, what safeguards should we employ prior to using it?
- Probability: This has to do with the practical application of the data set. What actionable outcomes are likely to devolve from how the data set gets used? A probability calculus must take account of the dominant corporate culture: How comfortable is this culture with ambiguity? Does the culture (or its most influential decision makers) privilege deterministic answers? What does a “yes” answer mean to a decision maker in this culture?
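The first two aspects, taxonomy and availability, lend themselves to simple automated checks. The sketch below illustrates the idea in pure Python; the records, entity names and interval logic are hypothetical, and a real profiling tool would do far more:

```python
# Hypothetical sensor records: (entity, timestamp_seconds, value); None marks a missing value.
records = [
    ("pump-1", 0, 1.02),
    ("pump-1", 60, None),     # taxonomy problem: missing value
    ("pump-1", 120, 1.05),
    ("pump-1", 121, 1.90),    # availability problem: a 1-second gap amid 60-second gaps
]

def profile(records):
    """Flag missing values (taxonomy) and inconsistent intervals (availability)."""
    issues = []
    for entity, ts, value in records:
        if value is None:
            issues.append(("missing_value", entity, ts))
    gaps = [b[1] - a[1] for a, b in zip(records, records[1:])]
    if len(set(gaps)) > 1:
        issues.append(("inconsistent_granularity", sorted(set(gaps))))
    return issues

print(profile(records))
```

Deviation, ethics and probability resist this kind of mechanical check: they require context about prior samples, provenance and organizational culture.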
Each of the data quality aspects that Harvey describes is in play when machine learning (ML) engineers assemble a data set for their work. The better the data that a model is trained with, and the better the data it runs against in production, the more predictable its behavior and the more useful its output.
The first three aspects (taxonomy, availability, and deviation) have to do with the content and relevance of the data set itself. The ethical aspect governs how experts should do their work; the probabilistic aspect relates to what the experts’ employers should do with this work. The composite challenge is not only to feed analytic models quality data, but to understand and communicate the limits of this data (and of the models running against it) to the people who will use it, such as analysts, decision-makers and software engineers.
Software Is Instrumental in Producing and Correcting Quality Derived Data
The good news is that software can and does help with this.
Data scientists and other experts use software to produce and correct derived data sets. In the first instance, they employ ETL tools, cloud services, pre-built software libraries, code snippets and other software to acquire, condition and validate a data set prior to using it in their work. These tools are used to produce quality data. Concomitant with this, these experts have come to rely on pre-built models and other software tools to correct defects in the data sets they produce.
Models (or their equivalent functions) are now available as code snippets--or compiled into open-source libraries--that encapsulate common statistical and/or numerical techniques. The most important of these is sampling, which describes a broad category of methods used to correct or control for data quality problems, including data scarcity, biases endemic to a data set and time/resource constraints. The upshot is that experts reasonably expect to use sampling models--along with related tools (for example, models that generate synthetic data or interpolate missing data)--to “fix” data sets of variable quality.
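Interpolating missing data points is among the simplest of these corrective techniques. The following is a minimal sketch in pure Python, limited to linear interpolation and assuming the first and last values are present; library implementations offer many more methods:

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known neighbors.

    Assumes the first and last values in the list are present.
    """
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo = max(j for j in range(i) if out[j] is not None)
            hi = min(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out

print(interpolate_missing([1.0, None, None, 4.0]))  # -> [1.0, 2.0, 3.0, 4.0]
```

The convenience of such one-liners is precisely what feeds the prejudice discussed below: the fix is easy to apply, whether or not it is appropriate for the data at hand.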
The pervasiveness of data quality-oriented technologies--or, more precisely, their usefulness both in producing quality data sets and (albeit less successfully) in controlling for defects in their quality--contributes to the prejudice that software alone is sufficient to fix problems with data.
Software Can Correct for Certain Types of Common Problems with the Quality of Data
With mainstream uptake and use of ML technology, experts are now able to identify a large number of heretofore hidden problems, most of which also have a data quality dimension.
These include overfitting, which occurs when a model “learns” to optimize for unique features in a data set, rather than for the general parameters of a problem. So, for example, an ML model uses a training data set to extract useful features that permit it to make useful generalizations about a problem. In an extreme case of overfitting (see below), the model finds a shortcut in the data set: a “dead giveaway,” so to speak. However, because the model is too tightly fitted to a specific data set, it performs poorly with novel data.
ML specialist Claudia Perlich once told me about a cancer-screening model that had “learned” to correlate a single feature (FMRI images encoded in grayscale) with a positive cancer diagnosis. The model’s creators thought their work a smashing success--until they tested it on color MRI images. The model performed abysmally. This was because what it had actually “learned” was that FMRI scans of patients at a cancer-treatment center are highly likely to feature tumors.
Overfitting can also reveal endemic human biases in a data set. Imagine, for example, an employee-screening model that generates a list of “ideal” job applicants. Drilling down into this list, a human screener discovers that each applicant is of the same gender (male) and ethnicity (white). Are these really the best candidates? Possibly. More likely, the model, trained on 35-plus years of employment data, selected the two features it found to be the best predictors of successful employment. In other words, the training data set captured the operation of human prejudice at work in hiring decisions.
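The extreme case of overfitting described above can be caricatured in a few lines: a “model” that simply memorizes its training rows scores perfectly on the data it has seen and fails on anything novel. The features and labels here are invented for illustration:

```python
# Toy "memorizing" model: perfect on its training set, useless on novel inputs.
train = {(1.0, 2.0): "benign", (3.0, 4.0): "malignant"}

def memorizer(features):
    # Overfitting in the extreme: the "model" is just a lookup table of training rows.
    return train.get(features, "unknown")

train_accuracy = sum(memorizer(x) == y for x, y in train.items()) / len(train)
novel_prediction = memorizer((1.1, 2.1))  # a slightly different patient
print(train_accuracy, novel_prediction)   # 1.0 on training data, "unknown" on new data
```

Real overfitting is subtler than a lookup table, but the failure mode is the same: the model latches onto specifics of the training sample rather than the general shape of the problem.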
But Software Alone Is Not Sufficient
Software can help with most of these problems; however, software alone is not a sufficient replacement for rigor and care in selecting and conditioning a rich, representative training data set.
Today, software is used to automatically balance a data set or identify common sources of bias. The trouble is that bias is a slippery problem. In his novel Absalom, Absalom!, William Faulkner describes a character who comes to recognize his prior, lost innocence only retrospectively: No longer innocent, he is now aware that he had been innocent; he had to lose his innocence in order to recognize it. In a sense, then, to recognize bias in a data set is to recognize a previous state of innocence, and to alter one’s worldview accordingly. So, yes, software will play a major role in the identification of as-yet-unrecognized biases; however, it will fall to human know-how to analyze and formalize these biases and (not least) to devise the software remediations that identify and correct for them.
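Automated balancing itself is mechanically straightforward, which is part of the trap: resampling equalizes group counts without asking whether the groups, features or labels are the right ones. A naive oversampling sketch, using invented hiring records:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical hiring records skewed 90/10 toward one group.
rows = [{"group": "A"}] * 90 + [{"group": "B"}] * 10

def oversample(rows, key):
    """Naively rebalance by resampling minority groups up to the majority count."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(random.choices(g, k=target - len(g)))
    return balanced

balanced = oversample(rows, "group")
print(Counter(r["group"] for r in balanced))  # both groups now at 90
```

Note what the code cannot do: it cannot tell you that "group" was the wrong axis to balance on, or that the labels themselves encode a biased history. That judgment is the human contribution.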
And, yes, software for cross-validation and regularization techniques is regularly used to help detect and correct for overfitting. But all of these tools work best when they are used in conjunction with carefully curated data sets. After all, one way to correct for overfitting is to gather more data--that is, to assemble a richer, more diverse data set. The same holds for techniques such as cross-validation and regularization. It should go without saying that this is true of ML models, too: Feed your models rich, statistically relevant data, and--voila!--they will behave better--that is, more predictably--in practice.
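The mechanics of cross-validation are simple enough to sketch: partition the data into k folds, then train on k-1 folds and test on the held-out fold, rotating through all k. The version below is deliberately minimal (no shuffling, and it assumes k divides the sample count evenly):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Minimal sketch: no shuffling; assumes k divides n evenly.
    """
    fold = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test

splits = list(k_fold_splits(10, 5))
print(len(splits), splits[0])  # 5 folds; the first held-out fold is [0, 1]
```

If the underlying sample is unrepresentative, every fold inherits that defect; cross-validation measures consistency, not quality.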
Creating a Data Set Is Just the Beginning of the Data Quality Journey
The irony is that the data that a model runs against in production may incorporate some or all of the problems that the expert user was at such pains to correct for in training that model.
This is a function of poor upstream data quality, which tends to be a very hard problem to fix, inasmuch as it implicates people; business applications and their workflows; reusable data integration jobs/data flows; and core business/IT processes. (The topic is one that merits a lengthy, separate treatment.)
In production, then, upstream data usually undergoes conditioning prior to being processed through a model. That is, to deploy the model is also to deploy the logic that the data scientist or ML engineer used to condition data for that model. In addition, the production model should ideally incorporate some kind of activity record that experts can use to monitor its quality and performance.
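One way to honor both requirements is to package the conditioning logic with the model and have it emit an activity record as a side effect. The sketch below is a hypothetical illustration; the conditioning rules, the stand-in model and the log structure are all invented:

```python
import statistics

activity_log = []  # hypothetical activity record for monitoring data quality in production

def condition(raw):
    """Reapply the conditioning used at training time: drop Nones, clip outliers to [0, 100]."""
    cleaned = [min(max(x, 0.0), 100.0) for x in raw if x is not None]
    activity_log.append({"received": len(raw), "kept": len(cleaned)})
    return cleaned

def model(features):
    # Stand-in for the deployed model: here, just the mean of the conditioned inputs.
    return statistics.mean(features)

score = model(condition([12.0, None, 250.0, 8.0]))
print(score, activity_log)
```

A spike in the gap between "received" and "kept" counts is an early warning that upstream data quality has drifted, before the model's outputs visibly degrade.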
The takeaway? Preparing a data set and training a model is just the start of the data quality journey.