Where do you put data quality?

It’s all very well to say that you should start doing data quality. But in your mental (or physical or digital) picture of your company and its IT, where does data quality actually go?

There are three main places you can put data quality:

The point of creation (gatekeeper)
The point of acquisition (restorative)
In the data retention process (continuous)

Each of these has different pluses and minuses and its own set of technical challenges.

A quick note: this article assumes that part of the data quality process is to correct and improve data. If you’re doing data quality monitoring only, the distinctions discussed here are much less relevant.

Point-of-creation data quality

Data quality at the point of creation acts as a gatekeeper.

Gatekeeper data quality performs data quality checks before the data is even created in the database. If the data fails any of the checks, the data must be corrected before it can be committed.

As a result, this kind of data quality is typically confined to situations where data is manually entered by humans. It tends to focus primarily on validation and standardization, relying on structural rules and reference databases as the source of truth, and takes place largely at the granular record level.

Although gatekeeper data quality is probably the type of data quality most people encounter day-to-day, it shouldn’t be relied on as the cornerstone of an overall data quality program.

At a technical level, gatekeeper data quality operates at a small scale and cannot address vital aspects of data quality like “accessibility” or “relevance.” It also requires a company to control the portal where the data originates, which isn’t always possible.

Operationally, gatekeeper data quality can have a negative impact in several ways. Overly strict configuration can produce situations that are, from the customer’s point of view, silly, pointless, and/or wrong. For example:

Rejecting a phone number entered with hyphens because the accepted format has none.
Rejecting a Gmail address with a “+” sign in the username because the plus sign is treated as an invalid character for emails.
Rejecting a city name because it has a space at the end.

Your author has personally encountered all of these, and they are very annoying.
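
One way to avoid these rejections is to normalize input before validating it. The following is a minimal sketch, not a production validator: the function names, the 10-digit phone format, and the deliberately loose email pattern are all assumptions made for illustration.

```python
import re


def check_phone(raw):
    """Return (is_valid, normalized) for an assumed 10-digit phone format.

    Spaces, hyphens, dots, and parentheses are stripped before the check,
    so "555-867-5309" is accepted rather than rejected.
    """
    digits = re.sub(r"[\s\-.()]", "", raw)
    return (re.fullmatch(r"\d{10}", digits) is not None, digits)


def check_email(raw):
    """Loose structural check that permits '+' in the username."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", raw.strip()) is not None


def check_city(raw):
    """Trim surrounding whitespace instead of rejecting it."""
    city = raw.strip()
    return (len(city) > 0, city)


print(check_phone("555-867-5309"))
print(check_email("jane+news@gmail.com"))
print(check_city("Springfield "))
```

The gatekeeper still blocks input that is structurally wrong, but it quietly fixes formatting differences that the user should never have to care about.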

Even if the data is entered by your employees, not by an actual customer, gatekeeper data quality interrupts their operational flow and competes against other pressures.

For instance, if your employees’ job performance is evaluated on their speed and customer satisfaction surveys, they are incentivized to achieve those things at the expense of high data quality, by finding ways to submit any data that will pass the gatekeeper as quickly as possible.

The resulting data is valid in the most literal sense but has a much looser relationship with concepts like “factual correctness,” defeating the purpose of collecting the data in the first place.

A prime example of this phenomenon is a retail store requiring a customer phone number to check out. When customers decline to answer, cashiers are extremely likely to enter a standard, valid placeholder like the store’s own phone number and move on. The store’s number is technically valid, but is useless for any operational purpose.
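
Placeholders like this pass every gatekeeper check, but they tend to stand out once the data is viewed in aggregate, because one value appears on far more records than any real phone number would. As a rough sketch (the function name and both thresholds are hypothetical and would need tuning against real data), a frequency check might look like this:

```python
from collections import Counter


def suspected_placeholders(phones, max_share=0.05, min_count=3):
    """Return phone numbers that appear on a suspiciously large share of records.

    max_share and min_count are illustrative thresholds, not recommendations.
    """
    counts = Counter(phones)
    total = len(phones)
    return {
        phone
        for phone, n in counts.items()
        if n >= min_count and n / total > max_share
    }


# Hypothetical data: the store's own number entered five times,
# plus twenty distinct customer numbers.
phones = ["5125550100"] * 5 + [f"51255501{i:02d}" for i in range(1, 21)]
print(suspected_placeholders(phones))
```

A check like this cannot run at the point of creation, since it needs to see many records at once.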

Point-of-acquisition data quality

Point-of-acquisition data quality acts as a restorative.

As that suggests, restorative data quality is applied after the data has been acquired by the company; it detects errors and attempts to remediate them automatically.

Restorative data quality is more flexible than gatekeeper DQ in that:

You do not need control over the data entry portal.
The data can be generated automatically, not just entered manually.
It can operate at the record level and the entity level.
End users are not disrupted or required to make immediate changes manually.
It is usable when data arrives in bulk.

Obviously, this means that restorative data quality also requires a more advanced technical apparatus than gatekeeper data quality.

Simple facets of restorative data quality, such as resolving the phone number format and extra space examples described in gatekeeper data quality, can be handled by reasonably basic scripts or automations.
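
As an illustration of how basic such a script can be, here is a minimal sketch that cleans a batch of records after they arrive. The record layout (dictionaries with "city" and "phone" keys) and the 10-digit phone format are assumptions for the example.

```python
import re


def clean_record(record):
    """Return a cleaned copy of one record plus a list of fields that could not be fixed."""
    cleaned = dict(record)
    issues = []

    city = cleaned.get("city")
    if isinstance(city, str):
        # Remove leading and trailing whitespace.
        cleaned["city"] = city.strip()

    phone = cleaned.get("phone")
    if isinstance(phone, str):
        # Drop everything that is not a digit, then check the assumed length.
        digits = re.sub(r"\D", "", phone)
        if len(digits) == 10:
            cleaned["phone"] = digits
        else:
            # Leave the value untouched and flag it for later review.
            issues.append("phone")

    return cleaned, issues


def clean_batch(records):
    """Apply clean_record to every record in a bulk load."""
    return [clean_record(r) for r in records]


batch = [
    {"city": "Austin ", "phone": "512-555-0100"},
    {"city": "Reno", "phone": "12345"},
]
for cleaned, issues in clean_batch(batch):
    print(cleaned, issues)
```

Unlike a gatekeeper, this script never blocks anyone: fixable values are repaired silently, and unfixable ones are flagged rather than rejected.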

Some more advanced aspects of restorative data quality can be handled by external validation providers: address validation services, Dun & Bradstreet (D&B) lookups, identity validation services, etc. These are often reasonably accessible to small businesses or business units.

The most in-depth restorative options include performing entity resolution on the new data against the existing database entities. This generally requires a purpose-built entity resolution software product or code.
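
To give a feel for what matching new data against existing entities involves, here is a toy sketch built on Python’s standard difflib module. It is far simpler than a purpose-built entity resolution product: the record fields, the equal weighting of name and address, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve(new_record, existing, threshold=0.85):
    """Return the id of the best-matching existing entity, or None if nothing is close enough."""
    best_id, best_score = None, 0.0
    for entity_id, entity in existing.items():
        # Average the name and address similarity (an assumed, naive weighting).
        score = (
            similarity(new_record["name"], entity["name"])
            + similarity(new_record["address"], entity["address"])
        ) / 2
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


existing = {
    1: {"name": "Acme Corp", "address": "12 Main St"},
    2: {"name": "Globex Inc", "address": "99 Oak Ave"},
}
print(resolve({"name": "ACME Corp.", "address": "12 Main St"}, existing))
print(resolve({"name": "Initech", "address": "500 Elm Rd"}, existing))
```

Comparing every new record against every existing entity does not scale to large databases, which is one reason dedicated software is generally needed.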

During-retention data quality

During-retention data quality acts continuously on the data: first when it’s acquired, and then on an ongoing basis while it’s stored in the database.

Unlike gatekeeper and restorative data quality, continuous data quality affects all the data at once, and may touch the same data more than once over time.

Continuous data quality makes high-level decisions about entity resolution, entity deduplication, and resolving data conflicts. Its defining feature is that it makes these decisions repeatedly over time, leveraging the complete history of the available data.

Meta-factors like data recency, the trustworthiness of a source, and the frequency with which a value occurs can all be included in continuous data quality. This makes it especially valuable for datasets where data about an entity can be expected to change over time.

For example, consider a customer database that contains an address. Continuous data quality could be configured to prioritize information that comes from the web portal the customer can use to change their own address information.

When a customer moves and changes their own address, continuous data quality can prioritize this new, customer-provided address, and even return the new address as a correction.
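
A rule like this can be sketched as a ranking over every address ever observed for a customer. In the example below, the source names, their trust scores, and the choice to rank trust ahead of recency are all hypothetical; a real configuration would depend on the organization’s own sources.

```python
from datetime import date

# Hypothetical trust ranking: a higher number means a more trusted source.
SOURCE_TRUST = {
    "customer_portal": 3,
    "call_center": 2,
    "third_party_list": 1,
}


def pick_address(observations):
    """Choose the surviving address from the full history of observations.

    Observations are ranked by source trust first and by recency second.
    Unknown sources get a trust score of 0.
    """
    best = max(
        observations,
        key=lambda o: (SOURCE_TRUST.get(o["source"], 0), o["observed_on"]),
    )
    return best["value"]


history = [
    {"value": "12 Main St", "source": "third_party_list", "observed_on": date(2024, 6, 1)},
    {"value": "48 Oak Ave", "source": "customer_portal", "observed_on": date(2024, 3, 15)},
    {"value": "7 Elm Rd", "source": "customer_portal", "observed_on": date(2023, 1, 10)},
]
print(pick_address(history))
```

Here the most recent portal entry wins even though the third-party list was loaded later, because the customer’s own update is treated as more trustworthy. Rerunning the function whenever a new observation arrives is what makes the process continuous.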

Continuous data quality is also particularly effective for data quality issues that can arise even when none of the data is, strictly speaking, incorrect. For example, if someone uses different forms of their name at different times (John P Sousa vs J Philip Sousa), continuous entity resolution can bring the records with different names together.

The data to make these decisions, particularly about data like addresses that can be expected to undergo changes, isn’t always available at the entry point for the data, where gatekeeper and restorative data quality take place. In this situation, only continuous data quality can make these types of connections and corrections possible.
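
The name example above hints at the kind of matching logic involved. The sketch below treats an initial as compatible with any full name that starts with the same letter. It is a simplified, hypothetical rule: compatible names are only a candidate match, and a real system would combine this with other evidence, such as a shared address, before merging records.

```python
def tokens(name):
    """Split a name into lowercase tokens, dropping trailing periods on initials."""
    return [t.strip(".").lower() for t in name.split()]


def token_compatible(a, b):
    """True if two tokens are equal or one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def names_compatible(name_a, name_b):
    """True if the names have the same number of tokens and each pair is compatible."""
    ta, tb = tokens(name_a), tokens(name_b)
    if len(ta) != len(tb):
        return False
    return all(token_compatible(x, y) for x, y in zip(ta, tb))


print(names_compatible("John P Sousa", "J Philip Sousa"))
print(names_compatible("John P Sousa", "Jane P Sousa"))
```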

To be effective, continuous data quality requires sophisticated technical abilities, including:

Scenario-based match and merge
Extensive fuzzy match abilities
Relationship discovery
Enough performance to run continuous data quality alongside your operational workload

Summary

There are three places you can implement data quality in your company:

At the point of creation, as a one-time gatekeeper action
At the point of acquisition, as a one-time restorative action
While the data is being stored, as an ongoing continuous action

While gatekeeper data quality is the easiest to implement in a technical sense, it’s also the least comprehensive, and can only be applied when data is submitted as single records. Because it requires errors to be corrected manually, it can easily cause subpar experiences for both customers and in-house end users, making it extremely vulnerable to workarounds.

Restorative and continuous data quality are more comprehensive than gatekeeper and less disruptive to the end user workflow, since they attempt to correct errors automatically.

Restorative data quality, while not as comprehensive as continuous data quality, has several advantages over gatekeeper data quality. Specifically, it can be applied to large amounts of data at once, is less disruptive to end users, and can to some degree act at the entity level.

Small businesses and business units can use third-party validation and standardization services to handle some aspects of this type of data quality, which makes it more accessible than continuous data quality.

While continuous data quality is the gold standard, it also has the most complex technical requirements and, at the time of this writing, generally requires a dedicated data quality solution.

As a result, continuous data quality may be out of reach for smaller organizations or business units that have limited resources for data quality. However, for large enterprises or companies with extremely tightly regulated and/or sensitive data, continuous data quality is the most appropriate choice.