Customer Data Quality: Beyond the Bucket and Broom
During 2001 and 2002, leading enterprises began managing and leveraging information as tangible assets. Over the next two years, evolving accounting and auditing principles will engender accepted information valuations and tags for purposes of origination, privacy, quality, usage, and so forth. By 2005, these changes will drive online intellectual capital marketplaces and information service "banks."
CRM implementations continue to be limited by data integration and quality issues, which are more often stumbled upon during development than anticipated or planned for. Data-related issues are so pervasive that data quality ranks in the top 35 percent of project management concerns, and data integration is the top architectural challenge noted by users in META Group's latest data warehouse and analytics industry study. Still, most data quality implementations fail to leverage the variety of solutions available to achieve competitive advantage.
Indeed, attention to data quality is considered one of IT's many "necessary evils," and being diligent about data quality can seem like janitorial work far removed from generating business value. But the truth is contrary to this notion. Incremental improvements, particularly with customer data, lead to significant business performance gains in each phase of the customer lifecycle (engage, transact, fulfill, service) by improving prospecting, transaction and fulfillment accuracy, personalization, and customer satisfaction. Just as data warehousing and/or analytic solutions have become a common denominator for enhanced business performance, data quality practices, techniques, or technologies will become embedded into 95 percent of all CRM and e-business initiatives in the near future. By 2003, leading packaged operational, analytical, and information management solutions, such as middleware, information logistics, and metadata, will evolve to embody a variety of data quality capabilities through company acquisitions and partnerships. Following in the wake of these mergers (2004-2005) will be a noticeable (and likely off-shore) black market for private household and business information.
While data quality needs are typically believed to concern only data accuracy, myriad other types of data quality issues persist and demand their own distinct solutions. Enterprises typically require data quality solutions that fall into more than one of four patterns (validation, standardization, correction, and enrichment); similarly, vendors offer a mix of capabilities within each pattern (Figure 1).

Figure 1 — Data Quality Solution Providers
Data Quality Patterns
Data Validation
At any point in an organization's information supply chain, data is subject to injected correctness, completeness, and integrity errors. Particularly during the course of a manually entered transaction, data should be parsed, matched, and confirmed against an authoritative source: either an internal master database or an information content provider. Parsing identifies tokens such as surname or postal code, matching performs a lookup against an existing source, and confirmation completes the validation process by applying business rules or templates that indicate the data's degree of fitness to continue flowing through the information supply chain.
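The parse, match, and confirm steps can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: the comma-separated record layout, the MASTER_POSTAL_CODES table, and the business rules are all assumptions made for the example.

```python
import re

# Hypothetical authoritative source: postal code -> city. It stands in for
# an internal master database or an information content provider's file.
MASTER_POSTAL_CODES = {"06902": "Stamford", "10001": "New York"}

def parse(raw):
    """Split an assumed 'Surname, City, PostalCode' string into named tokens."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 3:
        return None
    return {"surname": parts[0], "city": parts[1], "postal_code": parts[2]}

def match(tokens):
    """Look up the postal code in the authoritative source."""
    return MASTER_POSTAL_CODES.get(tokens["postal_code"])

def confirm(tokens, master_city):
    """Apply illustrative business rules and report the record's fitness."""
    if not re.fullmatch(r"\d{5}", tokens["postal_code"]):
        return "reject: malformed postal code"
    if master_city is None:
        return "reject: unknown postal code"
    if master_city.lower() != tokens["city"].lower():
        return "review: city does not match postal code"
    return "accept"

def validate(raw):
    """Run parse, match, and confirm in sequence on one raw record."""
    tokens = parse(raw)
    if tokens is None:
        return "reject: unparseable record"
    return confirm(tokens, match(tokens))
```

Under these assumptions, validate("Smith, Stamford, 06902") returns "accept", while a record whose city disagrees with its postal code is routed to review instead of being silently passed along.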
Data Standardization
The ever-increasing variety of data sources flowing into organizations drives the need for robust functions that transform validated data into enterprise-accepted and application-digestible formats. Even two XML documents adhering to the same document type definition (DTD) may differ in scale, precision, or format. During the standardization process, tokens are rearranged, reformatted, and/or integrated into defined templates. For example, all address data might be converted into a four-line format.
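As a rough sketch of the four-line address example, the function below rearranges already-parsed tokens into a fixed template. The token names and the choice of lines (name, street, unit, city/region/postal code) are assumptions for illustration; real templates are defined by each enterprise or postal authority.

```python
def _clean(text, upper=False):
    """Collapse runs of whitespace and normalize letter case."""
    collapsed = " ".join(text.split())
    return collapsed.upper() if upper else collapsed.title()

def to_four_line_address(tokens):
    """Arrange parsed address tokens into an assumed four-line template."""
    name = _clean(tokens.get("name", ""))
    street = _clean(tokens.get("street", ""))
    unit = _clean(tokens.get("unit", ""))
    city = _clean(tokens.get("city", ""))
    region = _clean(tokens.get("region", ""), upper=True)
    postal = _clean(tokens.get("postal_code", ""), upper=True)
    # Always emit exactly four lines, even when the unit line is empty,
    # so downstream applications can rely on a fixed layout.
    return [name, street, unit, f"{city}, {region} {postal}"]
```

The title-casing here is deliberately naive and would mishandle some surnames and street names; a production standardizer would rely on reference tables rather than string methods.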
Data Correction
A second offshoot of the validation process involves repairing data that is determined to be wrong, such as misspelled, transposed, out-of-date, or otherwise inaccurate information. To fix this corporate "flat tire," enterprises must often select between two distinct methods of data correction: heuristic, which applies an intelligent repair process; and lookup, which replaces values with ones believed to be more correct based on established "survivorship" rules.
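The two correction methods can be contrasted in a short Python sketch. Fuzzy string matching from the standard library's difflib stands in here for a heuristic repair process, and a source-ranking rule stands in for survivorship; the KNOWN_CITIES list, the SOURCE_RANK ordering, and the record fields are hypothetical.

```python
import difflib

# Hypothetical reference list of valid values for one field.
KNOWN_CITIES = ["Stamford", "Hartford", "New Haven"]

def heuristic_correct(value, candidates=KNOWN_CITIES, cutoff=0.8):
    """Repair a likely misspelling by fuzzy-matching against known values."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    # Leave the value untouched when nothing is similar enough.
    return matches[0] if matches else value

# Hypothetical survivorship rule: prefer the most trusted source (lowest
# rank) and break ties in favor of the most recently updated version.
SOURCE_RANK = {"billing": 0, "crm": 1, "web_form": 2}

def lookup_correct(versions):
    """Pick the surviving value among competing versions of one field."""
    best = min(versions, key=lambda v: (SOURCE_RANK[v["source"]], -v["updated"]))
    return best["value"]
```

With these assumptions, heuristic_correct("Stamfrod") yields "Stamford", and lookup_correct keeps the billing system's address even when a web form entry is newer. Whether trust or recency should win is a business decision that the survivorship rules encode.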
Data Enrichment
For both users and vendors, advancing into the seemingly infinite data quality frontier requires the extension and expansion of existing data. More and more, enterprises are looking to their business partners, industry organizations, and information content providers to enrich their stores of customer data. Data enrichment through extension most often takes the form of list generation (e.g., "households with characteristics akin to our most profitable customers," or general demographic/spatial/census data), while enrichment via expansion includes everything from completing missing data, to tacking on syndicated geographic, household, or postal fields (for example, using barcodes for discounted mailing).
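Enrichment via expansion can be sketched as a merge that fills gaps without overwriting what the enterprise already holds. The SYNDICATED table below is a made-up stand-in for a content provider's geographic file, and keying it on postal code is an assumption for the example.

```python
# Hypothetical syndicated reference data keyed by postal code.
SYNDICATED = {"06902": {"county": "Fairfield", "region": "CT"}}

def enrich(record, reference=SYNDICATED):
    """Return a copy of the record with missing fields filled from reference data."""
    extra = reference.get(record.get("postal_code"), {})
    enriched = dict(record)  # never mutate the caller's record
    for field, value in extra.items():
        # Only complete fields that are absent or empty; existing values win.
        if not enriched.get(field):
            enriched[field] = value
    return enriched
```

Preferring existing values is one possible policy; an enterprise that trusts the syndicated source more than its own entries would route conflicts through its correction rules instead.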
Users planning data quality solutions must not only consider which data quality pattern(s) they need to apply and where, but also determine the overall characteristics of these solutions.
Prominent enterprise-class customer data quality solutions for one or more patterns include those from Firstlogic, Group 1, Trillium Software, and Vality, while those targeting mid-tier, departmental, or vertical solutions include Arkidata, Data Mentors, DataFlux/SAS, and Sagent. Through 2002 and 2003, users should expect vastly improved partnering (including M&A) by data quality providers with other information supply chain component vendors (e.g., ETLM, EAI, data profiling, business intelligence/analytic applications, data mining, DBMS), along with an array of data quality-related ASP offerings and information content providers (ICPs) serving up a limitless palette of certified, privatized, aggregated, industry-specific, benchmark, and unstructured information.
Data Quality Solution Characteristics and Concerns
Latency
E-business applications, for example, generally require real-time execution across all four data quality patterns, whereas sales/marketing applications may find batch execution more cost-effective.
Customer class
Business data quality solutions in B2B applications are often entirely distinct from consumer/household (B2C) solutions.
Globalization
Enterprises doing business outside the United States or North America must consider data quality solutions that can expressly handle other countries' name/address idiosyncrasies and provide access to international postal files.
Auditing
Each of the four data quality patterns involves a process that may require auditing of the logic used to make validity determinations and record modifications.
Platform Support
Some premier data quality solutions still require flat file input (i.e., no DBMS support), which may be prohibitive in time or cost or may require staging data to a supported computing environment.
Tool Integration
Some Extract-Transform-Load (ETL) vendors offer hooks into one or more data quality products.
Application Integration
Many data quality products are offered in the form of one or more types of API or object (e.g., COM+, EJB, CORBA, Java/JNI) to interoperate with production systems rather than be executed in a standalone manner.
Synchronization and Redundancy
Enterprises must decide how to manage the flow of standardized, corrected, and enriched data to each place it may be replicated, and how to eliminate unnecessary duplicates.
Privacy
With the power of some data enrichment solutions, enterprises are increasingly finding that they can triangulate to derive customer information that has not been explicitly offered by the customer.
Business Impact and the Bottom Line
Formal data quality practices are no longer optional for maintaining sufficient levels of business performance and managing operational costs.
Data quality solutions have moved beyond simple cleansing functions to address the pressing need to enrich enterprise information assets. IT organizations are ill-advised to hand-code data quality solutions or to use tools not explicitly suited for the purpose (e.g., ETLM, EAI, DBMS triggers), and should plan accordingly for selecting and integrating specific technologies to handle each type of data quality need.

