Streaming data quality — How to implement column profiling using Kafka Streams? — Data Quality Series


There is no doubt that poor quality data has an impact on business outcomes. “Getting in front on data quality presents a terrific opportunity to improve business performance”, writes Thomas C. Redman in the article “Seizing Opportunity in Data Quality”, published in MIT Sloan Management Review.

The cost of bad data is an astonishing 15% to 25% of revenue for most companies, and two-thirds of these costs can be eliminated by getting in front on data quality. — Thomas C. Redman, author of Getting in Front on Data

Another article by the same author, “Improve Data Quality for Competitive Advantage”, was published in MIT Sloan Management Review in 1995, well before the big data buzz and the current generation of data processing tools existed.

Data failures are embarrassing and costly.

A database is like a lake. To ensure a clean lake, one must first eliminate the sources of pollution. Too many databases/lakes are horribly polluted. Just as those who drink from a lake are wise to treat the streams that feed it as an asset, those who use data are wise to recognize data-quality problems, to treat processes that create data as an asset, and to apply quality systems to those processes. — Thomas C. Redman

As data grows in volume and becomes real-time, why not implement data quality management processes on continuous streaming data? This post is the first in a series on implementing data quality principles on real-time streaming data. Over this series, we will build a functional, Kafka Streams API-based data quality tool for real-time streaming data.

Kafka is the de facto standard for a centralized messaging backbone in many enterprises and startups. It is a highly successful messaging infrastructure powering event-driven microservices, and it can also be used to perform data quality management activities.

Though the Kafka Streams API is a library that can be embedded in any Java application, it processes a single event at a time and is heavily dependent on the underlying Kafka broker, so it puts some pressure on the Kafka cluster. To avoid this, a better design is to separate data quality activities from the critical path of the Kafka broker and other systems by mirroring the data to another Kafka cluster, on which the data quality streaming apps can run.
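As an illustration of this mirroring setup, here is a minimal MirrorMaker 2 configuration sketch that replicates topics one way, from the production cluster to a dedicated data quality cluster. The cluster aliases, hostnames, and topic pattern are assumptions for this example, not values from the post.

```properties
# Hypothetical connect-mirror-maker.properties: replicate events from the
# primary cluster to a dedicated data quality (dq) cluster.
clusters = primary, dq
primary.bootstrap.servers = primary-broker:9092
dq.bootstrap.servers = dq-broker:9092

# One-way replication: primary -> dq
primary->dq.enabled = true
primary->dq.topics = .*
dq->primary.enabled = false
```

The data quality streaming apps would then consume the mirrored topics on the `dq` cluster, keeping their load off the production brokers.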

A dedicated Kafka Cluster for Data Quality Management

For stateful operations, the Streams API writes back to the Kafka broker, so it is important to scale the Kafka cluster in terms of memory and compute based on the number of events expected per second. This is well documented in the Confluent documentation, Streams Operations Capacity Planning and Sizing. In particular, for stateful stream apps, both a) local storage space for interactive queries and b) global storage space (the Kafka broker cluster) for fault tolerance and internal operations (changelog topics) should be monitored.

Data profiling is the starting point for data initiatives. Profiling helps find issues with the data, and data warehouse projects start with data profiling as the first activity. It is possible to apply the same principles to big data, whether batch or real-time streaming; yet the challenges with real-time data are different from those of profiling offline or batch data sets.

Below are four general methods used by data profiling tools.

1. Column profiling
2. Cross-column profiling
3. Cross-table profiling
4. Data rule validation

In this post, we will create a single-column profiler on continuous data. The app calculates the completeness of a column, defined by

Completeness = (total records - empty records) / total records

A value of 1.0 is similar to applying a SQL NOT NULL constraint on the table.
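As a minimal sketch of this metric (the class and method names are hypothetical, not from the post's code), completeness can be computed from two running counters:

```java
public class ColumnCompleteness {

    // completeness = (total records - empty records) / total records.
    // A column with no records yet is treated as complete by convention,
    // since there is no evidence of missing values.
    public static double completeness(long totalRecords, long emptyRecords) {
        if (totalRecords == 0) {
            return 1.0;
        }
        return (double) (totalRecords - emptyRecords) / totalRecords;
    }
}
```

A result of 1.0 means every record carried a value for the column, matching the SQL NOT NULL analogy above; 0.8 would mean one in five records was blank.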

Single Column Profiling Streaming App

The streaming app is designed with inspiration from an excellent Kafka Streams demo application, the Kafka Music Demo. The design of the app is shown below.

Data Quality Column Profiler Streams App

The column profiling results are available to external systems via a REST API service that queries the state store (aka interactive queries). The REST endpoint /dq/cp/{storename} continuously outputs completeness check results.
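To illustrate what such an endpoint does under the hood, here is a hedged sketch of the interactive-query lookup it could perform; the method and store names are assumptions, and this fragment needs a running KafkaStreams instance to execute.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQuerySketch {

    // Hypothetical handler logic behind GET /dq/cp/{storename}: read the
    // current count for a column from the local state store.
    public static Long countFor(KafkaStreams streams, String storeName, String column) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        storeName, QueryableStoreTypes.<String, Long>keyValueStore()));
        return store.get(column);
    }
}
```

In a multi-instance deployment, the handler would also need to route to the instance that hosts the key, which is what makes interactive queries a service-level concern rather than a single method call.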

The app assumes the event data is in CSV format (at a later point the data format can be changed to JSON, Avro, or other event formats, backed by a Schema Registry). The column names are provided as input arguments to the DataQualityApp and default to column1, column2, etc., if none are provided.

The Topology for completeness calculation

The meat of the code is the streaming topology, which checks each column of each record for blank or empty values and updates metrics in the state store.
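A minimal sketch of such a topology is below. This is not the post's actual implementation: the topic name, store names, and class name are assumptions, and it tracks just the two counters the completeness formula needs (empty count and total count per column).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class ColumnProfilerTopology {

    // For each CSV record, emit one (columnName, flag) pair per column
    // (1 = blank, 0 = non-blank), then maintain two state stores:
    // a running empty count and a running total count per column.
    public static Topology build(String inputTopic, String[] columnNames) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events =
                builder.stream(inputTopic, Consumed.with(Serdes.String(), Serdes.String()));

        KStream<String, Long> flags = events.flatMap((key, csvLine) -> {
            String[] cols = csvLine.split(",", -1); // -1 keeps trailing empty fields
            List<KeyValue<String, Long>> out = new ArrayList<>();
            for (int i = 0; i < cols.length && i < columnNames.length; i++) {
                out.add(KeyValue.pair(columnNames[i], cols[i].trim().isEmpty() ? 1L : 0L));
            }
            return out;
        });

        // Running empty count per column (sum of the 0/1 flags).
        flags.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
             .reduce(Long::sum,
                     Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("empty-counts")
                             .withKeySerde(Serdes.String())
                             .withValueSerde(Serdes.Long()));

        // Running total record count per column.
        flags.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
             .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("total-counts")
                             .withKeySerde(Serdes.String())
                             .withValueSerde(Serdes.Long()));

        return builder.build();
    }
}
```

A REST layer can then read both stores and report completeness as (total - empty) / total per column.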

What’s next

  • Generalize the event schema; currently the app only supports CSV format
  • Add serializers for CSV, JSON, and Avro
  • Enhance column profiling metrics with column datatype and statistics for numeric fields (min, max, mean, and standard deviation)
  • Value distribution histograms for categorical string fields
  • Distinct counts vs. approximate counts (Apache DataSketches)
  • Window the column profiling metrics. Currently, metrics such as empty count are calculated continuously and keep growing; windowing groups data over a definite period, which gives a quick gauge of data quality and also enables building histograms and analyzing changes over time.

The functional demo app code, including test cases, is available below.


Disclaimer: All the opinions expressed are my personal independent thoughts and not to be attributed to my current or previous employers.
