Data Strategy — Data Quality

Designing data pipelines using microservice architecture design patterns.


Data is all around us, but when facing a data quality problem, Data Engineers and Data Analysts today will spend hours, days, and even weeks analyzing the root cause of an issue. 95% of companies do not fully trust their own data, so reliable insights remain out of reach. It is very important to come up with a solution that solves this issue.

One possible option for solving data quality issues is to apply the “Circuit Breaker” pattern from microservices.

A microservice architectural pattern is an approach to developing a single application as a suite of small services, each running in its own process and communicating through lightweight mechanisms, often an HTTP resource API. The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually, you will also want some kind of monitoring alert if the circuit breaker trips.
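As a rough sketch of that idea (not any particular library's API; the class name, threshold, and timeout below are assumptions made for illustration), a breaker that wraps a protected call might look like this in Python:

```python
import time

class CircuitBreakerOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""

class CircuitBreaker:
    """Wraps a protected call; trips after `max_failures` consecutive errors
    and rejects further calls until `reset_timeout` seconds have elapsed."""

    def __init__(self, protected_call, max_failures=3, reset_timeout=60):
        self.protected_call = protected_call
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp of the moment the breaker tripped

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise CircuitBreakerOpenError("circuit is open; call skipped")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = self.protected_call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker; alert/monitor here
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

The point where the breaker trips is exactly where the monitoring alert mentioned above would be raised.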

Equivalent to the “Circuit Breaker” pattern in microservices, we design circuit breakers for data pipelines. In the presence of an issue, the circuit opens, preventing low-quality data from propagating to downstream processes. What is the result of this? Data will be missing from the reports for the time periods of low quality. This approach also eliminates the unsustainable requirement of verifying and fixing the information at the physical database level.

Following this principle, we can divide the details of this approach into three sections:

  • Data pipelines
  • Circuit Breaker pattern for data pipelines
  • Implementation in production

Data pipelines

Data pipelines are logical abstractions representing a sequence of data transformations required for converting raw data into insights. Data pipelines ingest data from different sources and apply a sequence of ETL/ELT and analytical queries to generate insights in the form of reports, dashboards, machine learning models, or simply an output table holding the final result. The insights are used both for data-driven business operations and in customer-facing product experiences.
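To make the abstraction concrete, here is a minimal, tool-agnostic sketch of a pipeline as an ordered list of transformations; the step names and sample data are invented for illustration:

```python
# Hypothetical pipeline: ingest a raw batch, then apply transformations in order.

def ingest_raw_orders(batch_id):
    """Stand-in for reading a raw batch from the Lake."""
    return [{"order_id": 1, "amount": 42.0}, {"order_id": None, "amount": 13.5}]

def clean(rows):
    """Drop rows that are missing required fields."""
    return [r for r in rows if r["order_id"] is not None]

def aggregate(rows):
    """Reduce the cleaned rows into a report-style summary."""
    return {"total_amount": sum(r["amount"] for r in rows), "row_count": len(rows)}

PIPELINE = [clean, aggregate]  # the sequence of transformations after ingestion

def run(batch_id):
    data = ingest_raw_orders(batch_id)
    for step in PIPELINE:
        data = step(data)
    return data  # in practice this would land in an output table or dashboard

print(run("2023-01-01"))  # {'total_amount': 42.0, 'row_count': 1}
```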

Data from the sources can be ingested into a Lake (Cloud Storage, HDFS, etc.) using integration tools such as Talend, AWS Glue, or Sqoop. Data in the Lake is then analyzed and moved to MPP warehouses such as Snowflake.

Within a data pipeline, data quality issues are introduced at different stages. We can categorize them into three buckets: a) source-related issues, b) ingestion-related issues, and c) referential integrity issues. The root cause of these issues is usually a combination of operational errors, poor logistics, a lack of change management, or data model inconsistencies.
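As an illustration of how each bucket can be guarded, the checks below are hedged sketches; the field names, thresholds, and reference data are assumptions, not part of any specific toolkit:

```python
# One illustrative check per bucket; a real profiler would run many such checks.

def check_source(batch):
    """Source-related: catch an empty or truncated extract from the source."""
    return len(batch) > 0

def check_ingestion(batch, expected_rows):
    """Ingestion-related: catch rows dropped or duplicated while loading."""
    return abs(len(batch) - expected_rows) / expected_rows < 0.05

def check_referential_integrity(batch, known_customer_ids):
    """Referential integrity: catch facts that reference unknown dimension keys."""
    return all(row["customer_id"] in known_customer_ids for row in batch)
```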

Circuit Breaker pattern for data pipelines

The Circuit Breaker pattern is popular in microservice architectures: instead of having the API wait for a slow or saturated microservice, the circuit breaker proactively skips calling the service. The end result is a predictable API response time, with the trade-off that certain services may be temporarily unavailable. When the microservice issue is resolved, the breaker is closed and the service becomes available again.

The circuit breaker for data pipelines follows a similar pattern. The quality of the data is proactively analyzed; if it is below the threshold, instead of letting the pipeline job continue and mix high- and low-quality data, the circuit is opened, preventing downstream processing of the low-quality data batch. There is an implicit guarantee that the available insights will always be reliable, i.e., if the data was of low quality, it will simply be missing. Data ingested into the Lake is persisted in hourly or daily batches in a staging area. Each batch is analyzed for data quality, and when an issue is detected, the circuit is opened, preventing downstream processing of that batch. When the circuit is open, teams are alerted to diagnose the issue. If the issue can be resolved, the batch is backfilled and made available for downstream processing.

In the Data Pipeline circuit breaker pattern, there are two states:

  • The circuit is open: data is not flowing, i.e., an issue has been discovered, so downstream data is not available.
  • The circuit is closed: data is flowing through the pipeline.

In both states, the quality of data partitions is continuously checked. When the data meets the quality threshold, the circuit moves from the open to the closed state; when data quality fails, the circuit moves from closed to open.
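A minimal sketch of this state machine, assuming each staged batch can be scored between 0 and 1 by a set of quality checks (the checks, threshold, and alert hook below are placeholders, not a prescribed implementation):

```python
OPEN, CLOSED = "open", "closed"

class DataPipelineCircuitBreaker:
    """Holds back a staged batch when its quality score falls below a threshold."""

    def __init__(self, quality_checks, threshold=0.99, alert=print):
        self.quality_checks = quality_checks  # functions: batch -> score in [0, 1]
        self.threshold = threshold            # minimum acceptable quality score
        self.alert = alert                    # notification hook for the team
        self.state = CLOSED

    def evaluate(self, batch_id, batch):
        """Score the batch and move the circuit between open and closed."""
        score = min(check(batch) for check in self.quality_checks)
        if score < self.threshold:
            self.state = OPEN
            self.alert(f"Circuit OPEN for batch {batch_id}: quality score {score:.3f}")
        else:
            self.state = CLOSED
        return self.state

    def process(self, batch_id, batch, downstream):
        """Run downstream processing only when the circuit is closed."""
        if self.evaluate(batch_id, batch) == OPEN:
            return None  # batch stays in staging until diagnosed and backfilled
        return downstream(batch)
```

Once the underlying issue is fixed, re-evaluating the backfilled batch closes the circuit again and downstream processing resumes.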

Implementation in production

Implementing circuit breakers in production requires three core functions:

  • Tracking data lineage: finds all tables and jobs involved in the transformation from source tables to output tables, including data models, stored procedures, triggers, machine learning models, etc.
  • Profiling data pipelines: tracks events, statistics, and anomalies associated with the data pipeline. The profiling is divided into an operational level and a data level.
  • Controlling the circuit breaker: triggers the circuit based on the issues discovered by profiling.

Tracking data lineage is accomplished by analyzing the queries associated with the pipeline jobs. A pipeline is composed of jobs, and each job is composed of one or more scripts (e.g., stored procedures, Python, Java); each job is analyzed for its inputs and outputs. The lineage of a pipeline can then be represented as a tree in which activities are glued together: the output of one job becomes the input of the next.
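A toy illustration of the idea, with made-up job definitions; in practice the inputs and outputs would be extracted by parsing each job's SQL, stored procedures, or scripts:

```python
# Each job declares the tables it reads and the table it writes.
JOBS = {
    "stage_orders": {"inputs": ["raw.orders"], "output": "staging.orders"},
    "stage_users": {"inputs": ["raw.users"], "output": "staging.users"},
    "build_daily_sales": {"inputs": ["staging.orders", "staging.users"],
                          "output": "reports.daily_sales"},
}

def upstream_tables(table, jobs=JOBS):
    """Walk the lineage backwards: every table the given table depends on."""
    lineage = set()
    for job in jobs.values():
        if job["output"] == table:
            for source in job["inputs"]:
                lineage.add(source)
                lineage |= upstream_tables(source, jobs)
    return lineage

print(upstream_tables("reports.daily_sales"))
# {'staging.orders', 'staging.users', 'raw.orders', 'raw.users'}
```

When a low-quality batch trips the circuit, this upstream set tells the team exactly which tables and jobs to inspect while diagnosing the issue.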

Conclusion

To summarize, when it comes to designing data pipelines in a microservice fashion, it is best practice to implement your architecture with a circuit breaker design pattern if you want to stop bad data from streaming down across your systems. With this article, I want to encourage you to think critically about the data engineering problems that you try to solve.