Monitoring Data Quality with Amazon Deequ

Published in Data Engineer Things

Apr 5, 2021 · 4 min read

Data quality is a broad topic that spans schema validation, data cleaning, data profiling, unit testing, and monitoring.

In this article, I will explain, with examples, how Amazon Deequ can be used to monitor data quality. You will learn how to:

Create profiling rules with Deequ
Write results to InfluxDB
Visualize results in Grafana

About Amazon Deequ

According to its developers, Deequ is a library built on top of Apache Spark for defining “unit tests for data” that measure data quality in large datasets. The project is open source and available on GitHub.

We can use Deequ to calculate various metrics (such as CountDistinct, Distinctness, Maximum, Mean, Minimum, Uniqueness, and others) against datasets. All calculations run as Apache Spark jobs, so the library scales to large datasets.
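To make the difference between some of these metrics concrete, here is a minimal plain-Python sketch of what CountDistinct, Distinctness, and Uniqueness measure on a single column. It is only an illustration of the definitions, not Deequ's implementation: Deequ computes them on Spark, and this sketch ignores null handling. The function names and the sample column are made up for the example.

```python
from collections import Counter


def count_distinct(values):
    """Number of different values in the column."""
    return len(set(values))


def distinctness(values):
    """Distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


def uniqueness(values):
    """Values that occur exactly once, divided by the number of rows."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


# Hypothetical column: "a" appears twice, "b" and "c" once each.
column = ["a", "a", "b", "c"]

print(count_distinct(column))  # 3
print(distinctness(column))    # 0.75
print(uniqueness(column))      # 0.5
```

A duplicated value still counts towards Distinctness but not towards Uniqueness, which is why the two numbers diverge as soon as duplicates appear.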

Environment Setup

To follow this tutorial, you will need:

Apache Spark 2.2.x to 2.4.x, or 3.0.x
Docker
Scala/Java

First, set up and connect to InfluxDB by running the following commands in a console:

# pull the InfluxDB Docker image
docker pull influxdb:1.8.4
# run the Docker container
docker run --rm --name=influxdb -d -p 8086:8086 influxdb:1.8.4
# connect to the Docker container and run the InfluxDB console
docker exec -it influxdb influx
# create the database which we will use for storing our metrics
create database example

Then, install Grafana (in a different console):

# pull the Grafana Docker image
docker pull grafana/grafana
# run the Grafana Docker container and link it to the InfluxDB container
docker run -d --rm --name=grafana -p 3000:3000 --link influxdb grafana/grafana

Now you can clone this repository from GitHub and open it in your favorite IDE.

Generate Data Quality Metrics

The InfluxDBMetricRepository object contains the main logic for loading input data, calculating data quality metrics, and writing results to InfluxDB.

The MetricsRepository is a Deequ interface for saving computation results to different systems and formats. Out of the box, the library can keep results in memory, write them to the file system as JSON, or expose them as a DataFrame. You can extend this interface and add implementations for other systems and formats. In this tutorial, we will use such an implementation to save our results to InfluxDB.
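The idea of a pluggable repository can be sketched in a few lines of plain Python. This is not the Scala code from the repository and not Deequ's API: the class names, the "data_quality" measurement, and the tag and field layout are all assumptions chosen for illustration. The sketch shows one interface with two implementations, one that keeps results in memory and one that renders each metric as an InfluxDB line-protocol string (measurement, tags, one field, nanosecond timestamp) instead of sending it. It assumes tag values contain no spaces, commas, or equals signs, which would need escaping.

```python
from abc import ABC, abstractmethod


class MetricsRepository(ABC):
    """Hypothetical stand-in for a 'save results somewhere' interface."""

    @abstractmethod
    def save(self, timestamp_ms, tags, metrics):
        """Persist the metrics (name -> value) of one run."""


class InMemoryRepository(MetricsRepository):
    """Keeps the metrics of each run in a dict keyed by timestamp."""

    def __init__(self):
        self.runs = {}

    def save(self, timestamp_ms, tags, metrics):
        self.runs[timestamp_ms] = dict(metrics)


class LineProtocolRepository(MetricsRepository):
    """Renders every metric as one InfluxDB line-protocol line."""

    def __init__(self, measurement):
        self.measurement = measurement
        self.lines = []

    def save(self, timestamp_ms, tags, metrics):
        timestamp_ns = timestamp_ms * 1_000_000  # line protocol defaults to ns
        for name, value in metrics.items():
            all_tags = {**tags, "metric": name}
            tag_part = "".join(f",{k}={v}" for k, v in sorted(all_tags.items()))
            self.lines.append(
                f"{self.measurement}{tag_part} value={float(value)} {timestamp_ns}"
            )


repo = LineProtocolRepository("data_quality")
repo.save(1617580800000, {"dataset": "orders"}, {"Completeness": 1.0, "Size": 4})
print("\n".join(repo.lines))
# data_quality,dataset=orders,metric=Completeness value=1.0 1617580800000000000
# data_quality,dataset=orders,metric=Size value=4.0 1617580800000000000
```

Because the calling code depends only on the interface, swapping the storage backend does not change how the metrics are computed.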

Execute the InfluxDBMetricRepository object, and then run the following commands in the InfluxDB console to check the results:

use example
select * from InfluxDBMetricsRepository

The query should return the stored metrics.

If you run this example several times against the same data, you will find results for each execution. In InfluxDB, you can define the retention period (how long the data should be stored).
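As a rough sketch of how a retention period could be set, the snippet below builds an InfluxQL CREATE RETENTION POLICY statement and wraps it in a request for the InfluxDB 1.x /query HTTP endpoint. The policy name "thirty_days", the 30-day duration, and the helper function names are assumptions for illustration. The database name and port come from the setup above. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def retention_policy_statement(name, database, duration, default=True):
    """Build an InfluxQL statement that creates a retention policy."""
    statement = (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION 1"
    )
    if default:
        statement += " DEFAULT"
    return statement


def build_query_request(base_url, statement):
    """Wrap the statement in a POST request for the /query endpoint."""
    body = urlencode({"q": statement}).encode("utf-8")
    return Request(f"{base_url}/query", data=body, method="POST")


# Hypothetical policy: keep points for 30 days in the "example" database.
statement = retention_policy_statement("thirty_days", "example", "30d")
request = build_query_request("http://localhost:8086", statement)

print(statement)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

The same statement can be typed directly into the InfluxDB console. Marking the policy as DEFAULT affects new writes that do not name a policy; points already written under the previous default policy stay there.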

Visualization in Grafana

Now it’s time to create a dashboard in Grafana. Open this link in your browser: http://localhost:3000/ (the default username/password is admin/admin).

Let’s create a new Data Source for our InfluxDB. Go to Configuration -> Data Sources -> Add Data Source and enter the following settings:

Name: InfluxDB
URL: http://influxdb:8086
Database: example
User: admin
Password: admin

Press Save & Test.
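If you prefer to script this step instead of clicking through the UI, the same settings can be expressed as a JSON payload for Grafana's data source HTTP API (POST /api/datasources). The sketch below only builds and prints the payload. The field layout is my understanding of that API for Grafana versions from around the time of writing and may differ in newer releases, and the helper name is made up.

```python
import json


def influxdb_datasource_payload(name, url, database, user, password):
    """Build the JSON body for creating an InfluxDB data source in Grafana."""
    return {
        "name": name,
        "type": "influxdb",
        "access": "proxy",  # Grafana's backend queries InfluxDB, not the browser
        "url": url,
        "database": database,
        "user": user,
        "secureJsonData": {"password": password},
    }


# Same values as entered in the form above.
payload = influxdb_datasource_payload(
    name="InfluxDB",
    url="http://influxdb:8086",
    database="example",
    user="admin",
    password="admin",
)
print(json.dumps(payload, indent=2))
```

The URL uses the container name influxdb rather than localhost because Grafana reaches InfluxDB through the Docker link created during setup.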

Then define the query parameters for your dashboard panel.

You can adjust other dashboard parameters to achieve the necessary results.

Conclusion

You can use Amazon Deequ not only as a library for unit-testing data but also for data quality monitoring. In this example, I created a MetricsRepository implementation as a connector from Deequ to InfluxDB, but you can create your own or use Deequ's built-in FileSystemMetricsRepository (which requires you to process the output files with a tool of your choice). You can also build and visualize derived metrics, and even send notifications to data stewards.

References

Amazon Deequ project page
Test data quality at scale with Deequ
InfluxDB MetricRepository and example
InfluxDB key concepts

Written by Alexey Artemov
