Upgrade your Data Engineering Game with Deequ: A Framework for Flawless Analytics

Teepika R M
5 min read · Jun 26, 2023

In today’s era of exponential data growth and rapid data processing, ensuring data quality is paramount. High-quality data is the foundation for generating valuable and reliable insights.

Let’s dive into the world of bad data and look at how it can be misinterpreted and lead to bad business outcomes. This helps illustrate why maintaining data quality is essential for producing reliable insights.

Different types of bad data:

  • Missing Data:

Missing data can introduce bias and lead to under-fitting: the lack of sufficient data in important columns reduces the statistical power of the analysis.

  • Inaccurate Data:

Neglecting to handle incorrect data can lead to financial losses or operational inefficiencies. For instance, inaccurate sales data can result in poor demand forecasting, which directly impacts the company’s revenue.

  • Duplicate Data:

The drawbacks of duplicate data range from unnecessary storage costs to misleading analysis. In data analysis, duplicate records create skewness, an imbalance or distortion in the distribution of the data, which disrupts statistical calculations such as averages, medians, and standard deviations.

  • Inconsistent Data:

Inconsistent data severely impacts smooth data integration. For instance, consider format inconsistencies, where the same column follows different formats in different source systems; this disrupts the integration. Similarly, different naming conventions across source systems cause problems.

  • Outliers:

Outliers are data points that deviate significantly from the normal range. Because they are extreme values, they can heavily distort statistical measures. In data science problems, outliers can also influence an ML model so strongly that it performs poorly on unseen test data.

  • Incorrect Data Types:

Incorrect data types can lead to computational errors in data processing operations. Applying arithmetic operations to data of an unsupported type can cause errors or unexpected results, leading to operational disruption or erroneous output.

  • Incomplete Data:

Incomplete data restricts the ability to produce comprehensive insights. Without complete information, we may not be able to detect patterns, trends, or correlations in the data. Correlation simply means that two features (columns) are either positively or negatively related to each other.

Positive Correlation: House price increases, potential rent price increases

Negative Correlation: House price increases, interest rate decreases

  • Inappropriate Data:

Inappropriate data is data that does not meet the standards or rules for a specific purpose. It can damage the organization’s reputation and erode trust among stakeholders, hurting the business.

  • Stale Data:

Stale data is data that is outdated or no longer relevant for analysis. Relying on it delays actions and decisions, leading to missed opportunities, slow responses to market trends, and so on.

  • Inferred or Imputed Data:

Inferred data is data derived from the actual base data, so it is important that the base data is clean and of high quality to avoid pitfalls. Generating inferred data involves making assumptions or generalizations, which can overlook specifics in the original data, so extra care is needed to keep the base data clean and clear.

Note that the cases mentioned above are common examples, not an exhaustive list. They are discussed in detail to make readers aware of how much damage bad data can do.

When designing data quality checks, it is important to look at the data from every angle in order to efficiently catch all of these shortcomings.

Past personal experience handling data quality at production scale:

Project overview:

This was a data migration project. The goal was to transfer data from a legacy Oracle database system to the Hadoop Distributed File System (HDFS). It involved extracting data from the source, transforming it into the required format, and loading it into HDFS for further analysis by big data pipelines.

Specificities involved for migration:

Type of Data: JSON files with customer details, sales transactions, product details and inventory records.

Volume of Data transferred: Around 500 GB involving many millions of records.

Tech Stack involved: An Apache Spark cluster of 10 to 20 instances was used to implement the data migration. Apache Spark provides a JDBC connector that allows you to establish connections with Oracle databases and extract data. The cluster used for the project was ephemeral, i.e. a short-lived, ad-hoc cluster created specifically for the migration.
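
To make the extraction step concrete, here is a minimal sketch of reading an Oracle table over Spark’s JDBC connector and landing it in HDFS as Parquet. The hostname, credentials, table, and partitioning column are hypothetical placeholders for illustration, not the actual project configuration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-to-hdfs-migration")
  .getOrCreate()

// Read one source table over Spark's JDBC connector.
// Connection details and table/column names below are illustrative placeholders.
val salesDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//legacy-db-host:1521/ORCLPDB1")
  .option("dbtable", "SALES.TRANSACTIONS")
  .option("user", "etl_user")
  .option("password", sys.env("ORACLE_PASSWORD"))
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("fetchsize", "10000")                  // rows fetched per round trip
  .option("partitionColumn", "TRANSACTION_ID")   // numeric column for parallel reads
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "20")
  .load()

// Land the extracted data in HDFS in a columnar format for downstream pipelines.
salesDF.write.mode("overwrite").parquet("hdfs:///data/migrated/sales_transactions")
```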

How was data quality ensured on the migrated data?

Data validation rules were defined based on the target system’s requirements. The rules included verifying data types, formats, and values, and validating relationships between entities to ensure integrity. Automated data quality checks with quality reports were generated, and an alerting mechanism for deviations from the expected data quality standards was also implemented.

Tech stack involved for implementing data quality:

A business rules validation approach was used to ensure quality. Apache Spark with the Scala programming language was used to build the framework.
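
As an illustration of what such a hand-rolled business-rule check can look like, here is a minimal Spark/Scala sketch. The column names and rules are hypothetical placeholders, not the project’s actual validation logic.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// A hand-rolled quality report in plain Spark: row count, null keys,
// duplicate keys, and out-of-range amounts. Column names are placeholders.
def basicQualityReport(df: DataFrame, keyColumn: String, amountColumn: String): Unit = {
  val total          = df.count()
  val nullKeys       = df.filter(col(keyColumn).isNull).count()
  val duplicateKeys  = total - df.dropDuplicates(keyColumn).count()
  val negativeAmount = df.filter(col(amountColumn) < 0).count()

  println(
    s"rows=$total, null $keyColumn=$nullKeys, " +
    s"duplicate $keyColumn=$duplicateKeys, negative $amountColumn=$negativeAmount")
}

// Example usage against the migrated sales data:
// basicQualityReport(salesDF, "transaction_id", "sale_amount")
```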

A from-scratch implementation gives fine-grained control, but it involves manual effort at every step: environment setup, resource provisioning, implementation, and testing. Much of this manual effort can be avoided with the Deequ framework, which offers “plug and play” flexibility.

Current take on implementing data quality at production scale:

What is Data Profiling?

Data profiling is the process of analyzing data and creating useful summaries about it.

The same data quality checks can be implemented effortlessly with the Deequ framework. Start with data profiling to gain insights about the data: Deequ’s profiling capabilities generate summary statistics that help you understand the data better. With Deequ’s constraint API, project-specific business rules can be incorporated for validation. The results of the constraint checks are available in the ‘VerificationResult’ and can be inspected to find violations; each violation comes with details such as the violated constraint, the column name, and the specific data values causing the issue. Based on severity, we can fix the violations and rerun the quality check, and we can schedule periodic checks or implement feedback loops as required.
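
A minimal sketch of that flow with Deequ’s check API is shown below. The DataFrame and column names (e.g. customer_id, sale_amount) are hypothetical, and the constraints stand in for whatever project-specific business rules apply.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus

// Define project-specific business rules as a Check and run them on the DataFrame.
val verificationResult = VerificationSuite()
  .onData(salesDF)                                // hypothetical migrated DataFrame
  .addCheck(
    Check(CheckLevel.Error, "migration data quality")
      .hasSize(_ > 0)                             // table should not be empty
      .isComplete("customer_id")                  // no nulls in the key column
      .isUnique("customer_id")                    // no duplicate keys
      .isNonNegative("sale_amount"))              // business rule on values
  .run()

// Inspect the VerificationResult and print the violated constraints.
if (verificationResult.status != CheckStatus.Success) {
  verificationResult.checkResults
    .flatMap { case (_, checkResult) => checkResult.constraintResults }
    .filter(_.status != ConstraintStatus.Success)
    .foreach { result =>
      println(s"${result.constraint} failed: ${result.message.getOrElse("")}")
    }
}
```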

Note: Apache Deequ framework itself is built using Apache Spark.

Key Features of Deequ:

Data Quality Checks: Deequ lets you define custom data quality checks to ensure your data meets the expected standards of correctness.

Automated Profiling: It can automatically generate summary statistics for analyzing your data.
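
For example, a profiling run over a (hypothetical) DataFrame might look like the following sketch:

```scala
import com.amazon.deequ.profiles.ColumnProfilerRunner

// Profile every column of the DataFrame and print a few summary statistics.
val profileResult = ColumnProfilerRunner()
  .onData(salesDF)                                // hypothetical DataFrame
  .run()

profileResult.profiles.foreach { case (columnName, profile) =>
  println(s"$columnName: dataType=${profile.dataType}, " +
    s"completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}")
}
```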

Anomaly Detection: Deequ allows you to set custom thresholds or rules to find outliers that deviate from the normal distribution of the data, or you can use statistical methods such as the mean and standard deviation to flag outlier data points.

What is data drift? A shift in the data’s distribution or characteristics as it changes over time.

Data Drift Monitoring: Deequ enables you to monitor data drift over time by comparing data profiles between different time points.
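
A sketch of how the anomaly detection and drift monitoring features fit together, using a metrics repository to compare today’s run against a stored earlier run. The dataset names, result keys, and threshold are assumptions for illustration only.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

// Metrics from each run are stored so that later runs can be compared against them.
val metricsRepository = new InMemoryMetricsRepository()

// Yesterday's run: record the dataset size in the repository.
VerificationSuite()
  .onData(yesterdaysData)                          // hypothetical DataFrame
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis() - 24 * 60 * 60 * 1000L))
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)),  // flag >2x growth
    Size())
  .run()

// Today's run: the same anomaly check now compares against yesterday's stored metric.
val todaysResult = VerificationSuite()
  .onData(todaysData)                              // hypothetical DataFrame
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)),
    Size())
  .run()

if (todaysResult.status != CheckStatus.Success) {
  println("Anomaly detected: dataset size changed more than expected since the last run.")
}
```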

Constraint Suggestions: It provides suggestions for new constraints based on the data characteristics and existing constraints.
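
A short sketch of the suggestion API, again on a hypothetical DataFrame:

```scala
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Ask Deequ to propose constraints based on the data it observes.
val suggestionResult = ConstraintSuggestionRunner()
  .onData(salesDF)                                // hypothetical DataFrame
  .addConstraintRules(Rules.DEFAULT)
  .run()

// Print the suggested constraints along with the code needed to apply them.
suggestionResult.constraintSuggestions.foreach { case (columnName, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"Suggestion for '$columnName': ${suggestion.description}")
    println(s"  code: ${suggestion.codeForConstraint}")
  }
}
```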

Deequ, being a customizable and extensible data quality framework, is a powerful library. By incorporating Deequ into your data pipeline, you can automate the process of data quality assessment and monitoring, saving time and effort while improving the reliability of your data.

Teepika R M

AWS Certified Big Data Specialty | Linux Certified Kubernetes Application Developer | Hortonworks Certified Spark Developer | Hortonworks Certified Hadoop Developer