Data Quality & the GIGO Menace

Eliud Nduati
Nov 14, 2022

We’ve all come across the statement, “garbage in, garbage out!” If you work in a data field, this statement hits differently.

Photo by John Schnobrich on Unsplash

The use of data in business has become synonymous with productivity and efficiency. To get reliable, high-quality results, which are later reflected in your decisions, you need high-quality data. You need data to plan your business processes, and you need to analyze it and interpret the resulting predictions and insights to gain an edge. That sums up the point I am trying to make. Are you still wondering why we are talking about data quality?

We come back to the statement summed up in “GIGO.” When we use low-quality data in our analysis or predictions, the results and insights we get are also of low quality. At best, they have minimal impact on our planning and on the efficiency of our business activities. No one wants to base business decisions on ill-informed insights or inputs: it means losing revenue and your competitive edge. On the other hand, having high-quality data means that the decisions you make are informed and more likely to give you an advantage over your rivals and increase your revenue. High-quality data also reduces the cost of preparing it to meet the requirements of any analysis you might run. So how do we measure data quality, you ask… hold on!

Dimensions of data quality

Data quality is measured under six dimensions. These include:

  1. Completeness
  2. Accuracy
  3. Validity
  4. Consistency
  5. Integrity
  6. Uniqueness

Let’s explore each of these in simpler terms 😄

Completeness

This dimension simply asks whether the data has all the vital attributes needed to be usable and to solve the problem at hand. If you want to map your customer locations, you need the customer addresses. Any data point missing this information is incomplete. When collecting data, you must ensure that all the vital information needed for your specific purpose is captured. Data is complete only if it includes this essential information, that is, the information needed to answer the questions you have at hand.
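To make the customer-address example a bit more concrete, here is a minimal sketch of a completeness check in plain Python. The records, the field names, and the `is_complete` helper are all made up for illustration; which fields count as "vital" depends on your own use case.

```python
# Hypothetical customer records; field names are assumptions for illustration.
customers = [
    {"id": 1, "name": "Amina", "address": "12 Example Road"},
    {"id": 2, "name": "Brian", "address": ""},       # blank address
    {"id": 3, "name": "Carol", "address": None},     # address recorded as missing
    {"id": 4, "name": "David"},                      # address field absent
]

# The fields we treat as vital for mapping customer locations (an assumption).
REQUIRED_FIELDS = ("name", "address")


def is_complete(record, required=REQUIRED_FIELDS):
    """Return True when every required field is present and non-blank."""
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True


incomplete_ids = [r["id"] for r in customers if not is_complete(r)]
completeness_rate = (len(customers) - len(incomplete_ids)) / len(customers)

print(incomplete_ids)      # [2, 3, 4]
print(completeness_rate)   # 0.25
```

Tracking the rate as well as the offending IDs is a design choice: the rate tells you how bad things are overall, while the IDs tell you which records to go and fix.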

Accuracy

Accuracy relates to how well the data represents the real world. When you collect customer details such as phone numbers and dates of birth, you must ensure that they are correct for each particular customer. If the phone numbers turn out to be wrong, they cannot be used to contact your customers, which means the data is inaccurate.

Validity

Let’s look at validity using the previous scenario, where we were collecting dates of birth. When you check your data and find customers over 100 years old and others under 15, your data is likely to be invalid. However, this also depends on your business activities. If you are collecting data about end-of-life care homes, ages above 90 or even 100 are plausible in some cases. If your data is about children’s games, players aged 20, 30, or 50 would likely indicate invalid data (depends, though 🤣).
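Since the valid age range depends on the business, one way to sketch a validity check is to pass the range in as a parameter. The snippet below is only an illustration in plain Python: the function names, the sample date of birth, and the age bounds are assumptions, not rules from any standard.

```python
from datetime import date


def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def is_valid_age(birth_date, today, min_age, max_age):
    """True when the age falls inside the range the business expects."""
    return min_age <= age_on(birth_date, today) <= max_age


today = date(2022, 11, 14)   # fixed date so the example is reproducible
dob = date(1920, 5, 1)       # a hypothetical customer aged 102

print(age_on(dob, today))                  # 102
print(is_valid_age(dob, today, 15, 100))   # False: outside a general customer range
print(is_valid_age(dob, today, 60, 110))   # True: plausible for a care home
```

The same record is flagged in one setting and accepted in the other, which is exactly the "it depends on your business" point.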

Consistency

Data consistency is closely related to data accuracy. Simply put, it checks whether data stored in one location matches the same data stored in another. If you have customer data in your sales records and similar records in your accounts department, the consistency dimension checks whether the two match and to what percentage. If both represent the same clients, the data should be consistent and accurate. If records that are supposed to match differ because of errors or discrepancies in one of them, the data quality is suspect and needs to be monitored and corrected.
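As a rough sketch of the sales-versus-accounts comparison, the snippet below matches two hypothetical record sets on a shared customer ID and reports what share of them agree. The IDs, names, and phone numbers are invented, and a real check would likely compare field by field rather than whole records.

```python
# Hypothetical records keyed by customer ID; all values are made up.
sales = {
    101: {"name": "Amina", "phone": "555-0101"},
    102: {"name": "Brian", "phone": "555-0102"},
    103: {"name": "Carol", "phone": "555-0103"},
}
accounts = {
    101: {"name": "Amina", "phone": "555-0101"},
    102: {"name": "Brian", "phone": "555-0199"},   # phone differs from sales
    104: {"name": "David", "phone": "555-0104"},   # not present in sales
}

# Customers that exist in both systems can be compared directly.
shared_ids = sorted(sales.keys() & accounts.keys())
mismatched = [cid for cid in shared_ids if sales[cid] != accounts[cid]]
match_rate = (len(shared_ids) - len(mismatched)) / len(shared_ids)

# Customers that exist in only one system are worth flagging too.
only_in_one = sorted(sales.keys() ^ accounts.keys())

print(shared_ids)    # [101, 102]
print(mismatched)    # [102]
print(match_rate)    # 0.5
print(only_in_one)   # [103, 104]
```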

Integrity

Data integrity encapsulates the other dimensions: it covers the maintenance and assurance of data accuracy and consistency over the data’s entire life cycle. In other words, data integrity is about maintaining the data’s true state across its life cycle. When data loses its integrity, the affected records become invalid, inaccurate, or inconsistent, which lowers overall quality. In other contexts, data integrity also relates to the safety of the data and its compliance with data regulations such as GDPR.

Uniqueness

Each data record should appear only once in the dataset. Uniqueness ensures there are no duplicate records in your dataset and reduces overlaps. Having a single entry for each data point makes it easier to ensure compliance and to engage customers.
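Here is a small sketch of a uniqueness check in plain Python. It assumes, purely for illustration, that an email address identifies a customer and that differences in casing or surrounding spaces should not count as a different person; your own matching key may well be something else.

```python
from collections import Counter

# Hypothetical records; treating email as the identifying key is an assumption.
records = [
    {"id": 1, "email": "amina@example.com"},
    {"id": 2, "email": "brian@example.com"},
    {"id": 3, "email": "Amina@Example.com "},   # same address, different casing/spacing
]


def normalize(email):
    """Strip surrounding whitespace and lowercase so near-identical keys match."""
    return email.strip().lower()


# Count how often each normalized key appears and keep those seen more than once.
counts = Counter(normalize(r["email"]) for r in records)
duplicates = sorted(key for key, n in counts.items() if n > 1)

# Keep only the first record seen for each key.
seen = set()
deduplicated = []
for r in records:
    key = normalize(r["email"])
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)

print(duplicates)                        # ['amina@example.com']
print([r["id"] for r in deduplicated])   # [1, 2]
```

Keeping the first occurrence is the simplest policy; in practice you might prefer the most recent or the most complete record instead.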

When checking data quality compliance, these are the dimensions one focuses on. They are also the main checks when cleaning data for analysis. Next, we will look at how to make sure the data we collect is of high quality from the start.
