Data Profiling vs Data Testing
December 17, 2020 • Eaujenae Francisco
Data Profiling
Data profiling is simply discovering and summarizing data that is in an existing source. Whether you’re viewing it for the first time, or you are monitoring the health of your data, data profiling will help you learn about the state your data is currently in. You’re able to analyze relationships between tables, assess the range of values for a particular attribute, and even see how large your tables are. All of these measures can give a complete view of your data while also revealing inconsistencies and possible errors.
A good data profile will include statistical measures such as mean, minimum, maximum, counts, and percentiles that can be analyzed to provide a quantitative story about your data. For example, to determine how complete a column is, you can compare its count of null values to its count of non-null values. If the null count is higher, you know that the majority of that column’s values are missing. You can then measure that result against your expectations to determine whether there is a possible error you need to hunt down.
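As a rough illustration of the null-count comparison described above, here is a minimal Python sketch of a single-column profile. The function name `profile_column`, the sample `order_totals` values, and the use of `None` to stand in for a null are all assumptions made for the example, not part of any particular tool.

```python
import statistics

def profile_column(values):
    """Summarize one column; None stands in for a missing (null) value."""
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    profile = {
        "count": len(values),
        "null_count": null_count,
        "non_null_count": len(non_null),
        # True when more values are missing than present.
        "mostly_missing": null_count > len(non_null),
    }
    if non_null:
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
        profile["mean"] = statistics.mean(non_null)
        # The median is the 50th percentile of the non-null values.
        profile["median"] = statistics.median(non_null)
    return profile

# Hypothetical column: four of the seven values are missing.
order_totals = [20.0, None, 35.0, None, None, 50.0, None]
summary = profile_column(order_totals)
print(summary)
```

Here the null count (4) is higher than the non-null count (3), so the profile flags the column as mostly missing. Whether that is an error depends on what you expected from the column.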
Data profiles can provide insights at both the column and the table level. By focusing on one column, you can learn about the behavior of an attribute: its data type, its constraints, its range of values, and which values appear most often. If you profile the entire table, you will want to know about the uniqueness of each record, what the functional dependencies are, and what the primary and foreign key relationships look like. Each of these metrics helps support the data model.
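The difference between the two levels can be sketched in a few lines of Python. This is only an illustration under assumed data: the `rows` list, the column names, and the helpers `column_profile` and `is_unique_key` are hypothetical. It covers data types, distinct values, the most frequent value, and record uniqueness, and leaves out functional dependencies and foreign keys.

```python
from collections import Counter

# Hypothetical table, represented as a list of records.
rows = [
    {"customer_id": 1, "state": "TX"},
    {"customer_id": 2, "state": "TX"},
    {"customer_id": 3, "state": "CA"},
    {"customer_id": 3, "state": "CA"},  # duplicate record
    {"customer_id": 4, "state": "TX"},
]

def column_profile(rows, column):
    """Column level: data types seen, distinct values, most frequent value."""
    values = [row[column] for row in rows]
    return {
        "types": sorted({type(v).__name__ for v in values}),
        "distinct": len(set(values)),
        "most_common": Counter(values).most_common(1)[0],
    }

def is_unique_key(rows, column):
    """Table level: does this column identify each record uniquely?"""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

state_profile = column_profile(rows, "state")
key_is_unique = is_unique_key(rows, "customer_id")
print(state_profile)
print(key_is_unique)
```

In this sample, `customer_id` fails the uniqueness check because the value 3 appears twice, which is the kind of finding that would call a primary key assumption into question.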
Data Quality Testing
While data profiles are used to examine data sets that already exist, data quality testing is used to validate how well existing and newly created data sets meet expectations. The high-level expectations, or key data quality metrics, you’ll want to check are whether the data set is complete, unique, valid, consistent, accurate, and timely. You want to know whether your data is coming from its source system as intended and whether it is moving through your data pipeline as designed.
Data quality tests are usually run by writing SQL-based test scripts against a database. By querying the database, you’re asking your data questions and checking that the answer it returns is what you expect. Data testing is more flexible in its execution than data profiling. Instead of having a standard set of metrics to view, tests can be customized to evaluate business-specific processes. Every data quality metric should be evaluated to ensure that data is trustworthy, but the evaluation of each metric will look different from company to company.
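To make the idea of SQL-based tests concrete, the sketch below runs two of them from Python against an in-memory SQLite database. The `customers` table, its columns, the test names, and the convention that each query returns a count of violations are assumptions made for this example. Real test suites will differ from company to company.

```python
import sqlite3

# Hypothetical table loaded into an in-memory database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (3, "c@example.com")],
)

# Each test is a query that counts violations; the expectation is zero.
tests = {
    "email_is_complete": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
    "customer_id_is_unique": """
        SELECT COUNT(*) FROM (
            SELECT customer_id FROM customers
            GROUP BY customer_id HAVING COUNT(*) > 1
        )
    """,
}

def run_tests(conn, tests):
    """Run every query and compare its answer to the expected value of zero."""
    results = {}
    for name, sql in tests.items():
        violations = conn.execute(sql).fetchone()[0]
        results[name] = {"violations": violations, "passed": violations == 0}
    return results

results = run_tests(conn, tests)
for name, outcome in results.items():
    print(name, outcome)
```

Both tests fail on this sample data: one email is null, and one `customer_id` is duplicated. Adding a business-specific test is a matter of adding another named query to the `tests` dictionary.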
The benefit of having a comprehensive data profile is that it helps keep data consistent over time and increases company knowledge of your business’s data. Understanding data types, constraints, and relationships is what turns data into information that supports well-informed business decisions. Thorough data quality testing ensures that business processes work as expected and that end users are working with trustworthy data. The goal of both approaches is to increase trust in your data. Once that trust is established, it must be maintained, and a combination of data profiling and data quality testing is what makes that possible.
Validatar Helps Organizations Know Their Data
It can be time-consuming and tedious to get to know your data, especially within organizations where data is constantly changing. There is a better way. Validatar is an automated data quality management tool that helps organizations know and trust their data. From data discovery and data profiling to data testing and regular data monitoring, Validatar improves trust in an organization’s data, leading to better data, better insights, and better results.
Want to learn more about Validatar? Connect with us.