Data validation refers to the process of ensuring the accuracy and quality of data. It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.
In automated systems, data is entered with minimal or no human supervision. Therefore, it is necessary to ensure that the data that enters the system is correct and meets the desired quality standards. The data will be of little use if it is not entered properly and can create bigger downstream reporting issues. Unstructured data, even if entered correctly, will incur related costs for cleaning, transforming, and storage.
Types of Data Validation
There are many types of data validation. Most data validation procedures will perform one or more of these checks to ensure that the data is correct before storing it in the database. Common types of data validation checks include:
1. Data Type Check
A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data. If this is the case, then any data containing other characters such as letters or special symbols should be rejected by the system.
2. Code Check
A code check ensures that a field is selected from a valid list of values or follows certain formatting rules. For example, it is easier to verify that a postal code is valid by checking it against a list of valid codes. The same concept can be applied to other items such as country codes and NAICS industry codes.
3. Range Check
A range check will verify whether input data falls within a predefined range. For example, latitude and longitude are commonly used in geographic data. A latitude value should be between -90 and 90, while a longitude value must be between -180 and 180. Any values out of this range are invalid.
4. Format Check
Many data types follow a certain predefined format. A common use case is date columns that are stored in a fixed format like “YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure that ensures dates are in the proper format helps maintain consistency across data and through time.
5. Consistency Check
A consistency check is a type of logical check that confirms the data’s been entered in a logically consistent way. An example is checking if the delivery date is after the shipping date for a parcel.
6. Uniqueness Check
Some data like IDs or e-mail addresses are unique by nature. A database should likely have unique entries on these fields. A uniqueness check ensures that an item is not entered multiple times into a database.
Consider the example of a retailer that collects data on its stores but fails to create a proper check on the postal code. The oversight could make it difficult to leverage the data for information and business intelligence. Several problems can occur if the postal code is not entered or entered improperly.
It can be difficult to define the location of the store in some mapping software. A store postal code will also help generate insights about the neighborhood where the store is located. Without a data check on the postal code, it is more likely to lose the value of data. It will result in further costs if the data needs to be recollected or the postal code needs to be manually entered.
A simple solution to the problem would be to put a check in place that ensures a valid postal code is entered. The solution could be a dropdown menu or an auto-complete form that allows the user to choose the postal code from a list of valid codes. Such a type of data validation is called a code validation or code check.
Data Validation in Excel
The following example is an introduction to data validation in Excel. The data validation button under the data tab provides the user with different types of data validation checks based on the data type in the cell. It also allows the user to define custom validation checks using Excel formulas. The data validation can be found in the Data Tools section of the Data tab in the ribbon of Excel:
Data Entry Task
The example below illustrates a case of data entry, where the province must be entered for every store location. Since stores are only located in certain provinces, any incorrect entry should be caught.
It is accomplished in Excel using a two-fold data validation. First, the relevant provinces are incorporated into a drop-down menu that allows the user to select from a list of valid provinces.
Second, if the user inputs a wrong province by mistake, such as “NY” instead of “NS,” the system warns the user of the incorrect input.
Further, if the user ignores the warning, an analysis can be conducted using the data validation feature in Excel that identifies incorrect inputs.
Thank you for reading CFI’s guide to Data Validation. In order to help you become a world-class \analyst and advance your career to your fullest potential, these additional resources will be very helpful: