Extract, Transform, Load (ETL) is a three-step process that extracts data from a variety of operational applications (such as customer, sales, and product databases), transforms it to improve its consistency, and loads the finished data into a data warehouse. See my previous post on ETL for more details. Although the ETL process has many uses, errors and discrepancies in the data can undermine it, so the process must be tested to validate the data and catch issues. Testing the data through a multi-step process ensures that it is accurate and useful to a business. For resources to get started with this concept, see the ETL testing wiki page.
Almost every business, whether it deals in goods or services, can make use of data. Data offers unique insights into consumers, and trends and preferences can be drawn from large data sets that would otherwise be impossible to see. A sophisticated, well-tested ETL process is therefore necessary to ensure the data is of the highest quality and of real use to a business. The ETL testing process can be divided into five stages:
1) Identify Data Sources:
It is important to consider what one wants out of the data before deciding to use it. ETL process designers must first examine the data sources and run multiple tests on the desired data. Data profiling means monitoring the quality of the data over a period of time and tracking changes, which allows for analysis of the data's quality and performance. If the data does not meet a pre-specified requirement, it is unfit for use.
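A data profiling check can be sketched as follows: measure a quality metric per column (here, the null rate) and flag any column that misses a pre-specified requirement. The field names and the 10% threshold are illustrative assumptions, not part of any particular tool.

```python
def profile(records, max_null_rate=0.1):
    """Return {column: null_rate} and the set of columns failing the requirement.

    max_null_rate is a hypothetical pre-specified requirement: a column whose
    null rate exceeds it is considered unfit for use.
    """
    columns = set().union(*(r.keys() for r in records))
    null_rates = {}
    for col in columns:
        nulls = sum(1 for r in records if r.get(col) in (None, ""))
        null_rates[col] = nulls / len(records)
    failing = {c for c, rate in null_rates.items() if rate > max_null_rate}
    return null_rates, failing

# Illustrative batch from a hypothetical customer database
batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4, "email": "d@example.com"},
]
rates, failing = profile(batch)
# "email" has a 25% null rate, above the 10% threshold, so it fails profiling
```

Running the same profile on each batch over time is what lets you track changes in quality, as described above.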
2) Data Extraction:
Tests need to be in place during the extraction phase of ETL to make sure the process is running correctly. When acquiring data, it is crucial to analyze the metadata to confirm that the data is consistent. If the data cannot be transformed to meet the requirements of the data warehouse where it will be stored, it is unusable. A testing process known as target mapping may therefore be used to ensure consistency: it traces the path from the source files and compares them to the final database tables, showing which parts of the data need to be transformed to fit the structure of the database.
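A minimal sketch of such a source-to-target mapping check, assuming a hypothetical mapping from source fields to target columns: for each target column, verify that the mapped source field exists and that its value can actually be converted to the target type.

```python
# Hypothetical extracted record and source-to-target mapping
source_row = {"cust_name": "Acme", "order_total": "199.99", "order_dt": "2018-01-05"}

# target_column -> (source_field, converter to the target type)
target_mapping = {
    "customer_name": ("cust_name", str),
    "total_amount":  ("order_total", float),
    "order_date":    ("order_dt", str),
}

def validate_mapping(row, mapping):
    """Return a list of (target_column, problem) pairs; empty means consistent."""
    problems = []
    for target_col, (source_field, convert) in mapping.items():
        if source_field not in row:
            problems.append((target_col, "missing source field"))
            continue
        try:
            convert(row[source_field])
        except (TypeError, ValueError):
            problems.append((target_col, "cannot transform value"))
    return problems

issues = validate_mapping(source_row, target_mapping)
# issues is empty: this row maps cleanly onto the target tables
```

A row that drops a source field, or carries a value like `"N/A"` in `order_total`, would show up in `issues` as data that needs transformation before it can fit the warehouse structure.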
3) Dimensional Modeling:
Data being loaded into a database must be subjected to dimensional modeling to ensure that the data is queryable. Applying dimensional modeling techniques in database design improves the imported data's understandability, query performance, and extensibility, and it is important to test that the modeling was effective in meeting these three objectives. Read more at the Dimensional Modeling wiki page.
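The core dimensional-modeling pattern is the star schema: a fact table of measures joined to dimension tables by surrogate keys. The sketch below, using hypothetical sales tables in an in-memory SQLite database, shows why the result is queryable, since slicing a measure by a dimension attribute is a single join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    amount   REAL
);
""")
db.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
               [(20180101, 2018, 1), (20180201, 2018, 2)])
db.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [(20180101, 100.0), (20180101, 50.0), (20180201, 75.0)])

# A typical dimensional query: total sales per month via a join on the key
rows = db.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d USING (date_key)
    GROUP BY d.month ORDER BY d.month
""").fetchall()
# rows == [(1, 150.0), (2, 75.0)]
```

Testing the model against the three objectives then means checking that queries like this stay simple (understandability), fast (query performance), and unchanged when new dimensions are added (extensibility).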
4) Populating Data:
When finally populating a database, there are many techniques for testing the quality of the data and verifying that the process has transformed and loaded it correctly. Tests need to be conducted to ensure that the data has referential integrity and has been indexed and partitioned properly. They also need to ensure that errors are logged and that recovery is possible.
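One such test, sketched below under assumed table and field names, checks referential integrity after a load: every foreign key in the fact rows must resolve to a dimension row, and each violation is logged rather than silently dropped, which supports the error-logging and recovery requirement.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl.load")

# Hypothetical loaded data: dimension keys and fact rows referencing them
dim_customer_keys = {1, 2, 3}
fact_orders = [
    {"order_id": 10, "customer_key": 1},
    {"order_id": 11, "customer_key": 99},   # orphan: no such customer
    {"order_id": 12, "customer_key": 2},
]

def check_referential_integrity(facts, dim_keys):
    """Return orphan fact rows, logging each one so it can be recovered later."""
    orphans = [row for row in facts if row["customer_key"] not in dim_keys]
    for row in orphans:
        log.warning("orphan fact row %s: unknown customer_key %s",
                    row["order_id"], row["customer_key"])
    return orphans

orphans = check_referential_integrity(fact_orders, dim_customer_keys)
# exactly one orphan is found and logged: order 11
```

In a real warehouse the same check is often done in SQL (an anti-join from the fact table to the dimension), but the logic is identical.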
5) Reporting:
The final stage of ETL testing is to develop sophisticated reporting. These reports should indicate the status of your data and how successful the ETL process was, and they should support trend analysis of the data process (which can highlight areas for improvement). It also helps if the reports have drill-down capabilities, so insights can be drawn from specific details such as the date or the specific tests conducted.
As with every process, ETL testing may not be perfect. The goal of testing is to make data processing as accurate as possible, but some errors may still occur. One issue that plagues businesses is the loss of data during testing: if data does not meet requirements, it is unusable, and reducing the amount of usable data shrinks the sample size, which can make the inferences drawn less accurate. Another complication is the sheer size of the data. The larger the data set, the longer it takes to process. Most data drawn from the web is unstructured and unique, so it is challenging to create a database architecture that is compatible with all the sources a business draws from. Due to this high volume of data, any inefficiency in the testing process is magnified; a small mistake can have lengthy consequences when huge amounts of data from many sources are being tested.
An ETL tester has responsibilities they must adhere to when testing the ETL process, and there are best practices that make a tester as effective as possible. The first step is to check after transformation that the data was transformed in the desired fashion, and that no data was lost in the transformation process. When the data is loaded into a warehouse, the tester is also responsible for observing whether the warehouse properly reports invalid data and whether the load completed correctly. The tester then needs to verify that the loaded data meets the desired scalability and performance requirements; if it does not, the data will be of low quality and of little use to anyone.
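The first two checks above, that data was transformed as desired and that none was lost, can be sketched as a reconciliation test: compare row counts and a simple checksum over a key column before and after the transformation. The transform and field names here are hypothetical examples.

```python
# Hypothetical extracted rows with string prices to be converted
source = [{"id": 1, "price": "10"}, {"id": 2, "price": "20"}, {"id": 3, "price": "30"}]

def transform(rows):
    """Example transformation: convert the price field to a float."""
    return [{"id": r["id"], "price": float(r["price"])} for r in rows]

target = transform(source)

def reconcile(src, tgt, key="id"):
    """True if no rows were lost and the key values survived the transform."""
    return (len(src) == len(tgt)
            and sum(r[key] for r in src) == sum(r[key] for r in tgt))

ok = reconcile(source, target)
# ok is True: three rows in, three rows out, same id checksum
```

If the transformation silently drops a row, `reconcile` returns `False`, which is exactly the data-loss condition the tester is responsible for catching.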
There have been advances in ETL testing, and some companies have taken strides to make it a much more agile and effective process. Alooma has created a testing environment that is more agile-oriented than the traditional waterfall development process. Just as much of today's software production relies on automation, Alooma has developed automated testing functions such as performance evaluation, metadata testing, and production evaluation. This differs from the traditional approach because testing is conducted during regular operation of the ETL process, limiting the time required to test.
Whether the mistake is human-induced or code-induced, there are bound to be errors in any lengthy process, especially an ETL process. The process must be tested at each stage to ensure that the data is loaded smoothly and is of high quality. Traditionally, the ETL process has followed a waterfall-based ideology; I, however, believe a more agile-oriented environment will prove more efficient in the future.
“ETL Testing or Data Warehouse Testing Tutorial.” Guru99, www.guru99.com.
Lindsay, Walter. “What Is ETL and ETL Testing?” Liaison Technologies, 22 Mar. 2017, www.liaison.com.
“ETL / Data Warehouse Testing – Tips, Techniques, Processes and Challenges.” Software Testing Help, www.softwaretestinghelp.com.
V, Anoop. “What Are the Challenges of ETL Testing?” ZYXWARE Technologies, www.zyxware.com.
“ETL Testing Challenges.” Tutorials Point, www.tutorialspoint.com.
“Conquering the Challenges of Data Warehouse.” Software Testing Blog by Cigniti Technologies, Cigniti Technologies Limited, 13 Apr. 2017, www.cigniti.com.
Amar, Rami. “ETL Testing: The Future Is Here.” Alooma, Alooma Inc., www.alooma.com.
“Dimensional Modeling.” Wikipedia, Wikimedia Foundation, 29 Dec. 2017, en.wikipedia.org.