
Data Science Series (2 of 3): Data Extraction and Transformation in ETL Projects
Nowadays we are increasingly connected to the internet, which means we consume more information and, consequently, generate it in equal or greater measure. This vast sea of data is a competitive advantage for organizations capable of capturing it and extracting real value from it. We saw this in detail in the previous article, "Data Science Series: Data and Roles (1 of 3)", which explains the value of data, the different types we can find, and the options for storing it, as well as the roles and processes involved when hiring this type of service.
In this article, however, we will focus on more technical aspects: data extraction and transformation. Understanding them shows how we can activate data and start generating new value.
About this article
This is the second article in the series on Data Science. The articles can be read in any order, since the content is divided into stages that, although strongly connected, can be understood individually. Each publication aims to shed light on the processes carried out in the industry; it can help you decide whether your organization should hire a service to migrate its data to the cloud, or, if you are a student, show you how these kinds of projects are developed. Some concepts explained elsewhere in the series are assumed to be known, so if some key terms sound unfamiliar, we recommend reading the entire series.
There are different professionals involved in the development of Data Science projects; in this case, however, we will focus on the role of the Data Engineer.
What role does a Data Engineer play?
The Data Engineer builds ETL flows so that data from different platforms can be easily queried and delivered in the appropriate format. These flows contain a series of transformations, which are simply scripts programmed to modify and enrich the data. The Data Engineer is also expected to create automations that execute the flow and its ETL processes periodically. Essentially, the Data Engineer ensures that the other roles have data to analyze. Below is a detailed explanation of what the ETL flow consists of.
Enrich the data: any process through which a dataset undergoes a transformation that creates new columns or new data within it. It can also result in standardized, cleaner data after the transformation.
ETL Flow
To understand where to extract our data from and why we must transform it, we must first understand what the ETL flow is.
ETL stands for Extract, Transform, and Load. It is a process that extracts information from different data sources into a Data Lake, transforms the data within an intermediate path or zone, and finally loads it into a Data Warehouse to be consumed.

Extraction
In extraction, data is collected from various sources; the selected data is stored in a staging area. It is quite common to take data sources from files extracted from SAP or other similar platforms that manage business data.
Staging area: an area dedicated entirely to data processing. It is where data is taken from to perform transformations and, at the same time, where processing files are stored: files generated and consumed by other transformation steps in the ETL process.
During the extraction process, the selected data can have various formats, including:
- Tables
- SQL databases
- NoSQL databases
- Unstructured data
- Text files
It is important to leave the extracted data in a staging area because extracted data very commonly contains inconsistent, dirty values in an unpredictable variety of formats. By placing the data in a dedicated space, the process can read it and correct all its errors through transformations, generating new data in the correct format, ready to be analyzed.
Data is considered inconsistent when information in the same column or document has values with different formats. For example, a date column that receives some values in the format 2021/11/09 and others in formats like 2022-03-10 14:30.
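As a quick, hedged illustration, inconsistent dates like those above can be standardized during transformation. The sketch below uses Pandas (one of the libraries mentioned later in this article); the column name fecha and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical column with mixed date formats, as in the example above
df = pd.DataFrame({'fecha': ['2021/11/09', '2022-03-10 14:30']})

# Parse each value individually (the parser accepts both formats),
# then re-emit every date in a single standard format
df['fecha'] = df['fecha'].apply(pd.to_datetime).dt.strftime('%Y-%m-%d')

print(df['fecha'].tolist())  # ['2021-11-09', '2022-03-10']
```

After this step, every value in the column shares one format, so downstream transformations can rely on it.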
Transformation
In the transformation process, the data loaded during extraction is converted to a standard format to be loaded into another zone of the Data Lake or a Data Warehouse, thus obtaining results to analyze or present.
Transformations are performed multiple times with different scripts during the same flow or pipeline. These are generally done in scripts that can be built with different languages and libraries. The most common is to use Python as the programming language and Pandas or PySpark as libraries to work with data.
Data pipeline: A series of data processes executed in series where each process has an input and an output that continues the flow.
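A minimal sketch of this idea in Python (the function names and data are illustrative, not taken from any real project): each process's output becomes the next process's input.

```python
def extract():
    # Simulated extraction: raw records with inconsistent formatting
    return [' Ropa ', 'accesorios', 'ROPA']

def transform(records):
    # Standardize each value: strip whitespace, lowercase
    return [r.strip().lower() for r in records]

def load(records):
    # Simulated load: deduplicate and return the final result
    return sorted(set(records))

# The pipeline: processes executed in series, output feeding input
result = load(transform(extract()))
print(result)  # ['accesorios', 'ropa']
```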
Transformation scripts perform sequential tasks over all elements, such as:
- Apply formatting: We can define the data type of each column and perform transformations according to the assigned data type.
- Apply filters: To obtain only the values we need.
- Clean columns: This way, we can apply a transformation on empty or null values and at the same time unify values that are repeated or recorded differently but refer to the same thing.
- Integrate data: We can obtain new data by integrating two different data sources or columns.
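The four tasks above can be sketched with Pandas; every column, value, and source name below is hypothetical, invented only for this example:

```python
import pandas as pd

ventas = pd.DataFrame({
    'producto': ['camisa', 'camisa ', None, 'gorra'],
    'monto': ['100', '250', '80', '40'],
})
proveedores = pd.DataFrame({
    'producto': ['camisa', 'gorra'],
    'proveedor': ['A', 'B'],
})

# 1. Apply formatting: define the data type of the amount column
ventas['monto'] = ventas['monto'].astype(int)

# 2. Apply filters: keep only the values we need
ventas = ventas[ventas['monto'] >= 50].copy()

# 3. Clean columns: handle nulls and unify values written differently
ventas['producto'] = ventas['producto'].fillna('desconocido').str.strip()

# 4. Integrate data: combine two sources to obtain new columns
resultado = ventas.merge(proveedores, on='producto', how='left')
print(resultado)
```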
Load
In this process, the transformed data is loaded into a Data Warehouse. This is the final result of the ETL process, and the idea is that the entire flow from extraction to loading is executed regularly at specific time intervals.
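As an illustrative sketch of the load step, the example below uses an in-memory SQLite database as a stand-in for a real Data Warehouse; the table and column names are invented for this example:

```python
import sqlite3
import pandas as pd

# Transformed data, ready to be loaded
resultado = pd.DataFrame({
    'tipo_producto': ['ropa', 'accesorios'],
    'total_ventas': [350, 120],
})

# Load into the "warehouse" (SQLite stands in for the real one)
conn = sqlite3.connect(':memory:')
resultado.to_sql('ventas_consolidadas', conn, index=False, if_exists='replace')

# Once loaded, the data can be queried by reporting tools
consulta = pd.read_sql_query(
    'SELECT tipo_producto, total_ventas FROM ventas_consolidadas', conn)
print(consulta)
```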
Once the data is available in a Data Warehouse, it can be queried and used to visualize data from a reporting system like Tableau or Google Data Studio. Next, we will explore in depth the second process, that is, how data is transformed.
Activating the data
Being able to leverage available data is essential for organizations, even more so if you want to lead the business through a digital transformation process. One of the most common ways to leverage data is by taking two or more datasets and obtaining new results by mixing the data.
We can build new data based on some criteria: if we take a sales dataset and a suppliers dataset, by relating them we can obtain a new column indicating the supplier's seniority, or create a new value by combining the content of values from both datasets.
Dataset: Datasets are sets of different data that are cohesive or generated from the same source. They generally have a tabular format, meaning they have rows and columns. An example would be data generated by a user when making an online purchase.
Data mapping
To start taking advantage of an organization's information, it is first necessary to have the data mapped. To do this, a data map can be generated, which consists of a diagram identifying all data sources generated in business processes.
The base component from which to start identifying the data being generated is the datasets. We can group a set of datasets under the same origin to form a data Bucket. For example, we can have a Bucket with all User/Client datasets, another with Corporate data, and finally a Bucket with Industry/Competition data.
Identify unmapped data
These three Buckets make up the data map. If more datasets are added to the Buckets, much richer conclusions can be drawn from this new information.
However, there may be occasions when there is data that we are not mapping or locating correctly but know exists. In that case, we can start from the Buckets to determine which dataset we are missing. For example: my organization is generating data from some operations that we have not yet captured, so in the data map we indicate that a dataset is pending in the Corporate data Bucket.
Cloud migration patterns
Cloud migrations give systems and solutions access to various benefits, such as developing more agile and secure solutions, and working with data from the cloud, saving costs and generating new value. Also, if the on-premise solution is obsolete or legacy, it can be completely transformed to integrate other services and modernize it.
Moving solutions to the cloud is not a process that always follows the same rules; it depends on the context. There are at least five patterns that can be adopted when you want to modernize an on-premise solution with the cloud.
On-premise: Refers to applications or systems developed and run locally or on the business's own infrastructure.
01. Move and change
This pattern consists of moving the application to the cloud, making minimal modifications so it operates, and once running, updating and improving the original solution more simply and agilely.
02. Change and move
The next pattern is similar to the previous one, only the steps are reversed. The on-premise solution is modified to prepare it for the cloud, and once ready, the migration is performed to start running the application and apply any changes or improvements needed, taking advantage of the services associated with the selected cloud provider.
03. Maintain and extend
This consists of building a new solution from scratch in the cloud that is an extension or connects with the on-premise application. An example of this pattern is creating APIs to connect the local application with other services or applications that cannot be integrated into the original application.
04. Maintain and replace
Another very common alternative is to keep the on-premise application running while creating a new application in the cloud environment so that once developed, it replaces the on-premise solution and thus begins to obtain greater value and agility for the business.
05. Only move
In some cases, it is enough to simply move the application or the data in order to modernize the infrastructure layer. This can be done with simpler or more common products. For example: migrating the organization's data to the cloud, or modernizing an application's infrastructure by creating a virtualized environment for disaster recovery, among many other cases.
Next, we will see a concrete case where transformations are applied.
Transforming the data
As described above, transforming information helps us make the most of the data we have, while also giving us greater control over it for analysis.
Use case
To demonstrate how transformations are performed, we will use the construction of a report as an example.
This on-premise report may take a certain amount of time to complete and is built by taking many different data sources, which must be manually queried and configured to obtain the desired metrics. Then all the data is gathered in an Excel file, and several calculations are applied. The final report would have a format similar to this:

The resulting report simulates data from a company distributing various products; we can see that it is a report with concentrated information on the results obtained with some products, as well as the calculation of the budget and Forecast for each one. This is the resulting table, and to get there, we must perform transformations on different sources, specifically a series of transactional and master data.
Transactional data
Transactional data is data processed at regular periods; each run usually brings new data, modifications of previous data, or the same data again.
In our example, we have several transactional data, which we will call venta_ropa.csv, venta_accesorios.csv, and costos_margen.csv
What we will do is take the sales sources and group data based on tipo_producto. At the same time, we need to relate this data with the master source to obtain descriptive data that will be used to create the columns name and description.
Master data
Master data or master tables are data that usually do not change over time and have predefined values. In the example, these are tables that have the codes of each product along with their description.
In our example, we will use only one master data called tipo_producto_detail.csv
Both transactional and master data would have a format similar to the following:

Transformations applied to the case
In this case, we will solve the transformations using PySpark, a library for Big Data processes designed to work on Data Lakes and Data Warehouses. It focuses on processing speed and can be used with Spark in Python, Java, R, and Scala. In our case, we will use Python as the programming language.
When starting to create scripts, it is common to first develop scripts called scen, which contain all the logic and tests needed to build the script; later, once tested, this script is converted into the development version with slight modifications.
Read data
To read data in PySpark, we first use read.format() to indicate the type of file we are going to read; then with .option(), we select options separately; in this case, we use it to define the file delimiter and whether it includes headers. Finally, with .load(), we indicate the path where the file is located.
In the example, we read only venta_ropa.csv, but the logic is the same for the other sources.
Headers: The first row of the table containing the names of each column.
from pyspark.sql import SparkSession

# Create the Spark session and read the file
spark = SparkSession.builder.getOrCreate()
ropa_df = (spark.read.format('csv')
           .option('delimiter', ';')
           .option('header', False)
           .load('/datasets/transactional/venta_ropa.csv'))
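The next steps described earlier (grouping the sales by tipo_producto and relating them to the master file to obtain the name and description columns) are performed in the project with PySpark; as a quick, hedged illustration, the equivalent logic is sketched below in Pandas (the PySpark version uses the analogous groupBy and join methods). All values, and every column name other than tipo_producto, name, and description, are hypothetical:

```python
import pandas as pd

# Simulated rows from venta_ropa.csv (the monto column is hypothetical)
ventas = pd.DataFrame({
    'tipo_producto': ['TP01', 'TP02', 'TP01'],
    'monto': [100, 40, 60],
})
# Simulated rows from tipo_producto_detail.csv (the master data)
detalle = pd.DataFrame({
    'tipo_producto': ['TP01', 'TP02'],
    'name': ['Camisas', 'Pantalones'],
    'description': ['Camisas de vestir', 'Pantalones de tela'],
})

# Group the sales by tipo_producto, then enrich with the master data
agrupado = ventas.groupby('tipo_producto', as_index=False)['monto'].sum()
reporte = agrupado.merge(detalle, on='tipo_producto')
print(reporte)
```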



