
Data Science Series (3 of 3): Data Loading and Integration in the Cloud with AWS
Organizations depend on their data to make intelligent decisions, and that decision-making usually falls to senior executives and management teams. For data to be understood, it must arrive in a report or similar format that is easy to interpret. There are many ways to generate reports; ideally, however, they are built automatically.
About this article
This is the third article in the Data Science series; it builds on Part 1 and Part 2.
Each article can be read independently of the order, since the content is divided into stages that, although strongly connected, can be understood individually. Each publication aims to shed light on processes used in the industry, whether to help you decide if your organization should hire a service to migrate its data to the cloud or, if you are a student, to show how this type of project is developed. Some concepts explained earlier in the series are assumed to be known, so it is recommended to read the entire series if any key terms sound unfamiliar.
One of the fundamental processes in Data Science projects is data loading. In this article, we will see what loading processes for visualization consist of and how data reconciliation is performed to ensure that the developed product is returning the data expected by the business.
Data loading
Data loading is the last step of the ETL flow and makes data available for visualization: presenting data visually in a system so that clients can make vital decisions. Ideally, the load is automated, along with the transformation and cleaning tasks that precede it. Several tools exist to automate all of these processes; nowadays, the most prominent option is the cloud.
Why might we be interested in migrating to the cloud?
The cloud is the technology that allows storing, managing, and providing access to all of your business data. Beyond sparing organizations the cost of buying their own equipment and data centers to manage business information, the cloud gives companies access to a wide variety of powerful applications and services for modernizing the business, turning technology into a competitive advantage that drives the business forward.
Cloud platforms offer many services that help unlock your organization's digital potential, including tools to manage and develop Big Data solutions.
Loading in the cloud
Once the transformation processes are completed, the results are sent to the cloud, into a zone of a Bucket in the Data Lake. Since this is the last transformation step, its results are stored in a provision zone to be loaded into a cloud Data Warehouse system such as AWS Redshift, or into AWS Glue with Athena via a Crawler. In this way, the data moves from the provision zone into a partitioned, structured table ready to be queried by a reporting system, enabling data visualization and Business Intelligence practices. The tools available for big data processes include the following:
AWS Glue
This is an AWS service for implementing ETL processes whose goal is to categorize, clean, enrich, and move data. It tracks data sources and loads structured data into a destination. When used with a Crawler and Athena, it builds a catalog by interpreting data types. These catalogs, called the Data Catalog, are metadata repositories that store data formats and schemas.
Glue Crawler
Crawlers analyze data to create the metadata that allows Glue and services like Athena to treat the information stored in S3 as databases with tables. Once the data is available in a structured format, the Glue Data Catalog can be created. On each execution of the ETL flow, the Glue Data Catalog is generated and updated.
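As a rough sketch of how this step can be automated with the AWS SDK for Python (boto3), assuming boto3 is installed and AWS credentials are configured; the crawler name, IAM role, bucket path, and database name below are all hypothetical placeholders:

```python
def crawler_config(name, role_arn, s3_path, database):
    """Build the configuration for a Glue crawler over an S3 path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(config):
    """Create the crawler in Glue and start it; requires boto3 and AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])

config = crawler_config(
    "provision-zone-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://my-datalake/provision-zone/",
    "sales_db",
)
# create_and_run_crawler(config)  # uncomment to run against a real AWS account
```

Once the crawler finishes, the tables it infers appear in the Data Catalog and become queryable from Athena.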
Athena
It is a data analysis tool used to process complex SQL queries quickly. It is a serverless service, so it does not require managing infrastructure, and you pay only for the queries you run (billed by the amount of data each query scans). With this service, data can be queried from platforms like Tableau.
SQL Queries: SQL queries are requests that retrieve specified information from a structured database.
Serverless: These are platforms or systems that run without the user having to provision or manage servers; code executes only when consumed, saving computing power and idle servers.
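A minimal sketch of submitting a query to Athena with boto3, assuming the `sales_db` database, `ventas` table, and S3 output location are hypothetical placeholders:

```python
def athena_query_params(sql, database, output_s3):
    """Build the parameters for an Athena query execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_query(params):
    """Submit the query to Athena; requires boto3 and AWS credentials."""
    import boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(**params)
    return response["QueryExecutionId"]

params = athena_query_params(
    "SELECT categoria, SUM(m3) AS total_m3 FROM ventas GROUP BY categoria",
    "sales_db",
    "s3://my-datalake/athena-results/",
)
# run_query(params)  # uncomment to run against a real AWS account
```

Athena writes the query results to the configured S3 output location, from where they can be fetched or consumed by a reporting tool.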
Redshift
It is an AWS cloud Data Warehouse service that runs SQL queries over big data. You only need to create a cluster to start working with the Redshift Data Warehouse. It is designed for huge amounts of data and processes it in parallel to save time.
Cluster: Clusters consist of one or more servers working together as one to carry out a process that requires computing power.
Big data: Big data refers to all those processes where large amounts of information are processed.
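Loading the provision zone into Redshift is typically done with a COPY statement. A hedged sketch using the Redshift Data API via boto3, where the table name, S3 path, IAM role, and cluster details are all hypothetical:

```python
def build_copy_sql(table, s3_path, iam_role):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

def run_statement(cluster_id, database, db_user, sql):
    """Execute the statement through the Redshift Data API; requires boto3 and credentials."""
    import boto3
    client = boto3.client("redshift-data")
    return client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )

sql = build_copy_sql(
    "ventas",
    "s3://my-datalake/provision-zone/ventas/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# run_statement("my-cluster", "dev", "awsuser", sql)  # uncomment with a real cluster
```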
Tableau
It is a data visualization tool for data analysis and Business Intelligence. It takes data from queryable systems like Athena or Redshift and allows visualizing data with multiple tools and multiple views.
Visualization: Visualization consists of taking the structured business data to generate charts or visual representations of the data that allow understanding the business quickly and simply so that intelligent decisions can be made.
Business Intelligence: Business intelligence consists of interpreting and analyzing data as simply and quickly as possible, through technology and various processes, so that the business can make vital decisions.
Data reconciliation
Once data is loaded into a Data Warehouse, it is available for consumption. However, whether or not sensitive data is involved, reconciliation tasks must be performed to ensure that the final result has not altered the expected values and that the correct values are reached.
Reconciliation tasks are processes where the goal is to reach the same results between the data the user obtains with their manual solution and those reached in the loading or some of the transformation processes.
In some cases, instead of contrasting against a result built by the user or client, the data source against which to reconcile must be manually constructed by taking data from the ingestion zone.
Alternatives for reconciliation
There are many alternatives for performing reconciliation. You can use Excel, or Python libraries such as Pandas and PySpark.
We can reconcile the values of some columns by obtaining the total sum or the number of elements; we can also perform the same calculations or specific filters with the transformed results and confirm that the same values are reached.
The process would be similar to the following:
- Download the original data from the ingestion zone.
- Read the source from some tool.
- Perform calculations, transformations, filters, pivot tables, etc., to reach the same results as the script.
- Compare both sources and share the results with the team.
It is essential that the data source used for transformations and for the manual result is the same. Otherwise, it will be difficult to reach the same results.
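The steps above can be sketched in code. Here is a minimal Pandas example, where the column names and figures are invented for illustration:

```python
import pandas as pd

# Original data downloaded from the ingestion zone (invented sample).
source = pd.DataFrame({
    "CATEGORIA": ["Ropa", "Ropa", "Ropa", "Tecnología"],
    "m3": [300.0, 256.291, 10154.277, 20905.502],
})

# Aggregated result produced by the ETL load (what the script returned).
etl_result = pd.DataFrame({
    "CATEGORIA": ["Ropa", "Tecnología"],
    "sum_m3": [10710.568, 20905.502],
})

# Rebuild the same aggregation manually from the raw source...
manual = (
    source.groupby("CATEGORIA", as_index=False)["m3"]
    .sum()
    .rename(columns={"m3": "sum_m3"})
)

# ...and compare both sources row by row.
comparison = manual.merge(etl_result, on="CATEGORIA", suffixes=("_manual", "_etl"))
comparison["match"] = (
    comparison["sum_m3_manual"].round(3) == comparison["sum_m3_etl"].round(3)
)
print(comparison)
```

If every row of `match` is True, the manual result agrees with the loaded data; any False row points to a grouping whose values diverged somewhere in the pipeline.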
Reconcile with Excel
To reconcile this way, just open your data source with Microsoft Excel or a similar tool.
Next, calculate a series of figures related to the transformation file; for example, if we know that data was joined during the transformation, we should validate by grouping and filtering that specific data, as well as the relevant numerical values.
Then create a pivot table to validate that the groupings produce the same results, as shown in the following image for the filters and values:

Reconcile in PySpark
PySpark is a Python library that allows performing transformations on Big Data with Spark. Below are a series of scripts that can be performed in data reconciliation processes.
Get sum of grouping
Use groupBy() to specify the columns to group by, followed by the type of calculation to perform on each group; in this case, a sum with sum().
dataframe.groupBy('CATEGORIA','TIPO_PRODUCTO').sum('m3').sort('CATEGORIA').show()
Remember that dataframes are a data structure that can be manipulated in code using specialized data analysis libraries such as Pandas and PySpark.
""" Output
+-----------------+---------------+------------------+
|CATEGORIA | TIPO_PRODUCTO| sum(m3)|
+-----------------+---------------+------------------+
| null| null| 31|
| Ropa| Polera| 556.291|
| Ropa| Poleron| 10154.277|
| Tecnología| Ipad| 20905.502|
+-----------------+---------------+------------------+
"""
Get count of grouping
Use groupBy() as in the previous case, or use filters to verify that the obtained values match by counting different indicators.
# 'Ropa' and 'Tecnología' are values of CATEGORIA (see the output above), so filter by that column
count_ropa = df.where(df.CATEGORIA == 'Ropa').count()
count_tecnologia = df.where(df.CATEGORIA == 'Tecnología').count()



