Data Transformation in ETL for Machine Learning: A Practical Guide with Pandas and Advanced Techniques
Team Kranio, November 11, 2024
Data Transformation in ETL
Data transformation is the second crucial step in the Extract, Transform, and Load (ETL) process, which plays a fundamental role in preparing data for advanced analysis and Machine Learning modeling. This step involves a series of operations designed to convert raw data into a format that is more suitable and useful for the specific analysis that needs to be performed. Let's explore in detail the basic and advanced transformations using Pandas, data normalization and structuring, text cleaning and manipulation, feature engineering, handling categorical data, and data validation techniques.
Basic and Advanced Transformations with Pandas
Pandas is a powerful tool in Python for data manipulation due to its ability to efficiently handle complex data structures.
Basic Transformations with Pandas
Basic operations in Pandas are fundamental for daily data manipulation and can include column selection, row filtering, and data sorting.
1. Column Selection: To select a specific column from a DataFrame, simply use the column name inside brackets.
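A minimal sketch of column selection (the DataFrame and its column names 'Nombre', 'Edad', and 'Ciudad' are invented here for illustration):

```python
import pandas as pd

# Hypothetical example data; the article's original dataset is not shown
df = pd.DataFrame({
    'Nombre': ['Ana', 'Luis', 'Marta'],
    'Edad': [25, 30, 35],
    'Ciudad': ['New York', 'Madrid', 'New York']
})

# Selecting a single column by name returns a Series
edades = df['Edad']
print(edades)
```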
2. Row Filtering: You can filter rows based on logical conditions, for example selecting all rows where the age is greater than 28.
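A sketch of boolean-mask filtering, assuming the same invented columns as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Nombre': ['Ana', 'Luis', 'Marta'],
    'Edad': [25, 30, 35]
})

# The boolean mask keeps only the rows where the condition holds
mayores = df[df['Edad'] > 28]
print(mayores)
```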
3. Data Sorting: To sort data by a column, use sort_values(), for example sorting by the 'Edad' (age) column.
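A sketch of sorting with sort_values() (invented data):

```python
import pandas as pd

df = pd.DataFrame({'Nombre': ['Ana', 'Luis', 'Marta'], 'Edad': [30, 25, 35]})

# Sort ascending by 'Edad'; pass ascending=False to reverse the order
ordenado = df.sort_values('Edad')
print(ordenado)
```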
Advanced Transformations with Pandas
Advanced operations allow for more complex transformations and are particularly useful for preparing data for statistical or Machine Learning analysis.
1. Grouping and Aggregations: Group data by one or more columns, then apply aggregation functions such as sum, mean, or maximum.
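A sketch of grouping with groupby() followed by an aggregation (the 'Ciudad' and 'Edad' columns are assumptions carried over from the earlier examples):

```python
import pandas as pd

df = pd.DataFrame({
    'Ciudad': ['New York', 'Madrid', 'New York', 'Madrid'],
    'Edad': [25, 30, 35, 40]
})

# Group rows by city and compute the mean age within each group
media_por_ciudad = df.groupby('Ciudad')['Edad'].mean()
print(media_por_ciudad)
```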
2. Conditional Transformations: Apply transformations based on conditions; for example, increase the age by 1 year only for rows where the city is 'New York'.
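One way to express that conditional update is with .loc indexing (invented data):

```python
import pandas as pd

df = pd.DataFrame({
    'Ciudad': ['New York', 'Madrid', 'New York'],
    'Edad': [25, 30, 35]
})

# Increase 'Edad' by 1 only on the rows where 'Ciudad' is 'New York'
df.loc[df['Ciudad'] == 'New York', 'Edad'] += 1
print(df)
```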
3. Pivot Tables: Pivot tables are useful for summarizing a dataset; for example, a pivot table can show the average age by city and name.
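A sketch with pd.pivot_table, again with invented data; cells with no matching rows come out as NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'Ciudad': ['New York', 'New York', 'Madrid'],
    'Nombre': ['Ana', 'Ana', 'Luis'],
    'Edad': [25, 35, 30]
})

# Average age per (city, name) combination
tabla = pd.pivot_table(df, values='Edad', index='Ciudad',
                       columns='Nombre', aggfunc='mean')
print(tabla)
```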
Data Normalization and Structuring
Data normalization and structuring are two fundamental processes in data treatment for analysis and modeling, especially in the context of Machine Learning projects and advanced data analysis.
Data Normalization
Normalization is a method to scale numerical data within a specific range or according to a particular distribution, which facilitates the comparison and analysis of different features that may have different scales or units. There are several normalization methods, each with its own use cases:
Min-Max Scaling:
Description: This method scales data to fall within a specific range, generally 0 to 1, or -1 to 1 if there are negative values.
Formula: x' = (x - min(x)) / (max(x) - min(x))
Usage: Useful when your model requires a strict numerical range and when outliers that could distort the rescaling are not a concern.
Z-score Standardization (Standard Scaler):
Description: Rescales data to have a mean of 0 and a standard deviation of 1.
Formula: z = (x - μ) / σ, where μ is the mean and σ the standard deviation
Usage: Especially useful when data follows a roughly normal distribution; it is more robust to outliers than min-max scaling.
Example: consider a column A with values [1, 2, 3, 4, 5]. Here, min(A) = 1 and max(A) = 5. Applying the min-max formula to each value:
For A=1: (1-1)/(5-1) = 0.00
For A=2: (2-1)/(5-1) = 0.25
For A=3: (3-1)/(5-1) = 0.50
For A=4: (4-1)/(5-1) = 0.75
For A=5: (5-1)/(5-1) = 1.00
The z-score formula is applied to the same column in the same way, using its mean and standard deviation.
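This worked example can be reproduced in plain Pandas (scikit-learn's MinMaxScaler and StandardScaler implement the same transformations; note that Pandas' .std() uses the sample standard deviation by default):

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Min-max scaling into the [0, 1] range
data['A_minmax'] = (data['A'] - data['A'].min()) / (data['A'].max() - data['A'].min())

# Z-score standardization: mean 0, standard deviation 1
data['A_zscore'] = (data['A'] - data['A'].mean()) / data['A'].std()
print(data)
```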
Data Structuring
Data structuring refers to the process of organizing and formatting data so that it is easily accessible and analyzable. This can include reorganizing data into new structures, consolidating data sources, and transforming unstructured data into structured formats.
DataFrame Restructuring:
Operations: Can include pivoting tables and joining multiple data sources.
Tools: Pandas offers functions like pivot_table, merge, and concat to facilitate these processes.
Data Format Conversion:
Description: Convert data from semi-structured or unstructured formats (such as JSON or XML) into tabular structures (DataFrames).
Implementation: Use specific Python parsers to read these formats and load them into Pandas.
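A minimal sketch of loading semi-structured JSON into a DataFrame (the JSON payload here is invented for illustration):

```python
import json
import pandas as pd

# Hypothetical semi-structured input, e.g. the body of an API response
raw = '[{"Nombre": "Ana", "Edad": 25}, {"Nombre": "Luis", "Edad": 30}]'

# Parse the JSON string, then load the list of records into a tabular DataFrame
df = pd.DataFrame(json.loads(raw))
print(df)
```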
Text Cleaning and Manipulation
1. Removal of special characters and numbers: Many texts include characters that are not relevant for analysis, such as special symbols, numbers, and punctuation. Removing them makes the text more uniform and easier to analyze.
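A sketch with a regular expression that keeps only letters (including Spanish accented letters) and spaces; the sample string is invented:

```python
import re

texto = "¡Precio: $150.00! (oferta #3)"

# Drop everything except letters and whitespace
limpio = re.sub(r'[^a-zA-ZáéíóúñÁÉÍÓÚÑ\s]', '', texto)
print(limpio)
```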
2. Conversion to lowercase: Converting all text to lowercase is a fundamental standardization step; it prevents the same word from being treated as two different tokens because of case differences.
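In Pandas this is a one-liner over a text column (invented data):

```python
import pandas as pd

serie = pd.Series(['Hola Mundo', 'HOLA mundo'])

# .str.lower() normalizes case across the entire column
minusculas = serie.str.lower()
print(minusculas)
```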
3. Removal of extra spaces: It is common to find additional spaces that should be removed to keep the text consistent.
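A compact idiom for this: split() collapses any run of whitespace, and join() rebuilds the string with single spaces (sample string invented):

```python
texto = "  demasiados   espacios  aquí "

# Collapse repeated whitespace and trim the ends in one pass
limpio = " ".join(texto.split())
print(limpio)
```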
4. Tokenization: Tokenization is the process of splitting text into smaller units, such as words or phrases. This is useful for more detailed text-analysis techniques, such as word counting or vectorization.
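A dependency-free sketch using whitespace splitting; libraries like NLTK or spaCy offer tokenizers that also handle punctuation and contractions:

```python
texto = "el análisis de texto es útil"

# Naive whitespace tokenization; sufficient after the cleaning steps above
tokens = texto.split()
print(tokens)
```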
5. Removal of stopwords: Stopwords are very frequent words that add little meaning to the text and can be removed. Common examples in Spanish include 'y', 'que', and 'de'.
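A sketch with a tiny hand-made Spanish stopword set (for real work, NLTK ships a fuller list via nltk.corpus.stopwords):

```python
# Illustrative stopword set; not exhaustive
stopwords = {'y', 'que', 'de', 'el', 'la'}

tokens = ['el', 'análisis', 'de', 'texto', 'y', 'datos']

# Keep only the tokens that are not stopwords
filtrados = [t for t in tokens if t not in stopwords]
print(filtrados)
```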
These steps and techniques will help prepare text data for more complex analyses and machine learning models, ensuring the text is clean and standardized.
Now, Feature Engineering is a fundamental part of the transformation stage in ETL processes, especially when preparing data for Machine Learning models. Here, you transform and create features that help improve the performance of predictive models.
Feature Engineering
1. Creating features from dates: Dates can be broken down into multiple features such as year, month, day, and day of the week, which can help capture seasonal patterns or trends over time.
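A sketch using the Pandas .dt accessor (the dates and Spanish column names are invented):

```python
import pandas as pd

df = pd.DataFrame({'fecha': pd.to_datetime(['2024-01-15', '2024-11-11'])})

# Decompose the datetime column into separate calendar features
df['año'] = df['fecha'].dt.year
df['mes'] = df['fecha'].dt.month
df['dia_semana'] = df['fecha'].dt.dayofweek  # Monday=0 ... Sunday=6
print(df)
```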
2. Binning of numeric data: This consists of converting continuous numeric variables into discrete categories, which can be useful for models that work better with categorical features.
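Pandas provides pd.cut for exactly this; the bin edges and labels below are invented for illustration:

```python
import pandas as pd

edades = pd.Series([5, 17, 25, 42, 70])

# Assign each age to a labeled interval; bins are half-open (lo, hi]
grupos = pd.cut(edades, bins=[0, 18, 40, 65, 120],
                labels=['menor', 'joven', 'adulto', 'mayor'])
print(grupos)
```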
3. Categorical features based on conditions: New categorical features can be created by applying specific rules or logical conditions to existing data.
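One common idiom is np.where, which applies a rule element-wise (the income threshold and column names are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ingreso': [1200, 3500, 800]})

# Derive a categorical segment from a numeric rule
df['segmento'] = np.where(df['ingreso'] >= 2000, 'alto', 'bajo')
print(df)
```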
4. Feature interactions: Creating new features from the interaction of existing ones can reveal relationships that are not otherwise evident.
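The simplest interaction is a product of two features (columns invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'alto': [2.0, 1.5], 'ancho': [3.0, 2.0]})

# Multiply two existing features to create an interaction term
df['area'] = df['alto'] * df['ancho']
print(df)
```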
5. Logarithmic and square root transformations: These transformations are useful for reducing skewness in data distributions.
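A sketch with NumPy; log1p (log(1 + x)) is used instead of log so that zero values are handled safely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ventas': [1, 10, 100, 1000]})

# Both transforms compress large values and reduce right skew
df['ventas_log'] = np.log1p(df['ventas'])
df['ventas_sqrt'] = np.sqrt(df['ventas'])
print(df)
```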
6. Encoding categorical variables: Converting categorical variables into numeric formats is essential for many machine learning algorithms.
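One quick numeric encoding uses Pandas category codes (the 'color' column is invented); the specific encoding schemes are discussed in the next section:

```python
import pandas as pd

df = pd.DataFrame({'color': ['rojo', 'verde', 'rojo']})

# Integer codes assigned per category (alphabetical by default)
df['color_cod'] = df['color'].astype('category').cat.codes
print(df)
```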
Handling Categorical Data
1. Label Encoding: Label encoding transforms each category into a number. It is useful when categories have a natural order (ordinal); applied indiscriminately, however, it can imply an order relationship where none exists and mislead models.
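For ordinal data, an explicit mapping keeps control of the order (scikit-learn's LabelEncoder is an alternative, but it assigns codes alphabetically); the size column is invented:

```python
import pandas as pd

talla = pd.Series(['S', 'M', 'L', 'M'])

# Explicit ordinal mapping: the codes respect the natural S < M < L order
orden = {'S': 0, 'M': 1, 'L': 2}
codigos = talla.map(orden)
print(codigos)
```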
2. One-hot Encoding: One-hot encoding converts each category into a new column and assigns a 1 or 0 (True/False). It is ideal for nominal variables with no inherent order.
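Pandas implements this as pd.get_dummies (invented data; recent Pandas versions return boolean columns):

```python
import pandas as pd

df = pd.DataFrame({'ciudad': ['NY', 'Madrid', 'NY']})

# One indicator column per category
dummies = pd.get_dummies(df['ciudad'])
print(dummies)
```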
3. Binary Encoding: Transforms category codes into binary numbers and splits the binary digits into individual columns. It is more space-efficient than one-hot encoding, especially for variables with many levels.
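A manual sketch of the idea (in practice the category_encoders library provides a BinaryEncoder); categories are first mapped to integer codes, then each code is split into binary digit columns:

```python
import pandas as pd

colores = pd.Series(['rojo', 'verde', 'azul', 'rojo'])

# Step 1: integer codes per category (alphabetical: azul=0, rojo=1, verde=2)
codes = colores.astype('category').cat.codes

# Step 2: split each code into binary digit columns
n_bits = max(1, int(codes.max()).bit_length())
binario = pd.DataFrame({f'bit_{i}': (codes // (2 ** i)) % 2 for i in range(n_bits)})
print(binario)
```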
Each of these methods has its advantages and limitations, and the choice depends on the specific context of the problem and the requirements of the Machine Learning model to be used. It is important to test different approaches and select the one that best preserves relevant information and contributes to the model's performance.
To conclude, it is crucial to perform data validation to ensure the quality and reliability of data in ETL processes, helping to detect and correct errors before the data is used for analysis or modeling.
Data Validation Techniques
1. Range Checking: This technique involves verifying that data values are within a specific range defined by business rules or domain logic. For example, age in a survey should not be negative or unrealistically high.
2. Referential Integrity Validation: This technique ensures that identifiers or keys in one table correctly correspond to those in other tables, maintaining data consistency across different parts of the database.
3. Format Checking: This consists of ensuring that textual data complies with specific formats, such as postal codes, phone numbers, email addresses, among others.
4. Uniqueness Checking: This technique verifies that there are no duplicates in data that must be unique, such as user IDs or serial numbers.
5. Completeness Validation: Checks that no values are missing in datasets, especially in columns essential for analysis or decision-making.
6. Consistency Validation: Ensures that data in different fields are consistent with each other, based on logical or business rules.
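Several of these checks translate directly into Pandas one-liners; a sketch over an invented users table (column names and rules are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'usuario_id': [1, 2, 3],
    'edad': [25, 30, 35],
    'email': ['a@x.com', 'b@x.com', 'c@x.com']
})

# Range checking: ages must fall inside a plausible interval
assert df['edad'].between(0, 120).all(), "edad fuera de rango"

# Uniqueness checking: IDs must not repeat
assert df['usuario_id'].is_unique, "IDs duplicados"

# Completeness validation: no missing values in key columns
assert df[['usuario_id', 'email']].notna().all().all(), "valores faltantes"

# Format checking: a simple e-mail pattern
assert df['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$').all(), "email inválido"

print("Validaciones OK")
```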
In conclusion, data transformation is a critical phase within the ETL process that lays the groundwork for informed analysis and decision-making.
Implementing these techniques correctly ensures that data is not only accurate and consistent but also relevant to the specific business or analysis requirements. Data cleaning, normalization, proper structuring, and validation are essential steps that, although often underestimated, have a direct impact on the quality of insights that can be derived from the data.
Moreover, the transformation stage is not only about manipulating data to fit a usable format but also about adding value through feature engineering, where creativity and domain-specific knowledge play a crucial role.
Finally, it is essential to develop a data transformation workflow that is both robust and flexible, allowing continuous adjustments and improvements as business requirements change and technology advances. This ensures that an organization's data infrastructure is not only sustainable but also remains competitive and relevant.
Ready to optimize data transformation in your Machine Learning projects?
At Kranio, we have experts in data engineering and Machine Learning who will help you implement efficient ETL processes, ensuring your models are trained with clean and structured data. Contact us and discover how we can boost your artificial intelligence projects.