Skip to main content
Data Science Series (3 of 3): Data Loading and Integration in the Cloud with AWS | Kranio

""" Output
Clothing: 2314 Technology: 232
"""

Fixing reconciliation errors

It is extremely common that once the data is reviewed, it does not match; this does NOT always mean that the transformations were done incorrectly. It may also happen that the data source is different or that when entering the data into a system to reconcile it, we are misinterpreting some numerical values, etc. We must identify in which cases the values are not matching, proceed to isolate them for analysis and comparison.

Focusing error detection

Next, we will correct a reconciliation error with real examples, in this case correcting a count of values that does not match. A useful alternative to identify the source of this error is to obtain a grouping of a column used in the transformation filtering by the value that does not match; this way we can count the values with the goal of exporting this result and being able to compare the file in Excel.


df = df.select('group_column').where(df.TIPO_PRODUCTO == 'ROP').groupBy('group_column').count()
df.show()
""" Output
+-----------------+---------------+
|group_column     |      count()  |
+-----------------+---------------+
|       32413EH200|              1|
|       30413EH300|              2|
|       30413EH400|              1|
|       33313EH500|              3|
+-----------------+---------------+
"""

# We export the file to compare it in Excel
test.write.csv('grp_matnr', sep=';')

We create a tab in Excel where we add the result of the exported CSV and the same calculation from the script but done in Excel, each with their corresponding columns, as follows:

This will directly show us which values differ from our script and will allow us to focus our analysis on a few data points instead of thousands of them.

Transforming values that alter columns

On the other hand, we may encounter cases where some column values are altering the script’s calculations or disabling their calculation from the reconciliations. To fix this, we can generate scripts that transform the resulting data or the ingested data. For example, below it was identified that the use of double quotes in some values caused problems, so they are removed from the script.


df = dataframes_dict['main']

# Dictionary with the data types of each column
columns_types_dict = {
	#...
  'product_desc': StringType(),
}

# We iterate over each column indicated in the dictionary
for column_name, column_type in columns_types_dict.items():
	...
	
	# We remove the double quote from all Strings 
	elif column_type == StringType():
	  df = df.withColumn(
	    column_name,
	    regexp_replace(column_name, '"', '')
	  )

Fix separator in transformation

Sometimes the separator in the output file does not match the tool we use for reconciliation. Fortunately, fixing the separator of the generated document is very simple to do at the time of exporting in the script.


test.write.csv('export', sep=';')

It is also possible to fix the separator by searching all values where the separator appears and replacing it in all cases with another. For these tasks, a tool like Notepad++ can be very useful.

Fix sum of values

Sometimes when obtaining data from management systems like SAP with raw data, we may find data that we must transform to make it readable from the Data Warehouse and so that it can be reconciled, having to convert formats as follows:

410.500,210 ⇒ 410500.210

Script to fix decimal format


df_double_obj = ['peso_neto', 'cantidad', 'us$_fob', 'us$_cif', 'us$_flete', 'cantidad_aux', 'espesor', 'volumen_m3', 'densidad']
    
    for column_name in df_double_obj:
        if column_name not in data.columns:
            continue
        data[column_name] = data[column_name].apply(lambda x: str(x).replace('.','') if ',' in str(x) else x)
        data[column_name] = data[column_name].apply(lambda x: str(x).replace(',','.') if ',' in str(x) else x)

Sometimes we cannot use the script to reconcile and must correct the file manually. For example, in Excel, numerical values with incorrect decimal formatting do not allow sums to reconcile; this is because the numbers are misinterpreted, altering the real value of the results.

As in the previous case, it is possible to modify all these types of formats with replacements in tools like Notepad++.

Ready to optimize your data loading and validation in the cloud?

At Kranio, we have data solutions experts who will help you implement efficient loading and reconciliation processes using AWS tools, ensuring the quality and availability of your information for strategic decision-making. Contact us and discover how we can drive your company’s digital transformation.

Previous Posts

Google Apps Scripts: Automation and Efficiency within the Google Ecosystem

Google Apps Scripts: Automation and Efficiency within the Google Ecosystem

Automate tasks, connect Google Workspace, and enhance internal processes with Google Apps Script. An efficient solution for teams and businesses.

Augmented Coding vs. Vibe Coding

Augmented Coding vs. Vibe Coding

AI generates functional code but does not guarantee security. Learn to use it wisely to build robust, scalable, and risk-free software.

Kraneating is also about protection: the process behind our ISO 27001 certification