
Kranio Methodology: Keys to Successful Data Projects
Many companies develop initiatives around their data by implementing Data Lakes or other technologies that let them extract, store, and organize as much data as possible for analysis and better decision-making. The goal is to better understand customers' needs, improve service quality, and predict and prevent undesired outcomes by drawing meaningful insights from all this information.
With data stored and organized, we can process and analyze current and future scenarios, answering questions such as: what is happening (Descriptive), why it is happening (Diagnostic), what will happen (Predictive), and what actions to take to optimize and increase efficiency (Prescriptive).

The Data Lake is a living, dynamic, and evolving system that receives data from different sources (structured and unstructured) in a variety of formats, where data arrives raw, not yet optimized or transformed for specific purposes. It is therefore important to know and understand the characteristics, regulations, and standards the data must meet before users can consume it.
You should always define data governance when implementing a Data Lake. The first thing is to understand: what is data governance? There are many definitions; for us, it encompasses all the processes, policies, and tools that ensure the security, availability, usability, and quality of data. It guarantees that only authorized users can access, explore, and exploit the data, that the data is up-to-date and intact, and that the risk of exposing sensitive or confidential information about individuals and organizations is avoided.
Deciding how to protect data, how to maintain its consistency, integrity, and accuracy, and how to keep it up to date are the points we cover in this document; they are all part of the Kranio data process.
How do we do it? - The Secret Ingredient

Below we briefly describe the Kranio methodology applied to the data process, composed of several stages not necessarily sequential:
Preparation
This is where the data process begins. Together with the business, we adopt an agile framework and lead an organized design sprint, with defined time, participants, and context, in order to define KPIs and iteration cycles for product construction and to understand business definitions and rules. We conclude with an initial backlog of activities to be prioritized during project execution.
Depending on the project type, we use or propose a framework based on agile methodologies; our greatest experience is with Scrum and Kanban, which allow us to prioritize tasks over time and define follow-up routines with the business that give stakeholders visibility of the product (we recommend creating a weekly dashboard).
Regarding communication, tracking, and documentation tools and methods, our approach is agnostic: we adapt to what clients already have, and if they have nothing, we recommend and implement the minimum needed to meet expectations and ensure project success, helping establish standards for their data projects.
The key factors we consider in the preparation phase of a data project are:
- Engage with the business needs and expectations; understand the problem and the value the project will generate. Why are we doing the project? What is its real value for the business? Clarity here allows us not only to capture business requirements but also to contribute our experience through suggestions and value proposals during execution.
- Identify the interlocutors and stakeholders, clearly establishing their roles within the project.
- Identify early the information sources available, both external and internal to the Data Lake. This helps manage expectations and allows early alerts so action plans can be prepared and presented.
Data Ingestion
As important as understanding the problem to solve is understanding the data available. Clients usually have a way of storing and manipulating data, but that does not necessarily mean the data is stored correctly, or they may lack the platforms needed to do so. With that understanding, we begin to define and agree with the client on a technological architecture that satisfies current business needs and is designed for future usability, scalability, and ease of maintenance.

With the architecture defined, we identify the sources to integrate, the best way to extract the data (tool or technology), and the extraction frequency, and we evaluate whether the data contains structures that identify people (PII) or other confidential content. This matters so we can apply the appropriate treatment and later store the data in an organized and secure manner, in the format and structure defined for the Data Lake. We also analyze the information already available in order to reuse as much of it as possible.
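As an illustration of that PII evaluation, the sketch below flags columns whose names suggest personal data. It is a minimal example: the pattern list, the column names, and the name-based heuristic itself are assumptions, and a real scan should also inspect data values and follow the regulations that apply to the project.

```python
import re

# Hypothetical column-name patterns that often indicate PII; adjust to the
# client's data dictionary and the applicable regulations.
PII_PATTERNS = ["name", "email", "phone", "address", "ssn", "rut"]

def flag_pii_columns(columns: list[str]) -> list[str]:
    """Return the column names that match a known PII pattern."""
    return [
        col for col in columns
        if any(re.search(pattern, col.lower()) for pattern in PII_PATTERNS)
    ]

# Example: columns arriving from a new source table
print(flag_pii_columns(["customer_name", "order_id", "email_address", "amount"]))
# -> ['customer_name', 'email_address']
```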
Kranio's DataOps team then builds the digital products (code) that move data from the various sources into the Data Lake. Countless tools and services support the extraction and storage processes; even so, creating a data pipeline is vital because it automates data validation and loading. It provides centralized orchestration and monitoring, incorporating execution tracking, alert generation, error logging, and audit trails.
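The article does not publish Kranio's pipeline code, but the sketch below shows the kind of step wrapper such a pipeline can use to get execution tracking, error logging, and a hook for alerts. The extract and load steps are hypothetical placeholders.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def run_step(name, func, *args, **kwargs):
    """Run one pipeline step, recording start, success, and failure."""
    log.info("step=%s status=started at=%s", name, datetime.now(timezone.utc).isoformat())
    try:
        result = func(*args, **kwargs)
        log.info("step=%s status=succeeded", name)
        return result
    except Exception:
        # The failure log is where an alerting integration would hook in.
        log.exception("step=%s status=failed", name)
        raise

# Hypothetical extract and load steps wired into an ordered pipeline:
def extract():
    return [{"id": 1, "amount": 100.0}]

def load(rows):
    log.info("loaded %d rows", len(rows))

rows = run_step("extract", extract)
run_step("load", load, rows)
```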
Check out how to create a simple and robust data pipeline between legacy applications and the Data Lake in this video.
A checklist to cover at this stage:
Define Standards:
- Establish programming languages, code repositories, libraries, cloud services, orchestration, monitoring, management, and auditing tools.
- Generate parametric code: never leave static values hardcoded inside programs; use configuration files or tables (see the sketch after this list).
- Use standardized nomenclature for buckets and stored files.
- Define the format in which we will store data in transformation and consumption layers.
- If the project requires defining a Data Model, it should ideally not be oriented to one specific requirement; we must think about scalability and build a robust model that lays the foundation for future requirements, not just the current project.
- Products must have validation points to ensure that, besides completing correctly, they generate balanced and consistent data.
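As a minimal sketch of the parametric-code and nomenclature standards above, the snippet below reads job parameters from a configuration file and builds object keys from one naming pattern. The file layout, keys, and naming convention are illustrative assumptions, not a prescribed standard.

```python
import json
from pathlib import Path

def load_config(job_name: str) -> dict:
    """Read job parameters from a config file instead of hardcoding them,
    e.g. config/ingest_sales.json with keys like "target_bucket" and "file_format"."""
    return json.loads(Path(f"config/{job_name}.json").read_text())

def build_object_key(cfg: dict, domain: str, dataset: str, run_date: str) -> str:
    """Apply one standardized nomenclature for stored files (illustrative pattern)."""
    return f"{cfg['target_bucket']}/{domain}/{dataset}/dt={run_date}/data.{cfg['file_format']}"

cfg = {"target_bucket": "curated-sales", "file_format": "parquet"}  # would come from load_config
print(build_object_key(cfg, "sales", "orders", "2024-01-15"))
# -> curated-sales/sales/orders/dt=2024-01-15/data.parquet
```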
Generate monitoring and auditing processes (a minimal audit-record sketch follows this list):
- Provide traceability of all executions (successful and failed).
- Record all actions of data captures, transformations, and outputs.
- Provide sufficient information to minimize analysis time in case of failures.
- Provide a centralized and easily accessible log repository that allows us to solve problems.
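One way to satisfy these traceability points is to give every capture, transformation, and output a uniform audit record that lands in the centralized log repository. The schema below is an assumption, a sketch of the minimum fields such a record could carry.

```python
import json
from datetime import datetime, timezone

def audit_record(process: str, action: str, status: str, rows: int = None, detail: str = None) -> dict:
    """Build one audit-trail entry; in practice it is written to the central log store."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "process": process,  # e.g. "ingest_sales"
        "action": action,    # "capture" | "transform" | "output"
        "status": status,    # "success" | "failure"
        "rows": rows,
        "detail": detail,    # enough context to minimize analysis time on failure
    }

print(json.dumps(audit_record("ingest_sales", "capture", "success", rows=15230)))
```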
Ensure product quality:
- Product delivery must include evidence of the quality control performed. We guarantee not only that a process runs, but that it runs well and produces the expected result.
- Generate evidence of data reconciliation, recording what was reconciled and under which scenario and conditions it was generated (a minimal reconciliation sketch follows this list).
- Reliable data, backed by automated validations and reconciliations with high monitoring coverage, keeps the credibility of the digital product from being questioned whenever an error or discrepancy appears. Your best ally is delivering certified, consistent, error-free work.
- Products designed for operational continuity, providing all resources for easy delivery and control takeover by the operations team.
- End-to-end automated products avoid manual interventions that risk operational continuity.
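A minimal sketch of one such automated reconciliation, assuming the check compares row counts and a control sum between source and target; the tolerance and the evidence format are assumptions.

```python
def reconcile(source_count: int, target_count: int,
              source_sum: float, target_sum: float, tolerance: float = 0.0) -> dict:
    """Compare source vs. target totals and return the evidence of the check."""
    return {
        "count_ok": source_count == target_count,
        "sum_ok": abs(source_sum - target_sum) <= tolerance,
        "source_count": source_count, "target_count": target_count,
        "source_sum": source_sum, "target_sum": target_sum,
    }

evidence = reconcile(15230, 15230, 1_250_000.0, 1_250_000.0)
assert evidence["count_ok"] and evidence["sum_ok"], f"Reconciliation failed: {evidence}"
```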
All of the above allows us not only to certify the work but also to have every implementation follow the same guidelines and way of doing things, optimizing construction time and improving quality, clarity of understanding, and traceability of each part of the process.
Data Processing and Enrichment
With data securely stored and organized in the Data Lake, we establish within the framework the procedures and transformations needed to move from raw stored data to information usable by clients.
Knowing that data comes from multiple sources that may be unreliable, it is vital to have a process for analyzing data quality and usability. This process can start manually but should end as an automated, tool-supported process. The role of data architects and engineers is vital here: they evaluate whether more information is needed to define a reliable dataset for the required analyses, identify data missing for the correct creation of the dataset fundamental to the project, and eliminate incorrect, duplicate, or incomplete records.
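The sketch below illustrates that kind of quality analysis with pandas, separating duplicate, incomplete, and rule-violating rows. The dataset, column names, and the amount-must-be-non-negative rule are illustrative assumptions.

```python
import pandas as pd

# Illustrative raw data; in practice this comes from the Data Lake's raw layer.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "amount": [100.0, -5.0, 250.0, None, 80.0],
})

duplicates = df[df.duplicated(subset=["customer_id"], keep=False)]  # repeated keys
incomplete = df[df.isna().any(axis=1)]                              # missing values
incorrect = df[df["amount"] < 0]                                    # business-rule violations

# Keep only the rows that pass every check.
clean = (df.dropna()
           .query("amount >= 0")
           .drop_duplicates(subset=["customer_id"]))
```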
In this process, specialists apply Big Data project best practices: they organize information into the categories and classifications the data allows, analyze each subset independently, and perform structuring or enrichment transformations, adding new columns, generating calculated data from existing data, or incorporating information from external sources. The final data can be made available to users through a relational model in a data warehouse, a data access view, or a consumption file.
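A small pandas sketch of such an enrichment: a calculated column derived from existing data, plus a join against a hypothetical external reference table. The names and contents are assumptions.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2], "qty": [3, 5],
    "unit_price": [10.0, 4.0], "region_code": ["N", "S"],
})
# Hypothetical external reference data used for the enrichment.
regions = pd.DataFrame({"region_code": ["N", "S"], "region_name": ["North", "South"]})

enriched = (orders
            .assign(total=lambda d: d["qty"] * d["unit_price"])  # calculated column
            .merge(regions, on="region_code", how="left"))       # external source
print(enriched)
```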
Leveraging the “intelligence” that data provides is the role of the data scientist, who analyzes simulated scenarios to make predictions and applies mathematical modeling, statistical analysis, and machine learning techniques, developing predictive analysis, clustering, regression, and pattern recognition that produce new data, enriching the project and adding more value for clients.
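As one hedged example of those techniques, the sketch below segments customers with scikit-learn's KMeans; the two features and the number of clusters are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative feature matrix: monthly spend and visit frequency per customer.
X = np.array([[100, 2], [120, 3], [800, 20], [750, 18]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # two customer segments, e.g. [0 0 1 1]
```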
Information security and correct user access to the Data Lake must be established in data governance, where ownership, access methods, security rules for confidential data, data history, source origins, and others are identified.
Exploitation, Consumption and Operational Continuity
The last aspect is how users will use the data. For the consumption and exploitation of the generated data, the focus must be on the business user's needs; with them we co-design a solution proposal that meets the requirements and parameters defined in the preparation phase. This work includes reviewing the requirements gathered during project preparation and selecting a platform suited to the visualization needs, such as Tableau, Power BI, AWS QuickSight, or others. You should create a storyboard for the different user categories and prepare a customized, efficient design that makes the presented data easy to understand.
Good data discovery and exploration work will define the basis for having a self-service data visualization platform, where users obtain insights to improve management and decision-making intelligently and backed by reliable information. For example, see how a company ‘listens’ to what its customers say in forms, contact centers, and social networks and uses it to improve customer service.
Data quality and reliability are vital: if bad data reaches the platform, visualizations and dashboards will contain errors, and the generated files will not provide the information the client requires. This point deserves emphasis, since analysis accounts for 70% of the total time needed to create a good dashboard.
Not everything ends with delivering a dashboard or file to clients; you must also think about the future, which means considering other important aspects:
- Operational continuity of the developed digital products. Most of the time, operating and monitoring the production environment will be the client's responsibility, so the focus is on easing their workload: minimizing monitoring time, the time needed to diagnose problem causes, and failure resolution time.
- Scalability of the solution across all components (infrastructure, architecture, and the tools used during development) so it can grow as the business requires.
- Ease of tracing eventual problems. Being able to quickly find information, have an initial diagnosis, and a correction plan.
- Minimize complexity of processes to simplify future adjustments or improvements.
- All projects include complete documentation that facilitates collective understanding.
- Avoid using multiple tools that in the long run do not add business value.
- Generate communication instances with the client that allow an effective and clear handover of all the work done. The clearer the client is about the work delivered, the higher their satisfaction, always considering the interlocutor's profile (operations area, business, and others).
Conclusion
Ensure success in the design, implementation, and development of a data project by applying a data project methodology. In this article, we showed you the Kranio methodology, applied and refined in dozens of data projects across several countries. Applying it gives you a better chance of meeting expectations and avoiding mistakes that can ruin a project.
Another fundamental aspect you must ensure is business user participation: they are the ones who consume the data to improve decisions. From the start and at each stage, robust, aligned, and well-communicated teams deliver better projects.
Do you want to review your methodology or implement this one for successful projects?
Ready to take your data projects to the next level?
At Kranio, we apply a proven methodology that ensures success at every stage of your data initiatives. Our team of experts will guide you from planning to implementation, guaranteeing effective results aligned with your business objectives. Contact us and discover how we can collaborate on your company's success.