
Introduction to Data Science: Types of Data and Key Roles in Projects
Data Science projects encompass all those developments in which data is extracted from various sources, manipulated, and visualized for analysis.
To build these projects, it is necessary to understand the client's business and the data they possess in order to develop a solution that delivers value to the organization and supports decision-making.
About this series and article
This is the first article in the series “Data Science”. The articles can be read in any order, since the content is divided into stages that, although strongly connected, can be understood individually. Each publication aims to shed light on the processes carried out in the industry: it could help you decide whether your organization should hire a service to migrate its data to the cloud, or, if you are a student, show you how the development of this type of project works. In this first part, we will talk about the value of data and the roles played by clients, users, and developers in Data Science projects.
The data
Data is all information that is useful to a company, and organizations today can access a vast amount of it: internal data from the organization, external data from clients, and external data from the industry or competition. Companies that have digitized their operations therefore generate data that can be captured, processed, and analyzed.
To work with data, it is first necessary to store it, and there are several alternatives for doing so. Cloud computing services such as Google Cloud Platform or Amazon Web Services (among others) are extremely efficient and cost-effective, as each provides a variety of services that help store data efficiently and securely.
The value of data
To obtain value from data, we must capture, store, and structure it in a way that supports business decisions. Data can be used not only to analyze past or current situations but also to make predictions and take intelligent actions. This means that after capturing the data, a way must be found to extract real value from it.
Once we capture, identify, or enable a data source, we must store it. Data storage can be divided into two distinct systems, explained below.
Data Warehouse vs Data Lake
Both data warehouse and data lake services aim to solve the same problem: storing large amounts of data. The main difference is that data lakes are designed to store raw data, while a data warehouse stores structured information that has already been filtered and processed into a defined schema.
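To make the contrast concrete, here is a minimal sketch in Python: raw JSON lines stand in for records landing in a data lake, and an in-memory SQLite table stands in for the warehouse's filtered, typed schema. The field names and values are invented for illustration.

```python
import json
import sqlite3

# Raw events as they might land in a data lake: unprocessed JSON lines,
# possibly with inconsistent types or extra fields.
raw_events = [
    '{"customer": "ACME", "amount": "1200.50", "currency": "USD"}',
    '{"customer": "Globex", "amount": "830", "currency": "USD", "note": "rush order"}',
]

# A data warehouse stores the filtered, structured version. Here an
# in-memory SQLite table stands in for the warehouse schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")

for line in raw_events:
    event = json.loads(line)  # parse the raw record
    conn.execute(
        "INSERT INTO sales VALUES (?, ?)",
        (event["customer"], float(event["amount"])),  # enforce types, drop extras
    )

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 2030.5
```

The same filtering and typing step, scaled up, is what an ETL process performs before data reaches the warehouse.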
Structured data vs unstructured data
When storing data, we can encounter two formats:
Structured data: Highly organized data, such as customer records, tables, or other tabular data, which tends to be quantitative. The advantage of this format is that it can be easily stored and managed in databases. Note that this type of data is produced by building models and structures that allow it to accumulate in an orderly manner. This type of information is stored in Data Warehouses.
Unstructured data: Data that is not organized; it tends to be qualitative, contains undefined information, and comes in many formats. Examples include images, audio, and PDF files. This type of information is stored in Data Lakes.
Quantitative: All information that can be measured.
Qualitative: All information that cannot be measured directly and for which measurement scales or models must be created.
Below we can see a figure that describes the differences between structured and unstructured data.

Both kinds of data can be used to obtain results and make intelligent decisions. However, unstructured historical data is much more difficult to analyze; with the right cloud tools, though, value can be extracted from it by using APIs to create structure.
Historical data: This is the information that organizations generate over the years, which is generally disorganized and comes from various sources.
APIs: These are tools that allow integration or communication between two different systems.
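As a tiny illustration of creating structure, the sketch below uses a regular expression instead of a cloud API to turn a free-text line into a structured record. The log format and field names are invented; in practice, a provider's OCR, speech-to-text, or document-parsing API would play this role over images, audio, or PDFs.

```python
import re

# A raw, unstructured line, as might appear in a historical export.
raw = "2023-04-02 Payment received from ACME for $1,200.50"

# A hypothetical pattern that imposes structure: date, customer, amount.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) Payment received from "
    r"(?P<customer>\w+) for \$(?P<amount>[\d,.]+)"
)

match = pattern.search(raw)
record = {
    "date": match.group("date"),
    "customer": match.group("customer"),
    "amount": float(match.group("amount").replace(",", "")),  # normalize to a number
}
print(record)  # {'date': '2023-04-02', 'customer': 'ACME', 'amount': 1200.5}
```

Once records like this exist, they can be loaded into structured storage and analyzed like any other tabular data.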
Roles
To carry out a project, effective communication must exist among the three main roles present in a data project: the client, the user, and the development team.
The Client
The client plays a fundamental role in this type of project: when working with an organization's data, it is imperative that, throughout the entire construction process, the development team understands how the client company operates and how it works with its data. Essentially, it is about understanding the business logic.
This understanding of the business is built jointly by the developers and the client: the development team must make sure all of its questions about the business operation and the use of data are resolved, and the client must be able to answer them, which makes the difference in achieving a good result.
Meetings with the client
The project starts with a stage consisting of a series of meetings with the client to understand their expectations and workflow, thus defining the solution they need. The first meetings held between client and developers are called “requirements gathering meetings”, and all subsequent ones are called “understanding meetings”. Both types of meetings aim to get to the heart of the problem to be solved and, ideally, to jointly determine the value data has for the business. The true objective, however, is to understand the business logic (hence the name). This understanding process includes knowing how data is obtained, what manual processes are carried out with it, how it is presented, and ultimately how it is expected to be visualized or accessed. In summary: the goal of the understanding meetings is to trace the data that shapes the business logic.
During project development, it is recommended that the client appoint a Product Owner so that communication is even more effective and agile. At the same time, a Product Owner working with the development team helps ensure that the team's efforts target exactly what the client is looking for, reducing work that is later modified or discarded because it strays from the client's needs.
In agile projects, the Product Owner is the team member who belongs to the client's organization and supports the developers and Scrum Master in keeping the project aligned with the vision and requirements of their own organization.
Automated client ingestion
For the project to develop correctly, the client must ensure that the development team has enough data available to build the logic.
These files are usually uploaded to ingestion zones in a cloud computing service, such as AWS S3.
The ingestion zones are the directories or spaces within a Data Lake where data that will enter the pipeline, or data flow, is stored. The development team uses them to test and build the expected result.
This process is not limited to development, however: for the solution's data flow to operate correctly, new data must be supplied to these ingestion zones periodically. Generally, the upload frequency is directly related to how often the entire process is triggered, especially in a serverless project.
A serverless project is one that consumes resources and runs only when required.
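As a sketch of how files might be organized in an ingestion zone, the snippet below builds a date-partitioned object key. The `ingestion/raw` prefix, the partition layout, and the bucket name are illustrative assumptions, not a fixed convention.

```python
from datetime import date

def build_ingestion_key(zone: str, run_date: date, filename: str) -> str:
    """Build a date-partitioned object key for a Data Lake ingestion zone.

    The zone prefix and partition scheme are illustrative assumptions.
    """
    return (
        f"{zone}/year={run_date.year}"
        f"/month={run_date.month:02d}/day={run_date.day:02d}/{filename}"
    )

key = build_ingestion_key("ingestion/raw", date(2024, 3, 15), "sales.csv")
print(key)  # ingestion/raw/year=2024/month=03/day=15/sales.csv

# With boto3 installed and AWS credentials configured, the upload itself
# might look like this (the bucket name is a placeholder):
# import boto3
# boto3.client("s3").upload_file("sales.csv", "my-data-lake-bucket", key)
```

Partitioning keys by date like this makes it straightforward for a scheduled or event-triggered serverless process to pick up only the newest uploads.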
The user
The solution must always be built with the user in mind. In a data project, a user can range from the management and supervisors of an area to a worker who has been generating reports manually for some time; that manual solution will now be moved to the cloud to automate their work.
The development team needs users in order to understand the solution to be developed, especially when the solution already exists and is to be moved to a cloud service.
To ensure that the solution matches what the user seeks, the user must meet directly with the development team. This gives the team full context to understand the data model and how the necessary metrics are obtained. As with the understanding meetings, this is an iterative process in which every detail about the data, particularly the flow it passes through, must be resolved.
Development team
The development team, made up of professionals from various IT disciplines, is responsible for delivering the solution.
For example, we may find professionals fulfilling roles such as Data Engineer, DataOps, DevOps, Cloud Engineer, and Data Analyst.
The project context changes throughout the development, and the team must always be informed and focused. Feedback from both users and clients allows developers to build a deliverable that meets the expectations of both roles. We can summarize the concepts seen in this last section with the following image:

In the next article of the series, we will see in detail what an ETL flow is and how data is extracted and transformed. We hope this article has been helpful; if you have any questions or your organization needs support to solve projects of this type, do not hesitate to contact us.
Ready to start your journey in data science?
At Kranio, we accompany you at every stage of your Data Science projects, from data identification and storage to the implementation of analytical solutions. Our team of experts is ready to help you transform your data into strategic decisions. Contact us and discover how we can drive your company’s digital transformation.