The Builder is the module that makes the product team's life easier and standardizes the lifecycle management of all data products. It exposes APIs for creating Data Products and their components (leveraging Templates) and for maintaining and evolving them. It interacts with the Provisioner module for deployments by sending it the Data Product Descriptor.
A Data Product is an aggregate: it is a set of components with data, code, metadata, infrastructure, controls, etc. All these components together are fundamental building blocks of the Data Product, which can't exist without its parts. When a Data Product is first created from its template inside a domain, it is just an empty set, containing only basic metadata. What are the phases that a Data Product passes through during its life?
The initial version of the Data Product is 0.0.0, and since it is treated as a regular entity (of kind "system") the generated repository will contain a catalog-info.yaml file representing it.
At this stage, the Data Product is "created", but is still empty, and this simply means that a repository with its definition exists.
Then a user can add one or more components to the newly created data product. All the components are entities as well, so their repositories will contain a catalog-info.yaml with their metadata and a reference to the data product they belong to.
Then the Data Product as a whole can be edited (by modifying the individual repositories), and the versions of the various components can be updated accordingly by performing snapshot, commit, and release operations. See the versioning section and the Control Panel chapter for more on this.
Once the Data Product is valid (you can test its validity against the data governance's policies) it can be deployed. The Data Product deployment will start the underlying deployment of all of its components, in a sequence defined by the "dependsOn" relations among them.
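The ordering implied by the "dependsOn" relations can be pictured as a topological sort: a component is deployed only after everything it depends on. The following is a minimal sketch of that idea, not witboost's actual implementation; the component names and the dictionary layout are hypothetical.

```python
from graphlib import TopologicalSorter


def deployment_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return component names so that each one comes after its dependencies."""
    # TopologicalSorter takes {node: predecessors}; a dependency cycle
    # raises graphlib.CycleError instead of producing an order.
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical components of one data product and their "dependsOn" lists.
components = {
    "storage": [],
    "workload": ["storage"],
    "output-port": ["workload"],
}

print(deployment_order(components))
```

In this sketch a circular "dependsOn" chain is rejected up front, which is one reasonable way to treat a descriptor that cannot be deployed in any sequence.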
TACTICAL DATA PRODUCT VS STRATEGIC DATA PRODUCT
A tactical data product is a tool or system that is used to support the day-to-day operations of an organization. It is typically focused on providing information or insights that can be used to make short-term, immediate decisions or take immediate actions. Tactical data products are often used to support specific processes or functions within an organization, such as sales, marketing, or customer service.
A strategic data product, on the other hand, is a tool or system that is used to support long-term planning and decision-making at the highest levels of an organization. It is focused on providing information or insights that can be used to inform strategic decisions and shape the direction of the organization over the long term. Strategic data products are often used to support the development of business plans, budgets, and other high-level strategic documents.
In general, tactical data products are more focused on the present and immediate needs of an organization, while strategic data products are more focused on the long-term direction and goals of the organization.
DATA PRODUCT SPECIFICATION
A data product specification is a YAML file collecting the set of descriptors of the data product components. For more details, see Data Product Specification.
DATA PRODUCT DESCRIPTOR
A structured representation of the components that make up a specific data product. It is an instance of a Data Product Specification.
Inside witboost all elements are treated as entities: data products, components, domains, releases, and even users and groups.
Entities are represented by a catalog-info.yaml file stored in a repository that tells witboost everything about that entity: type, kind, relations, metadata, annotations, etc.
When an entity is registered into witboost (by creating it using a template or by registering an existing catalog-info.yaml file) a Location entity is saved into the database.
A Location entity holds a reference to the source entity's catalog-info.yaml. The first processing of the location inserts the related entity in the witboost catalog database. After that, each location is periodically read again (usually every 2 minutes) to detect changes and update the related entity in the catalog database.
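The periodic re-reading of a location can be sketched as "fetch the file, compare it with what was seen last time, update only on change". The code below is an illustrative sketch under stated assumptions: the `Catalog` class, the `refresh` function, and the use of a content hash for change detection are all hypothetical and are not taken from witboost.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Catalog:
    """In-memory stand-in for the catalog database (hypothetical)."""
    entities: dict = field(default_factory=dict)  # location target -> raw file content
    digests: dict = field(default_factory=dict)   # location target -> content hash


def refresh(target: str, fetch: Callable[[str], str], catalog: Catalog) -> bool:
    """Re-read one location and update the catalog only if the content changed."""
    content = fetch(target)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if catalog.digests.get(target) == digest:
        return False  # unchanged since the last read
    catalog.entities[target] = content
    catalog.digests[target] = digest
    return True


# A dictionary plays the role of the repository that hosts catalog-info.yaml.
repository = {"orders/catalog-info.yaml": "kind: System\n"}
catalog = Catalog()
print(refresh("orders/catalog-info.yaml", repository.__getitem__, catalog))
print(refresh("orders/catalog-info.yaml", repository.__getitem__, catalog))
```

The first call corresponds to the initial processing of the location; later calls stand for the periodic reads, which leave the catalog untouched when nothing has changed.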
The Marketplace module is the Data Product marketplace, where consumers can search and discover all the data products in your mesh, as well as access key information such as version, domain, status, and environments. For details, see Marketplace features.
Governance policies are a way for the data platform to define rules and requirements that must be respected by all the data products and components defined inside the data mesh. By using policies we can ensure that all the data products have certain features (e.g. they all contain at least one data quality component) or that certain fields hold specific values (e.g. values matching a regular expression). See the Governance Policies section for details.
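The two kinds of policy mentioned above (a required feature and a value constraint) can be illustrated with plain predicate functions over a descriptor. This is only a sketch: the descriptor layout, the `kind` value "dataquality", and every function name below are assumptions made for illustration, not the witboost policy format.

```python
import re
from typing import Callable

Policy = Callable[[dict], bool]


def requires_component_kind(kind: str) -> Policy:
    """Policy: the descriptor has at least one component of the given kind."""
    def check(descriptor: dict) -> bool:
        return any(c.get("kind") == kind for c in descriptor.get("components", []))
    return check


def field_matches(field_name: str, pattern: str) -> Policy:
    """Policy: a top-level string field fully matches a regular expression."""
    compiled = re.compile(pattern)

    def check(descriptor: dict) -> bool:
        value = descriptor.get(field_name)
        return isinstance(value, str) and compiled.fullmatch(value) is not None
    return check


def violations(descriptor: dict, policies: dict[str, Policy]) -> list[str]:
    """Return the names of the policies the descriptor does not satisfy."""
    return [name for name, check in policies.items() if not check(descriptor)]


# Hypothetical policies and descriptor.
policies = {
    "has-data-quality": requires_component_kind("dataquality"),
    "semantic-version": field_matches("version", r"\d+\.\d+\.\d+"),
}
descriptor = {
    "name": "orders",
    "version": "0.0.0",
    "components": [{"name": "orders-table", "kind": "outputport"}],
}

print(violations(descriptor, policies))
```

Here the descriptor passes the version check but fails the data quality requirement, so it would not yet be considered valid for deployment.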
The Provisioner is the module that makes a Data Product a consistent and atomic deployment unit: the deployment of a Data Product must occur as a single, consistent operation. It offers standard APIs that let the user monitor the status of the provisioning operations.
Provisioning Coordinator: main entry point of the Provisioner module. It works as a dispatcher for all the components’ deployments to the Specific Provisioners. It receives a Data Product Descriptor as input and interacts with the Marketplace and the Data Catalog to communicate the deployment outcome.
Specific Provisioner: an agent responsible for provisioning a specific kind of Data Product Component.
Validator: a module that validates the Data Product Descriptor against a set of rules (implemented using a Rule Engine). Invoked by the Builder and the Provisioning Coordinator.
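The dispatcher role of the Provisioning Coordinator can be sketched as a lookup from each component to the Specific Provisioner that handles it. This is a simplified illustration, not the real module: the `technology` key, the status strings, and the function names are hypothetical, and validation, ordering, and rollback are deliberately left out.

```python
from typing import Callable

# A Specific Provisioner is modelled as a function that deploys one
# component and reports success or failure (hypothetical signature).
SpecificProvisioner = Callable[[dict], bool]


def dispatch(descriptor: dict, provisioners: dict[str, SpecificProvisioner]) -> dict[str, str]:
    """Send every component in the descriptor to the matching provisioner."""
    outcome: dict[str, str] = {}
    for component in descriptor["components"]:
        handler = provisioners.get(component["technology"])
        if handler is None:
            outcome[component["name"]] = "NO_PROVISIONER"
        elif handler(component):
            outcome[component["name"]] = "COMPLETED"
        else:
            outcome[component["name"]] = "FAILED"
    return outcome


def athena_provisioner(component: dict) -> bool:
    """Stand-in for a provisioner that would create Athena tables."""
    return True


descriptor = {
    "components": [
        {"name": "orders-table", "technology": "athena"},
        {"name": "orders-stream", "technology": "unknown-tech"},
    ]
}

print(dispatch(descriptor, {"athena": athena_provisioner}))
```

The returned mapping stands for the per-component outcome that the coordinator would then report to the Marketplace and the Data Catalog.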
Templates are a tool that can help you create components inside your Data Mesh. They can load skeletons of pre-defined code and configurations, and then publish the skeleton content to repositories in locations such as GitLab or GitHub. Each template represents a ready-to-use repository that can be created by just inserting its fundamental details. They are used by the Builder and are the basis of Specific Provisioners. We divide them into Use Case Templates (usable from the UI to build data products) and Infrastructure Templates (used for provisioning).
A domain is a distinct area of business functionality or expertise within an organization. A domain is defined by a shared understanding of a specific problem space and the data required to solve it, as well as the business processes and activities that are relevant to that domain.
In Data Mesh, the goal is to build a decentralized data architecture that is centered around domains and that is aligned with the business. This means organizing data and data governance around the domains of the business, rather than around technology or data silos. Each domain is responsible for defining and governing the data that is relevant to it, and for ensuring that the data is accurate, consistent, and useful to the people who need it.
The concept of domains is an important part of Data Mesh because it allows organizations to align their data strategy with the needs of the business, and to create a more agile and flexible data architecture that can adapt to changing business needs over time.
An output port is a specific way in which a data product makes its data available to other data products or consumers within the organization. A data product is a self-contained unit of data functionality that is defined by a clear business purpose and that serves a specific need within the organization.
An output port is a way for a data product to expose its data to the rest of the organization in a way that is easy to consume and understand. It typically includes a clear definition of the data that is being made available, as well as documentation and other resources that explain how to use the data.
A workload is a specific task or set of tasks that are performed on a given data set. These tasks can include data processing, data transformation, data integration, data analysis, and data visualization.
Some examples of workloads include:
- Extracting data from various sources, such as databases, APIs, and file systems
- Cleaning and transforming data to make it more suitable for analysis
- Loading data into a data warehouse or data lake
An Infrastructure Template is a microservice that is capable of provisioning a component for a specific technology (e.g. a specific provisioner that creates Athena tables).
USE CASE TEMPLATE
A Use Case Template is a Backstage template that defines the structure of the component that will be provisioned. Multiple Use Case Templates can be handled by the same Infrastructure Template.
The Reverse Provisioning feature imports metadata from an infrastructure service such as AWS into a witboost component specification. You can find more information about it in the Reverse Provisioning documentation.