Data Product Versioning
Data Product change examples
Breaking changes
You have a data product version 1.0.0 with 2 workloads (A, B) and 2 output ports (a file storage and a SQL view). Let's say you want to remove the file storage output port. Since consumers might depend on this output port, this is a "breaking change" ❌ and therefore requires a new version of the data product. So we will need to:
- create a new data product version 2.0.0
- remove the undesired output port from version 2.0.0
- deploy the new data product version (the other components were copied into the new version but their data was not, so you have full control over what happens: for example, you can create a different bucket/directory when changing the version, even if the data hasn't changed)
- notify the consumers so they can set up a migration plan from 1.0.0 to 2.0.0
- maintain both version 1.0.0 and 2.0.0 up and running in parallel while there is at least one active consumer of version 1.0.0 (if there are no consumers of the SQL view output port, this could be immediate)
- when there are no more consumers for version 1.0.0, undeploy it
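The rule in the last two steps (keep the old version running until its last consumer has migrated) can be sketched as follows. This is a minimal illustration only: the `DeployedVersion` class, the `can_undeploy` function and the consumer name are hypothetical and do not correspond to any platform API.

```python
from dataclasses import dataclass, field


@dataclass
class DeployedVersion:
    """A deployed data product version and its active consumers (hypothetical model)."""
    version: str
    consumers: set[str] = field(default_factory=set)


def can_undeploy(deployed: DeployedVersion) -> bool:
    """An old version may be undeployed only once nobody consumes it any more."""
    return len(deployed.consumers) == 0


# Both versions run in parallel while 1.0.0 still has a consumer.
v1 = DeployedVersion("1.0.0", {"marketing-dashboard"})
v2 = DeployedVersion("2.0.0")
still_needed = not can_undeploy(v1)

# The consumer completes its migration plan from 1.0.0 to 2.0.0.
v1.consumers.discard("marketing-dashboard")
v2.consumers.add("marketing-dashboard")
ready_to_retire = can_undeploy(v1)
```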
Non-breaking change
Consider the same data product 1.0.0 as in the previous scenario, but this time, for internal reasons of your own, you need to get rid of workload B. Once you have verified that the data exposed at the output ports does not change in a way that breaks consumers, you can simply edit the components accordingly and release a minor change. Note that removing a storage area is an even less impactful change, since by definition a storage area is for internal use only.
When adding a component, whether an internal storage area or an output port, the change is always backward compatible, so it does not require a new version of the data product to be created.
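The rules from the two scenarios above can be condensed into a small decision function. This is only a sketch of the reasoning, not a platform feature: the `Kind` and `Action` enums, the `is_breaking` function and the `output_data_unchanged` flag are all names invented for illustration, and treating a workload removal that alters output data as breaking is an assumption drawn from the text.

```python
from enum import Enum


class Kind(Enum):
    OUTPUT_PORT = "output port"
    WORKLOAD = "workload"
    STORAGE = "storage area"


class Action(Enum):
    ADD = "add"
    REMOVE = "remove"


def is_breaking(action: Action, kind: Kind, output_data_unchanged: bool = True) -> bool:
    """Return True when the change calls for a new data product (major) version."""
    if action is Action.ADD:
        # Adding a component is always backward compatible.
        return False
    if kind is Kind.OUTPUT_PORT:
        # Consumers might depend on the removed output port.
        return True
    if kind is Kind.WORKLOAD:
        # Assumption: safe only if the data at the output ports is unaffected.
        return not output_data_unchanged
    # Storage areas are for internal use only.
    return False


remove_file_storage_port = is_breaking(Action.REMOVE, Kind.OUTPUT_PORT)
remove_workload_b = is_breaking(Action.REMOVE, Kind.WORKLOAD)
```

Here `remove_file_storage_port` is `True` (new version 2.0.0 needed) and `remove_workload_b` is `False` (a minor release is enough).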
Versioning
Now, let's take a moment to focus on the repositories and how they will be versioned and organized.
In each repository, we store all versions of an entity. For example, you may have a data product repository Customer and two of its components' repositories: CustomerS3OutputPort and CustomerSparkWorkload. The repositories will have a branch called master that represents the main branch for version 0, and other branches called release/1, release/2, etc. that will act as main branches for other data product versions. These branches can be thought of as the main branches of distinct repositories.
To better clarify this concept, when we talk about versioning we need to highlight two main concepts:
- entity version: the complete traditional version in the form “major.minor.patch”. All entities (each component and the data product itself) have their own version, which reflects the features they contain. The entity version can be found in the catalog-info.yaml files.
- data product version: each data product version represents an independent data product, distinguishable by the major version number. Two data product versions of the same Data Product can be considered two completely different data products. All components of a Data Product share its data product version as the major number of their entity version. The data product version changes when breaking changes are introduced, and each data product version is stored in a separate branch.
Each data product version represents a Data Product: all the repositories (the data product one and the component ones) are part of it. To avoid creating a new repository each time we create a new version of a data product, all repositories will have multiple release branches identified by the data product (major) version (e.g. release/1, release/2, etc.). Each of those branches must be treated as the main branch of that specific data product version and, once created, represents a component completely independent from the one before it. So, if you need to change a component at version 2.3.0 (that is, one belonging to data product version 2), you should create a feature branch from the release/2 branch of that component's repository and merge it back into release/2 when completed. The master branch represents version 0 of the data product and is not intended to be updated with features added to other data product versions.
In the future, we may remove the master branch in favor of a release/0 branch to avoid confusion, but for the moment the master branch must be treated as a release/0 branch. All release branches must be treated as the main branch of the data product of that version: you can always think of them as being different repositories, just grouped into a single one to avoid repository proliferation.
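The mapping between an entity version and the branch that acts as its main branch can be sketched in a few lines. The function name `release_branch` is hypothetical, and the sketch assumes the main branch for version 0 is called master, as described above.

```python
def release_branch(entity_version: str) -> str:
    """Return the branch acting as main branch for a "major.minor.patch" version."""
    major = int(entity_version.split(".")[0])
    if major < 0:
        raise ValueError(f"invalid major version in {entity_version!r}")
    # master plays the role of release/0 for the moment.
    return "master" if major == 0 else f"release/{major}"


# A component at version 2.3.0 belongs to data product version 2,
# so feature branches start from (and merge back into) release/2.
base_branch = release_branch("2.3.0")
```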
This solution lets us define only a small set of repositories while preserving the Git commit history of each one and keeping component versions consistent with the release version. In the catalog, each data product version and component version is seen as a completely different and independent repository, even though, to improve the developer experience, they are just different branches of the same repository.
A new data product version should be created only in case of compatibility-breaking changes. Initially, all repositories will have a main branch (usually called master) which represents version 0, and then all subsequent versions will have a dedicated release branch (release/1, release/2, etc.). Inside these branches, we will define the specific component version in the catalog-info.yaml (e.g. 1.0.0 for all entities initially, Data Product included). So a starting scenario could be: Data Product release version: 1, Customer version = 1.0.0, CustomerS3OutputPort version = 1.0.0, CustomerSparkIngestion version = 1.0.0.
Each time a compatibility-breaking change is introduced in one component, we need to create a new data product version for the whole data product. This will result in the creation of a new release/2 branch in all the repositories. Each of those will contain an initial component version 2.0.0.
If we perform non-breaking changes, we simply upgrade the version of the specific component we are editing (we do not change the release version). For example, if you add a log statement to the Spark code of your CustomerSparkIngestion, you simply update its minor version (or its patch version, in case of a defect fix). This does not create a new dedicated release branch, as is done for new data product versions; the change is simply committed to the current main branch for that data product version. So in this case we would have: data product version: 2, Customer version = 2.0.0, CustomerS3OutputPort version = 2.0.0, CustomerSparkIngestion version = 2.1.0.
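The two bumping rules can be replayed on the Customer scenario with a short sketch. The helper names (`bump`, `start_new_data_product_version`, `apply_non_breaking_change`) are hypothetical, and representing the entities as a plain dictionary of version strings is an assumption made only for this illustration.

```python
def bump(version: str, part: str) -> str:
    """Bump the minor or patch number of a "major.minor.patch" version."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unsupported part: {part!r}")


def start_new_data_product_version(entities: dict[str, str]) -> dict[str, str]:
    """Breaking change: every entity restarts at <next major>.0.0."""
    majors = {int(v.split(".")[0]) for v in entities.values()}
    if len(majors) != 1:
        raise ValueError("all entities must share the data product (major) version")
    next_major = majors.pop() + 1
    return {name: f"{next_major}.0.0" for name in entities}


def apply_non_breaking_change(entities: dict[str, str], name: str, part: str = "minor") -> dict[str, str]:
    """Non-breaking change: only the edited entity gets a new version."""
    updated = dict(entities)
    updated[name] = bump(entities[name], part)
    return updated


v1 = {
    "Customer": "1.0.0",
    "CustomerS3OutputPort": "1.0.0",
    "CustomerSparkIngestion": "1.0.0",
}
# Breaking change somewhere in the data product: release/2 for every repository.
v2 = start_new_data_product_version(v1)
# Adding a log statement to the Spark job: minor bump for that component only.
v2 = apply_non_breaking_change(v2, "CustomerSparkIngestion", "minor")
```

After these two steps `v2` holds 2.0.0 for Customer and CustomerS3OutputPort and 2.1.0 for CustomerSparkIngestion, matching the scenario described above.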
When an update draft release operation is performed, the system takes the current draft release of each repository of the selected release version, creates a descriptor by merging them, and stores it in the data product repository in a release directory.