Data Contracts – Everything You Need to Know — Science and Data

In many companies people are familiar with the concept of Service Level Agreements (SLAs), sometimes referred to as Service Agreements. These written agreements outline the details of what a customer expects from a service provider, as well as what can happen if the service provider fails to meet those expectations.

Data contracts, often employed in a federated data architecture, serve a similar purpose. Instead of covering services, however, a data contract is an agreement between a data producer and its data consumers. It governs the management and intended use of data across different organizations, or sometimes within a single company.

The goal? Ensure reliable, high-quality data for all parties involved.

But what is a data contract and what does it look like?

Despite the intimidating name, data contracts are not as complicated as they first appear. And they can be incredibly useful for improving accountability across all data assets. Below, in addition to exploring what a data contract is, we’ll look at why they’re needed, when to use them, and how to implement them.

Why are data contracts necessary?

First of all, let’s think about why we need data contracts.

Data teams depend on systems and services, often internal ones, that generate data which is loaded into the Data Warehouse and feeds various downstream processes. However, the software engineers responsible for these systems are not tasked with maintaining these data dependencies and are often unaware they even exist. So when they ship an update to their service that results in a schema change, the tightly coupled data systems downstream break.
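One way a data contract can catch this class of failure is to express the expected schema as data and check the producer's current schema against it, for example in CI, before a change ships. Below is a minimal sketch; the table, column names, and function are hypothetical, not part of any specific tool:

```python
# Hypothetical contract schema for an "orders" table, expressed as
# column name -> expected type. All names here are illustrative.
CONTRACT_SCHEMA = {
    "order_id": "string",
    "customer_id": "string",
    "amount": "decimal",
    "created_at": "timestamp",
}

def breaking_changes(producer_schema: dict) -> list[str]:
    """Return contract violations: columns the contract requires that
    the producer no longer emits, or whose type has changed."""
    problems = []
    for column, expected_type in CONTRACT_SCHEMA.items():
        if column not in producer_schema:
            problems.append(f"missing column: {column}")
        elif producer_schema[column] != expected_type:
            problems.append(
                f"type change on {column}: "
                f"{expected_type} -> {producer_schema[column]}"
            )
    return problems

# A producer-side change that renames created_at would be flagged
# before it ever reaches the warehouse:
new_schema = {
    "order_id": "string",
    "customer_id": "string",
    "amount": "decimal",
    "created_ts": "timestamp",  # renamed field breaks the contract
}
print(breaking_changes(new_schema))  # prints ['missing column: created_at']
```

The key idea is that the contract, not the producer's codebase, is the source of truth for what downstream consumers rely on.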

Another equally important use case is preventing data quality issues, which arise when the data brought into the Data Warehouse is not in a usable format for data consumers. A data contract that enforces specific formats, constraints, and semantic meanings can mitigate such problems.
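Such constraints can be checked row by row at ingestion time. The sketch below shows the idea with three assumed rules (the field names and rules are examples, not a standard):

```python
import re

def validate_row(row: dict) -> list[str]:
    """Check one incoming row against contract rules and return a
    list of human-readable violations (empty list = row passes)."""
    errors = []
    # Semantic constraint: amounts must be non-negative numbers.
    if not isinstance(row.get("amount"), (int, float)) or row["amount"] < 0:
        errors.append("amount must be a non-negative number")
    # Format constraint: dates must be ISO 8601 (YYYY-MM-DD).
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", row.get("order_date", "")):
        errors.append("order_date must be YYYY-MM-DD")
    # Governance constraint: raw emails must not reach the warehouse.
    if "email" in row:
        errors.append("email is PII and must be anonymized upstream")
    return errors

print(validate_row({"amount": 19.99, "order_date": "2023-04-01"}))  # prints []
print(validate_row({"amount": -5, "order_date": "04/01/2023",
                    "email": "a@b.example"}))  # prints all three violations
```

In practice these checks would typically live in the ingestion pipeline, so bad rows are rejected or quarantined before consumers ever see them.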

We know that organizations are dealing with more data than ever before, and responsibilities for that data are often distributed across domains; this is one of the key principles of a Data Mesh approach.

The name can be a bit misleading in this case – data contracts are not detailed legal documents, but a process to help data producers and data consumers get on the same page.

The more widely distributed data becomes, the more important it is to have a solution that ensures transparency and builds trust between teams using data that doesn’t belong to them.

What’s in a data contract?

For those who haven’t seen one, the idea of creating data contracts can be daunting. But once you’ve defined a format for a data contract that ensures maximum readability, creating one can be as simple as writing a few lines of text.

Data contracts can cover things like:

  • What data is being extracted
  • Type and frequency of ingestion
  • Data ownership and contact details, whether an individual or a team
  • Required data access levels
  • Security and governance information (e.g. anonymization requirements)
  • How data ingestion affects any system(s) involved

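Since there is no standard format yet, a contract covering the items above can be as simple as a version-controlled, machine-readable document. Here is a hypothetical sketch as plain Python data; every key and value is illustrative:

```python
# A minimal, hypothetical data contract covering the items listed above.
# No standard format exists yet, so all keys here are illustrative.
orders_contract = {
    "dataset": "analytics.orders",            # what data is extracted
    "ingestion": {"type": "batch", "frequency": "hourly"},
    "owner": {"team": "checkout",             # who owns the data
              "contact": "checkout-oncall@example.com"},
    "access_level": "internal",               # required access level
    "governance": {"pii_fields": [],          # security/governance notes
                   "anonymization": "not required"},
    "downstream": ["orders_daily_rollup"],    # systems affected by ingestion
    "version": "1.0.0",
}

# Because the contract is plain data, it can live in git and be
# checked programmatically, e.g. that a contact is always named:
assert orders_contract["owner"]["contact"], "contract must name a contact"
```

Teams often serialize the same structure as YAML or JSON so that non-engineers can read and review it in pull requests.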
As data contracts can differ substantially based on the type of data they cover, as well as the type of organization using them, we have yet to see a significant degree of standardization of data contract formats and content. However, a set of best practices could still emerge in the future, as happened with the OpenAPI Specification.

Who is responsible for data contracts?

The decision to adopt data contracts rests with data leaders, even if they are not necessarily the implementers. It is worth noting, however, that contracts require input and buy-in from all stakeholders involved in producing and consuming the data.

Data consumers tend to be the most motivated participants, as data contracts clearly make their lives easier. Data producers such as software engineers may need some convincing: show them how data contracts can benefit the organization and improve data quality without much additional effort on their part.

To that end, it’s worth pointing out that data contracts are largely evergreen and don’t need a lot of ongoing maintenance. Aside from occasional versioning tweaks and updates to contact details, they shouldn’t create a significant burden once they’re up and running.

When should data contracts be implemented?

You might assume that the answer to the question of when to implement data contracts would be “the sooner the better”. But let’s say you’re still working to gain organizational buy-in for a Data Mesh approach. Adding data contracts to the mix can complicate things and risks overwhelming stakeholders.

It might be worth making sure you have everything in place, with stable, reliable data pipelines running smoothly, before diving into data contracts. On the other hand, if your team is pursuing any kind of Data Mesh initiative, that is an ideal time to make data contracts part of it.

Of course, questions like “when should data contracts be implemented?” and “how long does it take to implement data contracts?” tend to have similar answers: in both cases, “it depends”.

In other words, this is not (nor does it need to be) an overnight process. And, when you get started, you can keep things simple. Once you’re armed with the knowledge you’ve gathered from team members and other stakeholders, you can start implementing data contracts.

What’s next for data contracts?

Historically, data management within an organization has been the responsibility of a dedicated team, or in some cases the mission of just one brave (and possibly overworked) Data Scientist. In such situations, data contracts weren’t really needed to maintain order.

As organizations move towards a Data Mesh approach – domain-driven architecture, self-service data platforms and federated governance – this is no longer the case. When data is viewed as a product, with different teams and sub-teams contributing to its maintenance, mechanisms to keep everything coordinated and running smoothly become much more important.

Data contracts are still a relatively new idea. They’re an early attempt to improve the maintainability of data pipelines and address the issues that come with breaking up a monolith, so we’ll likely see more iterations and alternative approaches emerge in the future.

At the moment, however, they might just be the best solution at our disposal to avoid data quality issues arising from unexpected schema changes.

David Matos

