As businesses embrace digital transformation and migrate their critical infrastructure and applications to the cloud, what can be called “data clouds” have started to take shape. Built on multi-cloud data infrastructure, such as Databricks’ or Snowflake’s data platforms, these data clouds enable businesses to break free from application and storage silos by sharing data across on-premises, private, public and hybrid cloud environments.
As a result, data volumes have exploded. Big data applications generate and ingest increasingly diverse types of data from a variety of technologies, such as AI, machine learning (ML) and IoT sources, and the nature of datasets themselves changes drastically in both volume and form.
As data is freed from constraints, visibility into the data lifecycle becomes blurry and traditional quality control tools quickly become obsolete.
Bad data flows as easily through data pipelines as good data
For the typical business, the monitoring and management of data is still handled by legacy tools designed for another era, such as Informatica Data Quality (released in 2001) and Talend (released in 2005). These tools were designed to monitor static, siloed data, and they did it well. However, as new technologies have entered the mainstream – big data, cloud computing, and data warehouses, lakes and pipelines – the requirements for data quality tooling have changed.
Legacy data quality tools were never designed (or intended) to serve as quality control tools for today’s complex, continuous data pipelines that carry moving data from application to application and cloud to cloud. Yet data pipelines frequently feed data directly into customer experience and business decision-making software, creating enormous risk.
The risks are concrete: bad data can lead to incorrect credit scores, shipments to the wrong addresses, product defects and more. Market research firm Gartner found that “organizations believe poor data quality is responsible for an average of $15 million in losses per year.”
Where are the safety inspectors and data cleaners in the pipelines?
As developers rush to tackle the challenges of maintaining and managing data in motion at scale, most are turning first to the DevOps and CI/CD practices they have used to build modern software applications. However, porting these practices to data presents a major challenge: developers must understand that data evolves differently from applications and infrastructure.
As applications are increasingly powered by data pipelines from cloud-based data lakes and warehouses and streaming data sources (such as Kafka and Segment), there is a need for continuous monitoring of the quality of these data sources to avoid outages.
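The kind of continuous monitoring described above can be sketched in a few lines of code. The class below is a hypothetical, minimal illustration – the field names, threshold, and in-memory list standing in for a stream are all assumptions, not any particular product’s API. It observes records as they arrive and raises an alert when the null rate for a required field crosses a threshold.

```python
from dataclasses import dataclass, field

@dataclass
class QualityMonitor:
    """Tracks simple data-quality metrics over a stream of records."""
    required_fields: tuple
    null_threshold: float = 0.05  # alert if more than 5% of records miss a field
    seen: int = 0
    missing: dict = field(default_factory=dict)

    def observe(self, record: dict) -> None:
        """Update counters for one incoming record."""
        self.seen += 1
        for f in self.required_fields:
            if record.get(f) is None:
                self.missing[f] = self.missing.get(f, 0) + 1

    def alerts(self) -> list:
        """Return human-readable alerts for fields over the null threshold."""
        return [
            f"{f}: {count / self.seen:.0%} null"
            for f, count in self.missing.items()
            if count / self.seen > self.null_threshold
        ]

# Simulated stream; in practice records would arrive from Kafka, Segment, etc.
monitor = QualityMonitor(required_fields=("user_id", "event_type"))
for rec in [{"user_id": 1, "event_type": "click"},
            {"user_id": None, "event_type": "view"},
            {"user_id": 3, "event_type": None}]:
    monitor.observe(rec)

print(monitor.alerts())
```

The same observe-and-alert loop would sit alongside the consumer of a real streaming source, flagging degradation before downstream applications fail.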
Organizations must ask themselves who is responsible for inspecting data before it reaches data pipelines, and who cleans up the mess when data pipelines leak or feed bad data into mission-critical applications? Today, the typical business approach to pipeline issues and outages is purely reactive, scrambling to find fixes after applications crash.
Why Businesses Need Data DevOps
If today’s typical data-driven, multi-cloud business hopes to evolve its data platforms with agile techniques, DevOps and data teams should look at their own evolution to help them plan for the future – noting, in particular, a component that goes missing when porting DevOps to data: Site Reliability Engineering (SRE).
DevOps for software has only been successful because a strong safety net, SRE, matured alongside it. The discipline of SRE has enabled organizations to monitor software behavior after deployment, ensuring that production applications meet SLAs in practice, not just in theory. Without SRE, the agile approach would be too risky and error-prone to rely on for business-critical applications and infrastructure.
Modern data pipelines would benefit from something similar to SRE: Data Reliability Engineering.
Some organizations already perform development/staging testing on their data software, but standard dev/stage testing is hardly a quality check for big data on the move. Data has characteristics that make it unmanageable by traditional testing practices. For starters, data is harder to test because it is dynamic. The data that will flow through your pipes – often generated by applications ingesting real-time information – may not even exist during pipeline development or deployment.
If you rely on development/staging testing alone, a lot of bad data can flow through your good data pipelines, leading to crashes and errors, but your quality control tools won’t spot the problem until long after something goes wrong.
Your tests may tell you that the data sets are reliable, but that’s only because you’re testing curated samples, not the real-time data streams. Even with perfect data processing software, you can still end up with stray data flowing through your pipeline, because good pipes are designed for flow, not for quality control.
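One way to catch such stray data is to validate records in flight rather than only at development time. The sketch below is purely illustrative (the rule set and field names are invented for the example): the curated sample that passes a static test sits next to a live record that only a runtime check would flag.

```python
def validate(record: dict, rules: dict) -> list:
    """Return the names of fields in one in-flight record that violate a rule."""
    violations = []
    for field_name, check in rules.items():
        if not check(record.get(field_name)):
            violations.append(field_name)
    return violations

# Illustrative rules: a price must be a positive number, a currency must be known.
rules = {
    "price": lambda v: isinstance(v, (int, float)) and 0 < v < 10_000,
    "currency": lambda v: v in {"USD", "EUR", "GBP"},
}

# The curated test sample passes...
sample = {"price": 19.99, "currency": "USD"}
assert validate(sample, rules) == []

# ...but a live record with an upstream bug is caught in flight.
live = {"price": -1, "currency": "usd"}
print(validate(live, rules))
```

The point is the placement of the check, not its sophistication: the same rules that a staging test evaluates once can be evaluated continuously, record by record, in production.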
Getting started with Data DevOps and Data Reliability Engineering
Developing Data DevOps capabilities should not be a burden for organizations that are already adopting agile and DevOps practices. The trick is to carve out new roles and skills to suit the unique characteristics of sprawling, ever-changing, big, cloud-ready data.
However, if you follow the six steps below to lay the groundwork for proper quality control, your organization will be well on its way to reining in data that’s out of control.
1. Embrace Data DevOps and clearly define the role
Modern data presents different challenges than legacy static data (and the systems that support it), so be sure to differentiate Data DevOps roles from closely related ones. For example, data engineers aren’t quality control specialists, and they shouldn’t be. They have different priorities. The same goes for data analysts and other software engineers.
2. Determine how and where DREs will fit into your organization
The DRE should work closely with the DataOps / DevOps teams, but the role should be created within the data teams. To ensure continued quality, the DRE must be involved in all key stages of the data creation and management process.
3. Provide the tools that enable your Data DevOps team to succeed
Data DevOps should have its own set of tools, expertise and best practices – some drawn from related areas (e.g. software testing), others developed to address the unique challenges of high-volume, high-cardinality moving data.
4. Determine how to create and maintain quality checks and controls
Many data quality programs fail because the legacy and in-house tools used to create quality checks struggle to cope with complexity. As a result, these tools are themselves complex and difficult to use, and end up being shelved. It is essential to think through the process of updating and maintaining data quality controls as they evolve, relying on intuitive tools that make the work easier.
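One intuitive pattern is to keep the checks themselves declarative, so that maintaining quality controls means editing a table of rules rather than rewriting pipeline code. A minimal sketch, with invented metric names and thresholds:

```python
import statistics

# Checks live in one declarative table; updating quality controls means
# editing an entry here, not touching pipeline code.
CHECKS = [
    {"name": "row_count_min", "metric": len, "low": 2, "high": None},
    {"name": "mean_latency_ms",
     "metric": lambda rows: statistics.mean(r["latency_ms"] for r in rows),
     "low": 0, "high": 500},
]

def run_checks(rows: list, checks: list) -> list:
    """Evaluate each declarative check against a batch; return failing names."""
    failures = []
    for c in checks:
        value = c["metric"](rows)
        if c["low"] is not None and value < c["low"]:
            failures.append(c["name"])
        elif c["high"] is not None and value > c["high"]:
            failures.append(c["name"])
    return failures

batch = [{"latency_ms": 120}, {"latency_ms": 900}, {"latency_ms": 880}]
print(run_checks(batch, CHECKS))
```

Because each check is just a named metric with bounds, adding or retiring a control is a one-line change that non-specialists can review.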
5. Start the mapping processes
Remember to map the processes as the Data DevOps team evolves. Make sure your Data DevOps team knows what to do in the event of a data failure. The DRE may need to bring in other experts, such as data engineers, data analysts, or even business stakeholders, who can interpret the data and separate legitimate changes from quality issues.
How will this escalation process work in your organization?
6. Paint a clear picture of what successful remediation looks like
Big data remediation is a unique challenge. With dynamic data, some types of correction just don’t make sense. For example, even if you fix the issues that caused HTTP requests to fail or pages to load slowly, those user sessions are already lost.
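When correction after the fact is impossible, a common alternative is to quarantine suspect records at ingestion – a dead-letter pattern – so clean data keeps flowing while bad data is set aside for inspection. A minimal sketch, with an invented validity rule:

```python
def route(records: list, is_valid) -> tuple:
    """Split a batch into clean records and a quarantined (dead-letter) set."""
    clean, quarantined = [], []
    for rec in records:
        (clean if is_valid(rec) else quarantined).append(rec)
    return clean, quarantined

# Illustrative rule: only successful HTTP statuses count as clean.
is_valid = lambda r: r.get("status") in {200, 201, 204}

clean, bad = route(
    [{"status": 200}, {"status": 500}, {"status": 201}],
    is_valid,
)
print(len(clean), len(bad))
```

The quarantined set becomes the input to the escalation process described above: a DRE or data engineer can inspect it to separate legitimate changes from genuine quality issues.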
What does successful remediation look like and how do you know the problem is now under control?
Conclusion: Modern data-driven applications need Data DevOps to ensure the reliability of critical data
Modern, cloud-based, data-driven businesses need high-quality, reliable data to achieve their business goals. However, the complexity of data in modern environments means that businesses need DevOps not only for IT and applications, but also for data. Your Data DevOps team and your DREs will have a lot in common with traditional DevOps and SREs, but Data DevOps will need to be a discipline that approaches continuous data quality with fresh eyes, figuring out how to fill dangerous gaps in the data lifecycle, especially with regard to data reliability and quality.
For most businesses, the next step in taking control of your data is just to take one step, any step, towards ensuring continued quality. Make data quality control a priority, embrace Data DevOps and start determining how these new capabilities will fit into your existing DevOps, data and test teams, and you’ll be miles ahead of your competition.
About the author: Manu Bansal is co-founder and CEO of Lightup Data. He was previously co-founder of Uhana, an AI-based analytics platform for mobile operators that was acquired by VMware in 2019. He received his Ph.D. in Electrical Engineering from Stanford University.