Highlights

  • Lineage has long been a requirement for anyone processing data - whether for complying with regulations, ensuring data reliability or, to quote Marvin Gaye, plainly just knowing what’s going on, from provenance to impact analysis. However, our industry has historically had difficulty collecting data lineage reliably. From the early days of lineage powered by spreadsheets, we’ve come a long way toward standardizing lineage. We have evolved from painful, manual approaches to automated operational lineage extraction across batch and stream processing. Now we’re on the brink of a new era in which lineage will be built into every data processing layer - whether ETL, data warehouse, or AI - and not bolted on as an afterthought.
  • Lineage is not a feature; it is a means to an end. You do need lineage to achieve a specific goal, but that goal is often obscured behind the oversimplification that “we just need lineage”. When someone asks for lineage, they can mean many different things. Depending on the context, the lineage requirement is often multi-faceted, like the proverbial three kids in a trenchcoat. Let’s break it down.
  • We want to make sure our data is available, updated on time, and correct. Those attributes are not just qualities of a dataset; they derive directly from the process of producing it from other upstream datasets. Problems with data almost always come from an upstream dependency. Whether it’s delays in ingestion, bad data updates, or changes in how we collect data, anything that changes upstream can impact quality. This is why lineage is key to measuring and troubleshooting data quality issues. Guaranteeing data quality also means that quality commitments must be consistent across dependencies. For example, if a dashboard is critical, we must ensure that all datasets upstream of this dashboard - directly or indirectly - meet the same level of quality requirement. A production-level dataset should not depend on an experimental dataset with no on-call rotation or data quality expectations.
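
That consistency check becomes mechanical once a lineage graph exists. A minimal sketch, assuming a hypothetical in-memory graph of direct upstream edges and made-up quality tiers:

```python
from collections import deque

# Hypothetical lineage graph: dataset -> direct upstream dependencies.
UPSTREAM = {
    "dashboard.revenue": ["analytics.daily_orders"],
    "analytics.daily_orders": ["staging.orders", "experiments.order_sampling"],
    "staging.orders": ["raw.orders"],
}

# Hypothetical quality tiers; higher means stricter guarantees.
TIER = {
    "dashboard.revenue": 3,
    "analytics.daily_orders": 3,
    "staging.orders": 3,
    "raw.orders": 3,
    "experiments.order_sampling": 1,  # experimental, no on-call rotation
}

def check_upstream_tiers(dataset: str) -> list[str]:
    """Flag every transitive upstream dependency whose quality tier is
    lower than the tier of the dataset that depends on it."""
    required = TIER[dataset]
    violations, seen = [], {dataset}
    queue = deque([dataset])
    while queue:
        for parent in UPSTREAM.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
                if TIER.get(parent, 0) < required:
                    violations.append(parent)
    return violations

print(check_upstream_tiers("dashboard.revenue"))
# -> ['experiments.order_sampling']: a critical dashboard depends on an experiment
```
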
  • Modern data engineering is about emancipating ourselves from an uncontrolled flow of upstream changes that hinders our ability to deliver quality data.
  • Data tends to accumulate and increase in complexity. It goes through several layers of cleanup and modeling before it becomes a reliable source of insights. For example, you might have many tables named “customers”. It is critical to use the right data source and to join on the correct ID. You cannot base your analysis on raw data where spam and internal usage have not been removed. You cannot copy data containing PII into another table. For all those reasons, you need a way to discover everything that exists and also to clarify what is usable and how datasets join together. You need to be able to verify that the data you rely on is derived from the correct layer of modeling. Lineage enables our understanding of where data is coming from and where it is going.
  • In addition to expecting basic levels of correctness and understanding of our data, we are often also held accountable by regulators for our practices. Whether it’s guaranteeing our users’ privacy by making sure their data is stored and used appropriately, or tracking the flow of transactions to guarantee correctness, there are financial repercussions to not meeting the accepted bar for understanding how data flows from one dataset to the next. In a word: lineage.
  • In the olden days, we used to collect lineage manually. There would be a program organizing the collection of lineage. Someone would define a template (most likely a spreadsheet) and then ask the various people responsible for the collection, transformation, or usage of data to fill in the document. There would be extended back-and-forth communication to ensure complete coverage. Verifying integrity would be difficult, but since the process was defined and we were following it, from a compliance perspective this was a success. Unfortunately, by the time the process was over, we’d have to start again for the next iteration. With more people and more data, this quickly becomes untenable: collecting lineage takes more time than the frequency of audits allows, creating a significant burden on the organization.
  • As people suffered the manual toil of collecting lineage, the next step to improve the situation was clear: automation. Any good engineer would look at our data system of choice and figure out that most of the time, the lineage information is already there, latent. It is implicitly encoded in all the SQL queries and other programs accessing and transforming data.
  • I can reverse engineer all those transformation layers, write a SQL parser here and there, instrument the libraries, and automatically audit data access. This is far better than doing it by hand across an entire organization. However, we haven’t yet found our silver bullet; this solution has a couple of drawbacks. First, while open source solutions are easier to reverse engineer, proprietary databases are more opaque. Second, we create a dependency on a vast surface area of internal APIs that have no guarantee of stability. Every vendor and system has its own way of processing data, which makes the solution enormously complex. It is also brittle, requiring constant fixing as those internal APIs change over time. Maintaining those integrations is expensive, and the few lineage vendors in this field have been acquired and disappeared over time, making reverse engineering difficult to rely on as a solution for lineage collection.
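
To make the “write a SQL parser” step concrete, here is a minimal sketch using sqlglot (one parser among several; the query and table names are made up) that recovers table-level lineage from a single CREATE TABLE ... AS SELECT statement:

```python
# pip install sqlglot
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE analytics.daily_orders AS
SELECT o.order_date, COUNT(*) AS n_orders
FROM staging.orders AS o
JOIN staging.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

tree = sqlglot.parse_one(sql)

# The CREATE target is the output dataset; when column definitions are
# present the target is wrapped in a Schema node, hence the check.
target = tree.this if isinstance(tree.this, exp.Table) else tree.this.this
output = f"{target.db}.{target.name}"

# Every other table referenced in the statement is an input.
inputs = {f"{t.db}.{t.name}" for t in tree.find_all(exp.Table)} - {output}

print(output, "<-", sorted(inputs))
# analytics.daily_orders <- ['staging.customers', 'staging.orders']
```

The fragility described above shows up as soon as the SQL dialect drifts, the query is generated dynamically, or the system hides its queries behind a proprietary engine.
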
  • As we progressed in our quest for lineage, and met others sharing the same goal along the way, it became obvious that we could all benefit from uniting behind a common solution. By standardizing how lineage is collected, we solve multiple problems burdening our reverse engineering approach, and we share the cost with others, creating more value for everyone. But sharing solutions alone doesn’t prevent the ongoing maintenance burden of keeping up with a complex and fast-moving data ecosystem.
  • The final step of standardization is to move the responsibility of producing lineage metadata to the producer of data itself. As it emerges as a common need for all data practitioners, lineage becomes a requirement for all data tools, open source or proprietary. Now that OpenLineage has standardized how to represent lineage, there is an easy path to follow for every data processor to support exposing lineage as a built-in feature and not an afterthought.
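
For a sense of what the standard looks like in practice, here is a minimal sketch using the openlineage-python client; the backend URL, namespaces, and dataset names are illustrative assumptions:

```python
# pip install openlineage-python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Assumes an OpenLineage-compatible backend (e.g. Marquez) at this URL.
client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="examples", name="daily_orders_build")
producer = "https://example.com/my-etl/v1"  # identifies the emitting system

# START when the job begins...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
))

# ...COMPLETE when it finishes, carrying the input/output datasets.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="warehouse", name="staging.orders")],
    outputs=[Dataset(namespace="warehouse", name="analytics.daily_orders")],
))
```

Because the data processor itself emits these events, the lineage reflects what actually ran, with no reverse engineering required.
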
  • The dream of standardized lineage is steadily becoming reality. We can now see the day coming when lineage is a table-stakes feature in every data pipeline. In particular, let’s review support for lineage in key open source projects.
  • Lineage support in Airflow started as manual annotations. Soon after, OpenLineage provided automated lineage extraction with its Airflow integration. However, since it was external to the Airflow project, it would occasionally get broken by changes in internal APIs. As of Airflow 2.7, this is no longer the case: Airflow provides built-in support for OpenLineage, and it is now the responsibility of each operator to expose lineage.
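
Concretely, with the OpenLineage provider an operator can declare its own inputs and outputs. A minimal sketch, where the operator and table names are hypothetical:

```python
from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset


class CopyTableOperator(BaseOperator):
    """Hypothetical operator that copies one warehouse table into another."""

    def __init__(self, source_table: str, target_table: str, **kwargs):
        super().__init__(**kwargs)
        self.source_table = source_table
        self.target_table = target_table

    def execute(self, context):
        ...  # the actual copy logic would live here

    def get_openlineage_facets_on_complete(self, task_instance) -> OperatorLineage:
        # Called by the OpenLineage provider after the task finishes;
        # the operator itself declares what it read and what it wrote.
        return OperatorLineage(
            inputs=[Dataset(namespace="warehouse", name=self.source_table)],
            outputs=[Dataset(namespace="warehouse", name=self.target_table)],
        )
```

Lineage thus ships with the operator, so it cannot silently break when Airflow internals change.
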
  • Flink is a great example of OpenLineage support for streaming jobs. Since a streaming job runs continuously until stopped, the integration sends events on each checkpoint in addition to the start and complete events. It started as an external integration that prompted discussions about a more native implementation of lineage in Flink. This effort paves the road for native OpenLineage support.
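
The resulting event sequence can be illustrated with the same Python client as above (the real Flink integration is in Java; names here are made up): START once, a RUNNING event per checkpoint, and a terminal event when the job stops.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
run = Run(runId=str(uuid4()))
job = Job(namespace="flink", name="clickstream_enrichment")
producer = "https://example.com/flink-lineage/v1"

def emit(state: RunState) -> None:
    client.emit(RunEvent(
        eventType=state,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run, job=job, producer=producer,
    ))

emit(RunState.START)    # job submitted
for _ in range(3):      # one RUNNING event per checkpoint, as long as the job runs
    emit(RunState.RUNNING)
emit(RunState.COMPLETE) # only when the job is finally stopped
```
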
  • Companies like Foundational leverage static and dynamic analysis of source code to determine lineage at build time, enabling a better, more streamlined, and more reliable data engineering practice. These methods can provide lineage across multiple code repositories, giving visibility into lineage changes that result from code modifications. Accessing code can also simplify lineage extraction, from both an operational and a cost perspective.