Gable Blog | What is Data Lineage?

As data grows more complex, so does big data analytics. In turn, data compliance and regulation grow more complex as well.

All of this ripples out to make data governance more complex, too.

While data-driven decision-making actually becomes easier, maintaining competitive data advantages becomes more difficult. Access to quality data is already vital, but understanding how an organization uses that quality data is now essential.

Enter: Data lineage.

What is data lineage?

Data lineage refers to how specific data is used and transformed over time. Data professionals employ data lineage practices to record and share this data use and transformation as it occurs.

Time out: How is data lineage different from data provenance and data governance?

One of the challenges of data management is that you can’t break out aspects of the data lifecycle and affix each to specific periods of time—past, present, or future.

This is why data, especially in large, complex systems, can be hard to get a handle on. We can’t cleanly break it down into easily categorized chunks. As a result, it can seem like all data (e.g., data flows, data assets, and data environments) is, at times, everything, everywhere, all at once.

However, with data’s growing role in our lives, it’s important everyone works to understand these fundamental aspects of data management—as our collective ability to ensure data security, compliance, and data-driven decision-making relies on it.

Fortunately, we can use our new book on data contracts (now available in early release from our friends at O’Reilly Media) as an illustrated example of how provenance, governance, and lineage all relate.

Provenance: Our book, Data Contracts: Developing Production-Grade Pipelines at Scale, has two authors and a unique editorial history. The record of this authorship and editorial history over time is like the book’s data provenance.
Governance: Our publisher, O’Reilly, sets forth specific guidelines and standards—a prescribed framework—determining how the book should be used, secured, and maintained. These guidelines and standards also ensure the level of quality the book must aspire to. As such, this framework acts as the book’s governance.
Lineage: Finally, we want the book to find its way into the hands of people who will love it. Our distribution plan to make this happen can be mapped (i.e., tracked)—clearly showing how the book goes on to reach readers, where it is sent (both on and offline), who distributes it, and how it moves through various systems and processes as a distinct body of work. This map is the book’s lineage, and as such, the map becomes essential for troubleshooting, measuring the book’s impact, and analyzing any dependencies related to it.

While the above hopefully parses the different roles that provenance, governance, and lineage play in effective data management, our hope is that it also illustrates their interdependencies—how each of the three functionalities improves the other two when orchestrated together.

9 Common types of data lineage

Now that we’ve simplified things, let’s go ahead and complicate them again (just a bit).

In data management, there are many different kinds of data lineage. There need to be—because different stakeholders and departments within an organization can use the same data in very different ways. And knowing everything about all data all the time isn’t necessary (despite what that one data analyst two cubicles over would have you believe).

Depending on these varying user requirements and perspectives, some aspects, dimensions, and attributes will make more sense to map in one instance, and less (or not at all) in another. Other factors that determine what should or should not be mapped include data complexity, regulatory compliance, data governance, and strategy. This leads us to nine common types of data lineage in use across organizations.

Backward lineage: This type of data lineage traces the data flow from a given point back to its point of origin. Doing so can help data teams verify data quality related to different data outputs (e.g., data sets, reports, analytics results, etc.).
Forward lineage: By tracing data flow from a given point to its final destination, teams can evaluate the relevance and value of data inputs and measure the impact of changes to data or alterations enacted upon it.
Horizontal lineage: The flow of data across different systems, platforms, and applications is mapped using horizontal data lineage—showing how data moves, integrates, and transforms between its source and its target. Mapping horizontal lineage also includes the tools and processes involved in these data flows.
Vertical lineage: By contrast, vertical lineage maps the flow of data within a specific system, platform, or application. This information helps data teams understand how data in these locations is structured, stored, and accessed at different layers, including tables, columns, keys, records, fields, and files.
End-to-end lineage: Teams may also map the horizontal and vertical aspects of data together, referred to as end-to-end lineage. This is done when stakeholders need a holistic view of data provenance, usage, and quality.
Table-level lineage and column-level lineage: When teams need to track data more closely than at the system level, table-level lineage (mapped at the table level of databases or systems) provides a macro view of how data flows between tables. Viewing the data at the table level can be useful when performing impact analysis, data integrations, migrations, or compliance and auditing.

However, data quality management, debugging, or detailed impact analysis can require an even more granular view of data. In these cases, column-level lineage illuminates how data moves between individual columns or fields in a table, shows how that data is altered, and provides a view of its dependencies.
Technical data lineage: Zooming back out, teams map technical data lineage to broadly track the flow of data as it moves through technical processes and transformations. The details of data’s journey from data producers to its final destination(s) are key for maintaining data efficiency, integrity, and transparency throughout an organization.
Operational data lineage: Additionally, mapping how data is used and accessed in day-to-day business operations provides another way to track how data moves throughout an organization (albeit a slightly broader one). By focusing on operational systems, teams can support business impact analysis, compliance, and reporting, and gauge operational efficiency.
Business data lineage: Finally, data flowing between business processes can also be mapped. Business data lineage is less concerned with the specifics of what is happening to data. Often used by less technical stakeholders, this form of mapping tells the story of how data is impacting the business as a whole.
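
To make a few of these types more concrete, here is a minimal Python sketch. It assumes lineage is stored as a simple list of column-to-column edges. The table and column names (raw.orders.amount and so on) and the helper functions are hypothetical, made up for illustration rather than taken from any particular lineage tool. Under those assumptions, backward and forward lineage become graph traversals, and table-level lineage is a roll-up of the column-level edges.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, target column).
COLUMN_EDGES = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.orders.currency", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.daily_total"),
    ("raw.customers.id", "marts.revenue.customer_id"),
]


def build_adjacency(edges, reverse=False):
    """Map each node to its direct neighbors (upstream ones if reverse=True)."""
    adjacency = defaultdict(set)
    for source, target in edges:
        if reverse:
            adjacency[target].add(source)
        else:
            adjacency[source].add(target)
    return adjacency


def trace(start, edges, reverse=False):
    """Breadth-first walk returning every node reachable from `start`."""
    adjacency = build_adjacency(edges, reverse)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def backward_lineage(column, edges=COLUMN_EDGES):
    """Everything the given column was derived from."""
    return trace(column, edges, reverse=True)


def forward_lineage(column, edges=COLUMN_EDGES):
    """Everything that depends on the given column."""
    return trace(column, edges)


def table_level(edges=COLUMN_EDGES):
    """Roll column-level edges up to table-level edges."""
    def table_of(column):
        return column.rsplit(".", 1)[0]
    return {(table_of(source), table_of(target)) for source, target in edges}


if __name__ == "__main__":
    print(sorted(backward_lineage("marts.revenue.daily_total")))
    print(sorted(forward_lineage("raw.orders.currency")))
    print(sorted(table_level()))
```

The same edge list answers different questions depending on the direction of the walk and the level of roll-up, which is one way to think about why these lineage types coexist rather than compete.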

Data lineage best practices: Our 9 recommendations

Settling on the right best practices for data lineage is a lot like settling on the best title for a book on data contracts; both are exercises in limitations and precision. (And another good option always pops to mind the moment you think you’re “done.”)

That said, we think the following nine best practices, as a whole, accomplish two things:

They can be used to support the foundational needs for tracking effective data lineage across most organizations.
They balance these foundational needs with some advanced enhancements.

1. Clarify objectives, tying them to IT and business needs

Establishing clear objectives before embarking on any initiative, data-related or otherwise, is crucial. In this instance, clarity helps data leaders ensure that all tools, policies, and procedures that make up an organization’s data lineage practices will be efficient, sustainable, and aligned with ongoing business needs.

It’s also worth considering how data contracts can help you get the most out of this process. Ongoing, tangible evidence of how data lineage practices benefit an organization over time is hard to overvalue.

2. Leverage automation to maximize lineage fidelity and scalability

Plan to automate data lineage practices as much as possible. Gaining access to accurate and consistent information is the point, after all. Automation, especially automated data discovery, can play a crucial role here, as it reduces the risks of errors inherent in manual processes.

At the same time, automation promotes scalability—ensuring that data lineage practices (especially those related to metadata capture and management) remain functionally efficient over time.
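
As one possible illustration of automated capture, the sketch below uses a Python decorator to record a lineage event every time a pipeline step runs. The names here (track_lineage, LINEAGE_LOG, clean_orders, and the dataset names) are hypothetical and exist only for this example. It is a minimal sketch, not a stand-in for a dedicated lineage tool.

```python
import functools
from datetime import datetime, timezone

# In-memory store for the sketch; a real setup would persist these events.
LINEAGE_LOG = []


def track_lineage(inputs, output):
    """Decorator that records a lineage event each time the step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Only log after the step succeeds, so failed runs leave no event.
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@track_lineage(inputs=["raw.orders"], output="staging.orders")
def clean_orders(rows):
    """Drop rows without an amount (a stand-in transformation)."""
    return [row for row in rows if row.get("amount") is not None]


if __name__ == "__main__":
    clean_orders([{"amount": 10}, {"amount": None}])
    print(LINEAGE_LOG)
```

Note that the inputs and output are still declared by hand in this sketch. Only the recording is automatic, so the metadata is captured consistently on every run instead of depending on someone remembering to update a document.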

3. Vet tool and system integrations

For most of us tasked with implementing data lineage practices, chances are good an established data environment will already be in operation. Audit existing data management tools based on established objectives, ensuring they’ll contribute to maximizing the utility of data lineage information. You can then determine if you’ll need to invest in a dedicated data lineage tool or if some combination of existing tools and systems will provide the needed functionality.

While reviewing integration capabilities, you might request demonstrations and trials, assess levels of support from potential vendors, and conduct cost-benefit analyses as needed.

4. Normalize (and standardize) comprehensive documentation

Maintaining detailed and accurate records of data lineage should be the rule, not the exception. This documentation becomes crucial for understanding data flows and ensuring data quality throughout the lifecycle.

Depending on the size of the organization, data teams may also need to stay vigilant about keeping this documentation standardized over time. Standardization contributes to robust data governance by guiding consistent understanding and use across departments and teams, which reduces confusion and fosters solid communication.

5. Establish robust security and access control

Security has always been critical for protecting sensitive lineage information. But it’s increasingly critical to position data security as the responsibility of everyone, not just those in IT. To this end, implement secure communications and help your co-workers understand, at the most basic level, why encryption and secure APIs are being used.

Establish a consistent cadence for patching and updating systems and tools, and maintain detailed access logs. Implement monitoring tools to automatically detect unusual access patterns and ensure the right alerts get sent to the right people at the right times. And, as more of the organization leans in to keep data secure, robust access controls become increasingly important. Consider applying security principles like the principle of least privilege as you define and refine which roles and responsibilities get specific data access.
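
As a rough illustration of least privilege and access monitoring, here is a small Python sketch. The roles, permission strings, and the fixed read threshold are all assumptions made for the example. Real monitoring tools typically use richer signals than a simple count.

```python
from collections import Counter

# Hypothetical role-to-permission mapping: each role gets only what it needs.
ROLE_PERMISSIONS = {
    "analyst": {"lineage:read"},
    "data_engineer": {"lineage:read", "lineage:write"},
    "auditor": {"lineage:read", "access_log:read"},
}


def is_allowed(role, permission):
    """Deny by default: unknown roles and unlisted permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def flag_unusual_access(access_log, max_reads_per_user=100):
    """Return users whose read count exceeds a fixed (assumed) threshold."""
    reads = Counter(
        entry["user"] for entry in access_log if entry["action"] == "read"
    )
    return sorted(
        user for user, count in reads.items() if count > max_reads_per_user
    )


if __name__ == "__main__":
    print(is_allowed("analyst", "lineage:write"))
    sample_log = [{"user": "ana", "action": "read"}] * 3
    print(flag_unusual_access(sample_log, max_reads_per_user=2))
```

The deny-by-default check is the part that reflects least privilege: access has to be granted explicitly, and anything not listed is refused.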

6. Take time to make time for stakeholder buy-in

At this point, you’re more or less ready to approach stakeholders for their support. Make sure you do, and that explicit buy-in is what you actually walk away with.

Depending on your stakeholders, it may help to engage with them early and often. Make sure you clearly outline and identify the benefits the organization’s data lineage practices will have. Demonstrate how the lineage practices align with the business and set realistic expectations, the latter of which can often be aided through a strategic series of pilot programs.

7. Visualize lineage and support ongoing training

You probably visualized parts of your lineage proposal to help sell it. Carry the visualization forward, representing data lineage in ways that make it easy for employees with different experience levels, skill sets, and backgrounds to understand and digest.

Where possible, promote training that ensures different users understand how to leverage the tool or tools used to map data lineage. (Note: To promote this training, consider pizza.)

8. Monitor and measure mapping effectiveness

Systematic measurement requires systematic monitoring of how effective data lineage practices are over time. Doing both well helps data teams ensure that supporting systems are robust and responsive, and that they stay aligned with organizational governance and management objectives.

Ideally, this monitoring and measurement isn’t limited to lineage alone but functions as part of broader data management efforts.
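
What "effectiveness" means will vary by organization. As one hedged example, the Python sketch below computes two simple measures: the share of datasets that have any documented lineage, and the datasets whose lineage has not been verified recently. The function names, the 90-day threshold, and the input shapes are assumptions made for illustration.

```python
from datetime import date


def lineage_coverage(datasets, documented):
    """Share of datasets that have at least one documented lineage record."""
    if not datasets:
        return 0.0
    covered = sum(1 for name in datasets if name in documented)
    return covered / len(datasets)


def stale_records(last_verified, today, max_age_days=90):
    """Datasets whose lineage was last verified more than `max_age_days` ago."""
    return sorted(
        name
        for name, verified_on in last_verified.items()
        if (today - verified_on).days > max_age_days
    )


if __name__ == "__main__":
    all_datasets = ["raw.orders", "staging.orders", "marts.revenue", "raw.events"]
    with_lineage = {"raw.orders", "staging.orders", "marts.revenue"}
    print(lineage_coverage(all_datasets, with_lineage))

    verified = {"raw.orders": date(2024, 1, 1), "marts.revenue": date(2024, 5, 1)}
    print(stale_records(verified, today=date(2024, 5, 21)))
```

Tracked over time, numbers like these can show whether lineage practices are keeping pace with the data environment or quietly falling behind.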

9. Review and update lineage practices regularly

The tail of the best-practices snake here consists of regular audits and reviews of the data lineage process as it unfolds.

This is vital, as it allows data teams to adjust the granularity of data lineage mapping, balancing utility across users and uses while optimizing for evolving stakeholder needs.

A data catalog can also be beneficial here. When embedded with data lineage information, a catalog makes it easier for users to find and understand the data they need, enhancing overall data management.
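
For a sense of what embedding lineage in a catalog might look like, here is a minimal Python sketch of a catalog entry that carries its upstream and downstream datasets. The CatalogEntry fields and the describe helper are hypothetical. Actual catalog products define their own schemas.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A hypothetical catalog record with lineage stored alongside metadata."""
    name: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)


def describe(entry):
    """Render a short summary a catalog user might see."""
    sources = ", ".join(entry.upstream) or "none recorded"
    consumers = ", ".join(entry.downstream) or "none recorded"
    return (
        f"{entry.name} (owner: {entry.owner})\n"
        f"  {entry.description}\n"
        f"  built from: {sources}\n"
        f"  used by: {consumers}"
    )


if __name__ == "__main__":
    entry = CatalogEntry(
        name="marts.revenue",
        owner="finance-data",
        description="Daily revenue totals.",
        upstream=["staging.orders"],
    )
    print(describe(entry))
```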

Remember that in data lineage, “well-begun” is half-done

The potential impact of a map directly correlates to the quality of information used to create it.

Best practices, in addition to a clear understanding of the concept, certainly make data lineage practices more efficient and effective. But shifting the emphasis and expectation of data quality further left can make them exceptional.

Be among the first to find out the role data contracts can (and increasingly do) play in the lineage of data by signing up for our product waitlist at Gable.ai.

Chad Sanderson

May 21, 2024
