The concept of data lineage is not new. In the early data processing days, it took the shape of flowcharts and data flow diagrams. With the advent of Microsoft Excel, it manifested itself in spreadsheets. In the contemporary world, data lineage exists in many software products that move, transform and manipulate data.
Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. (Source: wikipedia.org)
If I take a utopian view of data lineage, I can click on a data element in a report and see its journey through my data world, including every manipulation and transformation in shape and form along the way. Many a salesperson is happy to showcase their product on their self-contained slice of the data world. That is not the reality in a corporate environment with many platforms, databases, data sources, and hard-coded legacy apps. Here are five challenges that make that utopian view of data lineage just a bit harder to reach.
Data Lineage Challenges
- Not everything is a bright, shiny tool.
Data and applications are old in a mature organization. Tracing lineage through legacy apps with hard-coded transformations is nearly impossible. Add to the mix cowboy MS Excel apps that are undocumented or known only to a select few. Older and “boutique” data structures and databases are strangers to new technology. These are a few of the challenges that leave assets disconnected and unable to talk to their child or parent assets in the lineage. My world is that diverse set of technologies, platforms, and structures. That means data lineage needs a manual component and does not happen through sleek, fully automated means.
- Bodies, you need bodies.
Everyone loves data today. Unfortunately, not everyone agrees on what resources it takes to manage data. The best-architected data lineage plan can fail if the resources, i.e. people and money, are not allocated. Resources become even more critical in environments where data lineage is a manual process, e.g. maintained in spreadsheets. I can almost guarantee that most of my readers work in an environment where data lineage requires manual intervention. Making sure that management knows this, and understands the costs and benefits of data lineage, can be a hard sell.
- It’s not just about what you need.
Managing data lineage takes more than a metadata administrator. Developers, DBAs, data architects, ETL and BI developers are just some of the roles that produce and use data assets. Data lineage takes rigor, standardized practices, and enforced standards. If these data consumers and producers don’t see the value of data lineage, they most likely will not follow the carefully crafted data lineage processes. These consumers and producers need to look past their slice of the enterprise and understand how the enterprise benefits from data lineage.
- Your silver bullet is not mine.
Do you know what data lineage means to your customers? Is your definition the same as theirs? Data lineage is costly to manage. It is important to understand at what level of detail your audience needs data lineage. It’s likely that there are several views of data lineage needs, based on each employee’s role and job function. The cost of that utopian data lineage may not be justified by your consumers’ needs. Craft the data lineage solution that delivers the most value and is right-sized for your consumers. Just about any data lineage solution can dive deeper into the detail of lineage in future iterations. Never implement a data lineage plan that cannot be managed with the resources (people and money) you are given.
- But this is hard!
That MS Excel spreadsheet looks like a simple solution for launching a data lineage program. There are just a few columns for the source and target assets. First, understand that a spreadsheet is only a tool you use for data lineage. The processes and standards surrounding data lineage are what give it credibility and usability. Over the years, I have seen different teams use spreadsheets to document their data lineage. The result is a collection of spreadsheets, usually lost in shared network folders. They become out of date almost immediately. They don’t tackle the hard part of linking data assets across spreadsheets. Even more commonly, they start or stop at data assets that the team does not understand, does not control, or does not know about. The rigor that brings these assets together and paints the complete picture is hard and cannot be overlooked. Shortcut the hard stuff and you will most likely lose support for, and confidence in, your data lineage efforts.
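To make the "linking across spreadsheets" problem concrete, here is a minimal Python sketch. It assumes each team's spreadsheet has already been read into a list of (source, target) rows. The asset names, the two sample sheets, and the helper functions are all hypothetical and only illustrate the idea; this is not a recommended tool or a complete implementation.

```python
from collections import defaultdict

# Hypothetical lineage rows, as if read from two teams' spreadsheets.
# Each row is (source_asset, target_asset); all asset names are made up.
team_a_rows = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.dim_customer"),
]
team_b_rows = [
    ("warehouse.dim_customer", "reports.churn_dashboard"),
    ("legacy_app.export", "reports.churn_dashboard"),
]


def build_upstream_index(*sheets):
    """Merge edge lists from several sheets into one target -> sources map."""
    upstream = defaultdict(set)
    for rows in sheets:
        for source, target in rows:
            upstream[target].add(source)
    return upstream


def trace_upstream(asset, upstream):
    """Return every asset that feeds `asset`, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        # .get avoids adding empty entries to the defaultdict while reading.
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def undocumented_origins(asset, upstream):
    """Upstream assets that have no recorded sources of their own."""
    return {a for a in trace_upstream(asset, upstream) if a not in upstream}


# One sheet alone stops at the edge of that team's knowledge.
team_b_only = build_upstream_index(team_b_rows)
partial = trace_upstream("reports.churn_dashboard", team_b_only)

# Merging the sheets on shared asset names extends the picture.
merged = build_upstream_index(team_a_rows, team_b_rows)
full = trace_upstream("reports.churn_dashboard", merged)
gaps = undocumented_origins("reports.churn_dashboard", merged)
```

With only the second sheet, the trace stops at `warehouse.dim_customer`. Once both sheets are merged it reaches back to `crm.customers`, and `legacy_app.export` shows up as an origin nobody has documented further. Note that the merge only works because both sheets spell the shared asset name identically, which is exactly the kind of standard that has to be enforced by process rather than by the spreadsheet itself.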