Shifting Left with Data DevOps | Chad Sanderson | Shift Left Data Conference 2025

Talk Description

Data DevOps applies rigorous software development practices—such as version control, automated testing, and governance—to data workflows, empowering software engineers to proactively manage data changes and address data-related issues directly within application code. By adopting a "shift left" approach with Data DevOps, SWE teams become more aware of data requirements, dependencies, and expectations early in the software development lifecycle, significantly reducing risks, improving data quality, and enhancing collaboration.

This session will provide practical strategies for integrating Data DevOps into application development, enabling teams to build more robust data products and accelerate adoption of production AI systems.

Additional Shift Left Data Conference Talks

Shifting Left with Data DevOps (recording link)

  • Chad Sanderson - Co-Founder & CEO - Gable.ai

Shifting From Reactive to Proactive at Glassdoor (recording link)

  • Zakariah Siyaji - Engineering Manager - Glassdoor

Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)

  • Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain

Panel: State of the Data And AI Market (recording link)

  • Apoorva Pandhi - Managing Director - Zetta Venture Partners
  • Matt Turck - Managing Director - FirstMark
  • Chris Riccomini - General Partner - Materialized View Capital
  • Chad Sanderson (Moderator)

Wayfair’s Multi-year Data Mesh Journey (recording link)

  • Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
  • Piyush Tiwari - Senior Manager of Engineering - Wayfair

Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)

  • Sarah McKenna - CEO - Sequentum

Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)

  • Barr Moses - Co-Founder & CEO - Monte Carlo
  • Tristan Handy - CEO & Founder - dbt Labs
  • Prukalpa Sankar - Co-Founder & CEO - Atlan
  • Chad Sanderson (Moderator)

Shift Left with Apache Iceberg Data Products to Power AI (recording link)

  • Andrew Madson - Founder - Insights x Design

The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)

  • Mark Freeman - Tech Lead - Gable.ai

Building a Scalable Data Foundation in Health Tech (recording link)

  • Anna Swigart - Director, Data Engineering - Helix

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)

  • Abhi Ghosh - Head of Data Observability - Capital One

Panel: How AI Is Shifting Data Infrastructure Left (recording link)

  • Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
  • Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
  • Carly Taylor - Field CTO, Gaming - Databricks
  • Chad Sanderson (Moderator)

Transcript

*Note: Video transcribed via AI voice-to-text; there may be inconsistencies.

"Alright. Hey everybody. Good morning. Depending on where you are, maybe a good evening if you are somewhere that's not in the Pacific Northwest or the Eastern part of the United States. It's good to see everybody.

This is obviously our very first Shift Left event, so we are super duper excited for that. The first, hopefully not the last. We got the idea of Shift Left a few months ago when we were talking to some of our customers at Gable and other folks out in the broader industry, and we essentially realized that there was a lot of conversation about what shifting left meant, about what data contracts are, about how you move data quality to the source.

And we were seeing really awesome things with our customers, and we were also hearing great things about what was happening in the broader industry. But the industry at large wasn't getting a lot of those learnings. So I do have some slides that should be coming up in just a second.

But really what I'm gonna be focusing on today is the start, at least the start, of that journey from our perspective. So what has Gable been thinking about over the last two years? Where have we seen companies make really interesting investments into shifting left, and where is it starting to pay off?

So cool. Um, I'll start sharing the stuff right now. Alright. We've got sharing coming up.

So yeah, it should be a really fun day today.

Great. Well, let's get cracking. Okay. This is the first presentation of the day: Data DevOps, the Heart of Shifting Left. Let's get into it. So maybe a little bit about me before we start. If you don't know me, my name is Chad Sanderson. I am the CEO of Gable.ai. You might have seen my content on LinkedIn every now and then, or maybe on Substack.

I've been in the data infrastructure space for a pretty long time. I worked on the AI platform team at Microsoft, I've led data teams at large enterprise companies like Sephora and Subway, and I've worked at a startup. And now I'm running my own company, because I do think there are still a lot of unsolved problems in data infrastructure and data management.

Like Mark said, we are working on the O'Reilly book on data contracts, which really should be the definitive, authoritative guide on how you implement data contracts in a production environment. And there have been a lot of lessons as we've (a) written that book and (b) tried to help companies navigate the political, cultural, process, and technical challenges that come along with a data contract implementation.

So if we could go ahead and move to the next slide. Unless you've been living under a rock for the past two or three years, you've probably observed the potential impact that AI is going to have on our space. There's the broader landscape of technology, but then there's the data industry, and there have been some claims that are very wild and outlandish, and some that are a little bit more practical.

But I think one way or the other, hopefully everyone can see that data is an absolutely instrumental part of any artificial intelligence initiative. The two are tightly coupled. You can't do one without the other.

The challenge that I've seen, working with everything from large-scale enterprises in the financial services space to earlier-stage startups, is that doing a POC for AI is one thing. Having something run in a test environment that's not actually productionized is one thing. But when you start trying to scale the application of AI to real production-grade use cases, it has to work consistently. There needs to be quality. You need to enable many parts of the business to explore and utilize AI together.

You start running into serious data management problems. And it's the same category of data management problems that most teams for the last 20 or 30 years have been dealing with in the structured data space with our more traditional data pipelines. So my belief is it's going to be exceptionally hard for AI to be adopted at scale for the most important use cases in the world, unless you have a strong framework for data management.

And I think data management today is broken. Let's move along. Okay. So why is data management broken? Maybe before I start, I should briefly describe what I mean by data management. I think data management is the broader category that includes data governance, compliance, data security, and data quality.

Really, anything that you're doing with data or with metadata beyond its actual analysis, movement, or compute falls into the category that I would call data management. And there are three core problems, or three categories of problems, that we see come up over and over again.

One is that there are just outages all the time. Things are changing in upstream systems and breaking downstream systems. Things are changing in downstream systems and breaking pipelines. Pipelines are changing. And because of all of this change, your consumers who are building high-value data products off of your data are oftentimes broken, and end up not being able to trust the data that they're leveraging.

And a lack of trust means you're spending a lot more time on validation and testing, and you need people to do that. So you see these teams get very, very large, spending 75 or 80 percent of their time or more on testing and validation instead of creating value for the business. The second big problem with data management is missing requirements.

Meaning you have new features that are being written, new AI systems that are being built, and the data teams don't learn until later what those systems actually need from a data perspective in order for the business teams to evaluate them and understand if they were a successful implementation or not.

And this creates a tremendous amount of churn, as you might imagine. If you're shipping a feature and that feature doesn't contain the data that the product team actually needs to evaluate it properly, then you have to wait until the software engineer who wrote the feature has enough time to go back and add in the metrics and add the instrumentation.

And that just extends the amount of time it takes to understand if that feature is actually working or not. And then the final point is the complexity of our software ecosystem and how that's evolved over time. Data is the communication mechanism between technologies. APIs are moving data around.

Data pipelines are obviously moving data around, um, events and logs. All of these things are data, and this is sort of the connective tissue between different services and systems within your business. And over time, teams kind of just build whatever they want, especially in a microservice based environment, you get a tremendous amount of spaghetti code and that spaghetti code affects the data.

So it's very difficult to understand where a particular piece of data is going, who is consuming it, and why it is valuable. This is the primary reason why migrations take an exceptionally long period of time. It's not because building new services is all that difficult. It's that you might have a thousand different use cases for that data.

You need to track down every single one of those use cases and figure out how to onboard them to whatever the new system is. This is a very, very difficult process, and because it's so hard, it leads to a lot of initiatives simply not happening at all. Let's go on to the next slide. Okay. So what are the root causes of some of this pain?

Well, a lot of it frankly starts and ends with what's happening in the upstream systems. If your software engineering teams are not treating data as a part of the product, with the same level of quality and care and compliance and governance, then you're gonna start to see things suddenly change for no reason, suddenly break for no reason. You're not really gonna understand where the data is going. You're not gonna have clear unit testing and integration testing in place like we see in software. And then downstream systems are gonna be built on top of that rapidly changing environment to account for the fact that you don't have this quality coming from the source.

And that's a very, very difficult thing. So there's a whole litany of problems that I think start with the upstream systems, and are really rooted in engineers not having a complete stake in how that data is used outside of their transactional or operational systems. The ultimate result of all of this is that quality is shifted to these siloed data teams, the teams that are thinking about data management every day, which might be governance, or data compliance, or data security, or the data platform teams.

And unfortunately, these teams simply do not have the power to exert quality across the entire system the way that they need. And that leads to an enormous amount of ad hoc and hacky solutions. Next slide. So a lot of folks might think that when you add AI into this equation, you rub the magic lamp, a genie comes out, and it makes all the issues go away.

This is not true. In fact, AI is going to make these types of problems much worse. Let's go on to the next slide. And the reason why it makes things much worse, frankly, is because AI doesn't understand any system outside of the local changes it's making. So if you point an AI at a repo and you tell it to go do something, to write some code, it doesn't know the implications of how changes to the data affect others in the data ecosystem.

So you've got a lot of agents rapidly changing code all the time and not understanding contextually what they're doing. And that causes massive negative impact, both on the AI side and the non-AI side, where you've got multimillion-dollar outages, you've got migrations that take multiple years, regulatory and compliance issues.

You've got these AI initiatives, like I said in the beginning, that are just fundamentally unscalable. You've got old stuff. There are so many businesses that have servers and databases sitting on-prem, in COBOL or something like that, that have just been there for years, and people want them to go away.

But if you don't know where that data is going, and you don't know if critical systems depend on it, it's very, very hard to deprecate. And I think finally, so many data teams are in the business of managing cost, right? That's actually their job: how do I keep costs low? Whereas if you look at most software engineering organizations, they're not thinking that way.

They're thinking about value. If I can go out and build a new service that makes my company a million dollars, or $10 million, or a hundred million dollars, then as long as cost is kept to some marginal fraction of that, it's fine. And these are all problems I think most data teams face today.

Let's move on. Okay, so finally we get to the topic of the day: shifting left. What does it mean? The simple definition is that in data management, shifting left means proactively enforcing quality, governance, and compliance earlier in the development lifecycle. So when software engineers are writing code, when they're building out their systems, they should be thinking about data management at that point, instead of reactively, once problems in the data already exist.

Next slide. So if you're a software engineer, or you've been around software engineering for some amount of time, you've probably heard of the 1-10-100 problem. What this basically means is that if you encounter a quality issue and you detect it at design time, so before any code is actually shipped into production, it costs about a dollar to fix.

It's very cheap. If you catch it when you've shipped it into production, but it hasn't affected a customer, it's about $10 to roll back. And if you catch it after it's already impacted a customer, it's a hundred dollars; that's obviously when it's the most costly. So the closer to production you're catching quality issues, the more expensive they are to remediate.
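To put rough numbers on that, here's a toy back-of-the-envelope calculation. The $1/$10/$100 per-issue costs come from the rule itself; the issue distributions are purely hypothetical, chosen to illustrate the point:

```python
# Toy illustration of the 1-10-100 rule: the same 100 quality issues cost
# very different amounts depending on the stage at which they are caught.
COST_PER_ISSUE = {"design": 1, "runtime": 10, "consumption": 100}  # dollars

def total_cost(caught_at: dict) -> int:
    """caught_at maps stage -> number of issues detected at that stage."""
    return sum(COST_PER_ISSUE[stage] * n for stage, n in caught_at.items())

# Hypothetical distributions: a reactive team vs. a shifted-left team.
reactive = {"design": 5, "runtime": 15, "consumption": 80}
shifted_left = {"design": 80, "runtime": 15, "consumption": 5}

print(total_cost(reactive))      # 5 + 150 + 8000 = 8155
print(total_cost(shifted_left))  # 80 + 150 + 500 = 730
```

Same issues, roughly an order of magnitude difference in remediation cost, which is the whole argument for moving detection earlier.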

Let's go on to the next slide. So what does that mean? How does that translate to the day-to-day work that most data teams are doing? Well, if we were to frame those three categories of quality resolution: the $1 fix is a design-time fix, meaning you're catching issues in CI/CD or even before that.

The $10 cost to fix is at runtime, meaning whatever code is causing the issue has been shipped, but the data problem hasn't actually affected a customer yet; you detected it at runtime. And then finally, you're detecting it at consumption time. That is when your customer is looking at a dashboard.

When they see that their AI model is making incorrect predictions, that's when they identify that there's a problem, and that is the most expensive. And today, in the data world, most of our resourcing, in fact I would say close to 100% of our resourcing, goes into consumption-time data management. We're looking at the contents of the data itself after it's been produced, after it already lands in a downstream system where someone is being impacted.

Let's go ahead and move on to the next slide. When you start to shift left, you completely invert the way that quality is managed, and this is how I think data quality will be managed over the next five to ten years, as the use cases of data grow significantly more valuable. Soon it's not going to be good enough to catch 100% of the issues reactively.

When you have an AI system that's generating hundreds of millions of dollars for a company, it's not gonna be good enough to respond to issues that late. And so inverting the system is going to mean flipping the majority of our focus to that preventative, proactive, design-time quality resolution. A percentage of that focus will shift into runtime resolution.

And then we'll continue to have, I think, a large portion in consumption time. And if I were to reflect this back to the software engineering world, GitHub and Datadog, I think, are really good mirrors of this. You need both solutions.

But in the data world today, it's sort of like running Datadog with no GitHub, right? Think of how chaotic that would be: if every potential quality issue, every potential failing test, was caught in production after it's already been deployed, it would be extremely expensive, and it would put a huge amount of pressure on the software engineering teams to resolve these issues very, very quickly.

So I think that we need GitHub, we need Datadog, we need these systems to work harmoniously together, and that is what requires a shift left. Let's go on to the next slide. All right, so how do you do that? What are the new terms and the new technologies that we should start adopting to make this transition possible?

Well, the framing that we use here at Gable is something called Data DevOps. And Data DevOps is all about (a) design-time data quality and data management, (b) shifting towards the code, and (c) using the tools and workflows and systems that software engineers are already familiar with. So if you can do those three things consistently, effectively, and easily, you can start to push quality towards a workflow that any engineer in your company is probably gonna respond really well to, and could be very excited about, because it solves a lot of their problems as well.

Let's move on to the next slide. So this is a big chart, and I'm not gonna go over this chart. If you're listening to this presentation now, you're gonna get the recording, so feel free to pause it and review it then. But this idea of different categories of data management is really essential to shifting left, so I just wanna explain this slide real quick. On the left-hand side, our first column, is something that I call a pattern.

Now, what a lot of companies do is assign themselves a label: we're a monitoring company, or we're a data cataloging company, or we do data lineage. All of these, to me, are horizontal, cross-cutting patterns that are going to apply to different personas of users. And the actual application and technical implementation of each of those patterns is gonna be different depending on which user's needs you are trying to serve.

So the first big persona is the business users, right? A business user is maybe a product manager, maybe someone who wants to understand how one business process ties to another business process. And if you looked at more of your legacy data catalogs, this is really what they were designed to do: enable the business user to understand how their business operated, what the metadata about their business domains is, and so on and so forth.

The second category is your observability category. That's where we're focused on the contents of the data itself. This is also really, really important. You need to be able to understand: what are the records, what are the tables, what are the files, what are your metrics, what are your dashboards?

What's sort of our physical representation of the data? And there are all types of interesting questions that you're gonna ask about that. And then on the left-hand side, there is your Data DevOps. That's your code, and that appeals to your software engineering organization. A lot of times teams will come to me and say, hey, I've rolled out my catalog, or I've rolled out my monitoring tool, or whatever it might be.

And it works really great for my data team, but I can't get my software engineers to use this stuff. And the primary reason is that software engineers don't care about data. That's not how they think. They are not looking at rows and tables and columns. They're thinking about the code that they are writing that is populating their transactional database.

And so if you were to ask them, hey, how can I get you to care about the data that you're generating, they would probably give you something like what I'm showing you in this list here, right? Where is my code that actually produces data? Where is that code being pushed to? Where is it being received?

What is the ingress point? Who is the owner of that code in GitHub? How has it changed over time? If something breaks, is it easy for me to root-cause it? And so on and so forth. So this is the framing that I would recommend when you think about shifting left, right? It's really starting from that right-hand side, which is the business user, and moving gradually to the left.

And that gives us really holistic data management. Let's move on to the next slide. Okay. So this is not just a theoretical thing that Chad is making up and talking about for the fifty-eleventh time. This is something that companies are actually doing in the world. It is working, and they're seeing some significant value from it.

Gable in particular has done a lot of work with Glassdoor and with another company on this list called Grab, implementing shifting left towards the code and seeing software engineers actually adopt data quality, data management, and other forms of data ownership. But by no means are those the only companies that are moving in that direction.

So there is a precedent for this, and you're gonna hear from some of those great companies today. Next slide. All right. So what is the biggest challenge in shifting left? Well, I can tell you it's not a technology challenge. It is an adoption challenge. How do you actually get the engineering team, the people who are sitting on the left, to do all of this work?

A lot of companies that we talk to might say, Hey, Chad, look, the people theoretically agree with you. They conceptually agree with you, but we don't know how to actually make it happen. How do you help every single engineer in a business simultaneously take ownership of their data in a way that makes sense for them and that drives the correct incentives?

Let's go to the next slide here. There are a lot of hard problems that you need to solve around adoption. This is not an easy issue. The first big problem is the incentive. If I'm a software engineer, why should I take ownership of my data in the way that data teams want? It's a lot of extra effort, it's a lot of extra time.

I don't really manage data. I don't really think about other teams like that. Why should I actually do this? Why does it matter to me? Why does it matter to my team? The second big problem is, let's say that you do implement some form of a data contract. Maybe it's a document; maybe it's a YAML spec somewhere.

How do you keep it up to date as the business is changing all the time? Because the right side might change: you might get new business requirements, you might have new data quality rules that are important. But the left side might also change: the software is constantly changing, and keeping these contracts up to date is a very difficult and manual task.
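For illustration only, a contract like the YAML spec mentioned above often boils down to a small set of field expectations that can be checked mechanically. Here's a minimal sketch in Python; the dataset name, field names, and types are all invented for the example, and a real contract would carry much more (owners, SLAs, quality rules):

```python
# A hypothetical data contract, reduced to its core: field expectations.
# In practice this might live as a YAML file versioned next to the code;
# it's a plain dict here to keep the example dependency-free.
contract = {
    "dataset": "orders",  # invented name for the example
    "fields": {
        "order_id": "string",
        "amount_cents": "integer",
    },
}

# What the producing code currently emits. In a real system this would be
# extracted from the producer's schema or code, not hard-coded.
produced_schema = {"order_id": "string", "amount_cents": "string"}

def violations(contract: dict, produced: dict) -> list:
    """Return human-readable mismatches between contract and producer."""
    problems = []
    for field, expected in contract["fields"].items():
        actual = produced.get(field)
        if actual is None:
            problems.append(f"missing field: {field}")
        elif actual != expected:
            problems.append(f"{field}: expected {expected}, got {actual}")
    return problems

print(violations(contract, produced_schema))
# ['amount_cents: expected integer, got string']
```

The staleness problem described above is exactly why a check like this has to run automatically on every change: a contract that's only reviewed by hand drifts as soon as either side moves.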

And anything that's manual and difficult reduces the likelihood of adoption. Dependency management and change communication are also big issues. These are things that software engineers care about. If I'm gonna shift left, I need to know: when I make changes to things, who am I impacting? How am I impacting them?

Who do I need to communicate to? And how difficult is that process of communication? Then there's alert fatigue, and I think anybody who's rolled out a data quality tool is probably very familiar with alert fatigue. Alert fatigue basically means that if you trip too many alarms, if you send too many alerts, it all just becomes noise.

And the people that you want to care, and that you want to take action, stop paying attention. This is sort of a worst-case scenario in a shift-left style environment, because if our software engineering friends stop paying attention and stop doing what we want, the whole concept falls apart.

And then finally, you have differences in language, and I do mean that somewhat literally. If you've got an engineer who's writing code in, I don't know, Java, and that data is being pushed into S3, and from there it's hitting Snowflake, it's hitting Databricks, then you've got some Spark, you've got some SQL, and you've got an analyst or a data scientist doing transformations in languages that they understand.

How do these two groups of people communicate with each other about what should happen to the data, given that the underlying systems are very different? So there's a whole set of challenging adoption problems here. Next slide please. So what is Gable? What are we focused on? Foundationally, we are a Data DevOps platform, and our goal is to make shifting left easy.

We are the adoption platform. We are not bringing in some brand new technology that says, hey, you roll out Gable and you can replace all these people in your organization. The question is: how do you reduce the barrier to entry for every stakeholder in the data value chain, so that shifting left becomes something that is easy and obvious?

Next slide. So here's how we frame it: we sit in the Data DevOps category. We focus on code, not so much data, not so much business context. Those are absolutely critical systems, and they need to work together to form that holistic data value chain that I mentioned before. But we focus on the code.

That's our area of expertise. Next slide, please. And how do we do that? Well, Gable borrows techniques from the security space. Security has spent the last 30 years figuring out how to understand code and how to extract metadata from code for security and privacy reasons. So where is my sensitive data flowing from and to? Where could there be potential issues of hacking? Where could fraud possibly be present? There's a lot of literature that's been built up around this. Gable is trying to focus all of that literature towards data management. So how do you take those same techniques of evaluating code, extracting information, and understanding where data is flowing across the system, figure out, when something changes, what all the potential breakpoints are, both within a single repo, across repos, and across technologies, and then provide that context to both sides?

Next slide. So here's a little bit of how that actually happens. The first real core of what Gable is doing we call contracts as code. So, data contracts: we're writing a book about it; we think it's super important. There are different types of contracts. There are contracts that focus on the data, and there are contracts that focus on the code that produces the data.

This is where Gable focuses. So it's taking that contract, taking the expectations of what the data should be, and translating it into something that an engineer can actually maintain and own and run as a unit test and as an integration test. Next slide. And so here's a practical example of what that looks like, how the Gable system is operating under the hood, what we're doing.

On the right-hand side, you can imagine we've got a contract, and a contract sets the expectations for the data: I always expect to see these two fields with this particular data type, and if I don't have that, it's gonna be a really big problem for anyone downstream that's consuming that data.

What Gable would do is say, okay, let's take a look at the code where that data is being produced. When a PR is being opened, we'll do an evaluation of that code. We know that those values with those data types are being generated. Are those being changed in some way? If they are, we can extract that metadata to do a comparison.

And once we do the comparison, we can determine if the data contract is being violated or not. And if the data contract is being violated, we effectively can run as a unit test or an integration test and block that code change from being deployed. So the developers are getting feedback in the language that they understand, in a context that they understand, and you're able to iteratively build data literacy into the system without having to do much work, right?
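As a sketch of that pattern, a contract check that runs like any other unit test and can therefore block a merge, here's what it could look like in Python. This is an illustration of the idea, not Gable's actual implementation: the contract is a hypothetical field-to-type mapping, and the schema-extraction step is stubbed out, where a real tool would infer it from the code in the pull request:

```python
import unittest

# Hypothetical contract: field -> expected type, as a consumer declared it.
CONTRACT = {"user_id": "string", "signup_ts": "timestamp"}

def extract_schema_from_pr() -> dict:
    """Stand-in for static analysis of the changed producer code.

    Hard-coded so the example is self-contained; a real tool would derive
    this from the code under review.
    """
    return {"user_id": "string", "signup_ts": "timestamp"}

class TestDataContract(unittest.TestCase):
    def test_pr_honors_contract(self):
        produced = extract_schema_from_pr()
        for field, expected in CONTRACT.items():
            # A failure here fails CI and blocks the merge, giving the
            # engineer feedback before the change ever ships.
            self.assertEqual(produced.get(field), expected,
                             f"contract violation on field {field!r}")
```

Run with `python -m unittest` in CI; wiring the job into branch protection is what turns the contract from documentation into a gate.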

You're training your engineering team on what data is important and what they need to do to take ownership of that data. Next slide please. So we're not gonna go into the full scope of what Gable does, because it's a lot, but what I just showed you is really just the tip of the iceberg. We can do a huge amount of code evaluation and extract some really, really interesting information, like ingress and egress points.

Where does data flow into a code base? Where does it flow out of a code base? What is the path between the ingress and the egress, and how is that data transformed along the way? Is it a masking function that's being applied? Is there encryption being applied? Not only can you do this deterministically, but you can also do it probabilistically, using artificial intelligence.

So you can extract some really meaningful information about what transformation is actually happening in the semantic sense. And this gives you a huge amount of information, both from a lineage point of view, but also from a data management and data quality point of view as well. Next slide please.
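As a tiny illustration of the kind of static analysis being described (real tools go far deeper, across repos and languages), Python's own `ast` module can spot calls that read or write data in a snippet of producer code. The snippet and the call names treated as ingress and egress here are invented for the example:

```python
import ast

# Hypothetical producer code to analyze; in practice this would be read
# from the repository at PR time.
SOURCE = """
rows = db.query("SELECT * FROM users")               # ingress: data enters
cleaned = mask_email(rows)                           # transformation
s3_client.put_object(Bucket="events", Body=cleaned)  # egress: data leaves
"""

# Invented markers: which method names we treat as data entering/leaving.
INGRESS_CALLS = {"query"}
EGRESS_CALLS = {"put_object"}

def find_data_flow(source: str):
    """Walk the syntax tree and collect ingress/egress call sites."""
    ingress, egress = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in INGRESS_CALLS:
                ingress.append(node.func.attr)
            elif node.func.attr in EGRESS_CALLS:
                egress.append(node.func.attr)
    return ingress, egress

print(find_data_flow(SOURCE))  # (['query'], ['put_object'])
```

Everything on the path between the ingress and the egress, the `mask_email` call in this toy snippet, is where the transformation analysis the talk mentions would happen.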

What that ultimately gives you is what we call true end-to-end lineage. And this is outside of your data warehouse, outside of Snowflake, outside of Databricks: how does data flowing from one API flow into another API, into an S3 bucket, into a MongoDB database, and ultimately into your downstream applications?

It's impossible to extract that lineage if you're only looking at the contents of the data itself. There's no way to do it, because the data doesn't give you information about the underlying system that is producing it. So this is what we've been working so hard on at Gable for the last two years.

It works, we have customers who are using it, and it's very exciting; we finally get to talk about it publicly for the first time. Next slide please. So on top of all of this infrastructure that allows us to reason about code and to reason about data, we also have really strong data management workflows.

So once you can identify where the code in your system that is producing data lives, where that data comes from, who owns it, and what is going to happen at the moment that it changes, you can start to layer in centralized governance systems. Any time that we add a new front-end event to our product, we should do X.

The data engineering team needs to be involved. The data platform team needs to get sign-off. You need to structure your PII a certain way; you need to spin up a certain type of pipeline. All of these things can be derived if you understand what's being produced at the moment that a change occurs. Next slide.
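Those "any time X happens, do Y" rules can be pictured as a policy table keyed by change type, consulted at the moment a change is detected. The team names and steps below are illustrative assumptions, not a real policy set:

```python
# Invented governance policies: for each recognized change type, the
# sign-offs and steps that a detected change should trigger.
POLICIES = {
    "new_frontend_event": {
        "required_signoff": ["data-engineering", "data-platform"],
        "required_steps": ["register_schema", "provision_pipeline"],
    },
}

def checks_for_change(change_type: str) -> dict:
    """Look up the governance actions a given change type triggers."""
    return POLICIES.get(change_type, {"required_signoff": [], "required_steps": []})

# A detected "new front-end event" change derives its own checklist.
actions = checks_for_change("new_frontend_event")
```

The interesting part is upstream of this lookup: classifying a code change as, say, a new front-end event is exactly what the code analysis described earlier provides.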

Next is governance. So by looking at the code, you get some really interesting information. Not only do you understand what the data is and where it's going, but you also understand what type of data it is. Is it personally identifiable information? Where is that PII actually going? Is it being pushed into a system that we all agree is compliant?

Is it flowing into Google Analytics? Is it flowing into Stripe? Is it flowing into Okta? Is it flowing into Snowflake? Who should actually have access to that data? Are we treating all that PII the same way? There are so many situations where teams might have policies in place for managing sensitive customer data in one part of their stack, but lose sight of it in other parts of that stack.
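One way to picture that consistency check: given extracted lineage of which sinks each field reaches, flag any PII field that lands somewhere outside an approved list. A minimal sketch with invented field names, sink names, and approval list:

```python
# Invented policy inputs: which sinks are approved for PII, and which
# fields count as PII.
APPROVED_PII_SINKS = {"snowflake", "stripe"}
PII_FIELDS = {"email", "ssn"}

# Invented lineage extract: each field mapped to the sinks it reaches.
field_sinks = {
    "email": ["snowflake", "google_analytics"],
    "order_total": ["snowflake"],
    "ssn": ["stripe"],
}

def pii_violations(field_sinks):
    """Return (field, sink) pairs where PII reaches an unapproved system."""
    return [
        (field, sink)
        for field, sinks in field_sinks.items()
        if field in PII_FIELDS
        for sink in sinks
        if sink not in APPROVED_PII_SINKS
    ]

violations = pii_violations(field_sinks)
```

Here the check catches `email` leaking into an unapproved analytics sink while ignoring non-PII fields, which is the "policies stay consistent across the whole stack" idea in miniature.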

If you're actually able to trace how the data flows across systems, then you can ensure that those policies remain consistent. Next slide please. And then finally is guardrails. So once you have this foundation of understanding where data is going, who's creating it, and what policies you need to layer on top, then you can decide: when do I wanna jump in and block something from being deployed, if I know it's gonna break another person in my organization?

And that's impact analysis. So you are effectively figuring out, when a change occurs, what will be the inferred impact of that change across the organization? Both on downstream data teams, machine learning teams, data science teams, and finance teams, but also on other software engineers in the organization.

So one example of this, and this is very common, is you have a backend engineer that maintains an API. That API is consumed by a frontend engineer. They make a change to the API, and it breaks the front end. Right now, that's not something that most data teams would look at as a data quality problem or a part of their remit, but it absolutely is a data quality problem.
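Mechanically, impact analysis of this kind is a walk over a dependency graph from the changed asset to everything downstream of it. A toy sketch, with invented asset and consumer names:

```python
# Invented consumer graph: each asset mapped to the things that consume it,
# as extracted from code (APIs, warehouse tables, models, dashboards).
consumers = {
    "users-api.email": ["frontend.profile-page", "warehouse.users"],
    "warehouse.users": ["ml.churn-model", "finance.report"],
}

def impacted(changed, graph):
    """Depth-first walk collecting all transitive downstream consumers."""
    seen = set()
    stack = [changed]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# The blast radius of changing one API field spans both the front end
# and the data side of the house, which is the talk's point.
blast_radius = impacted("users-api.email", consumers)
```

Note that the frontend page and the churn model land in the same blast radius: the schema change that breaks the front end is the same event that breaks the downstream data consumers.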

It's a schema change. And when you start to think about quality and governance in that way, it becomes very clear that this is not just a data team problem. This is a business problem, and it's something that data teams can own and use to drive governance across the entire organization in a more holistic way.

Next slide. So what's Gable all about? Well, Gable is all about allowing developers to manage their data like their code base, inside the tools that they already use. So that's the story of Gable. That's what we've been building for two years. I really strongly believe that shift left is going to be a necessity in the next five to 10 years, especially as this AI era comes to pass.

And I think the teams that succeed with data are not gonna be the ones trying to apply more centralized data management methodologies to systems that are becoming increasingly federated and fragmented. Next slide. That's my time. Thank you everybody. I hope you enjoy the rest of the conference.

We've got some seriously awesome speakers and panels lined up for the rest of the day, so I can't wait to see you in a bit. Don't go anywhere yet, Chad, we got some questions coming through here and I would love to get your take on some of 'em. Woo, first of all, chat, great questions.

I gotta jump into the first one. Do you see much of a role for code gen solutions as a way to drive adoption of data contracts? We've trialed a couple of code gen solutions and not been that impressed, but the data producers we worked with seemed pretty enthusiastic about 'em.

Did you hear me on that one? I just realized Chad can't hear me, unfortunately. Let's try that again. I'm gonna give Chad the headphones. No, there we go. Chad. Yeah, I was like, all right, is someone gonna ask me the questions? When are those coming, man? Sorry, go ahead.

You're keeping a straight face and I was just rattling off this question. Oh, Marcus, it's not coming through here. Lovely. Well, everybody that is in the chat hopefully is laughing with me right now, not at me. Let me know when you're here, Chad. The audio's not working on that one, one moment.

Well, while we're doing this technical difficulties, hold on, let me see the headphones real quick.

You guys mute yourselves. Well, Demetrios, you look good.

Technical difficulties. Play us a song, man. We're having technical difficulties. Technical difficulties. Why did we think it was gonna be so easy? Okay, there we go. Now I'm hearing sound. Okay. Alright, we're back. So let's get rocking and rolling on this first question.

Do you see much of a role for code gen solutions as a way to drive adoption of data contracts? We've trialed a couple of code gen solutions and not been that impressed, but the data producers we work with seem pretty enthusiastic. Yeah. I think code gen for data contracts is gonna be a bit hard until there is more foundational infrastructure in place.

And the reason why is that I think creating the contract is pretty easy, frankly, right? At least in my opinion, it's just a file, and you can derive that file just by looking at whatever the event payload is or what the database structure is. That makes a lot of sense.

It doesn't take too long to actually put together. So I understand that it could save a little bit of time from that perspective. But there is a difficult thing in a data contract, and if you talk to QA teams or test automation teams, they'll tell you the same thing, because they work with contracts as well, service contracts.
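To illustrate the "derive the file from the event payload" point, here is a minimal sketch that infers a flat field-to-type contract draft from one sample JSON event. The output shape is an invented stand-in, not any standard data contract format:

```python
import json

# Map Python value types to invented contract type names.
PY_TO_CONTRACT = {str: "string", int: "integer", float: "number", bool: "boolean"}

def draft_contract(event_json: str) -> dict:
    """Infer a flat {field: type} contract draft from one sample event."""
    payload = json.loads(event_json)
    return {
        field: PY_TO_CONTRACT.get(type(value), "unknown")
        for field, value in payload.items()
    }

# A hypothetical sample event, the kind of payload a code scan might surface.
sample_event = '{"user_id": 42, "email": "a@b.com", "opted_in": true}'
contract = draft_contract(sample_event)
```

This is the easy half the talk describes: the mechanical derivation. The hard half, negotiating which of these fields the producer actually commits to for which consumers, is exactly what the sketch doesn't capture.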

The hard thing is figuring out what the contract that makes sense for the producer and the consumer is, right? So the consumer has to actually be able to communicate to a producer: here are the things that I'm using in my contract, and here's why those things are important to me. And then the producer has to decide which of those requests are actually pushed through to the contract itself.

So it's not something I think that you can fully automate without there being a human in the loop somewhere. Mm-hmm. I think if you do have that foundational infrastructure, though, so if you know, hey, where is my data coming from? Where's the egress point of data from my system?

Where's my data landing? What are all the various ingress points? Then you can start to make some assumptions of, well, here's what the contract should be, based on how people are actually using the data. And that streamlines that approval process. But even then, you know, it's hard to take people out of the loop there.

Hmm. I'm gonna keep moving because Peter has an awesome question coming through regarding language differences. I am interested in how we can apply data DevOps to platforms like Salesforce. We have several highly configurable SaaS platforms used by different departments, custom applications where we have source-level control, and analytical systems that pull data from everywhere.

I need to harden and monitor the connections between all of these. Yeah, that's a great question. So there's a few ways I've thought about doing this in the past for Salesforce specifically. Salesforce actually does have a CI/CD integration. You have to pay for it, unfortunately. But it basically does expose any time someone is making a change within Salesforce's underlying database, whether it's adding new fields or piping data in from somewhere; it goes through a CI/CD process.

There are ways that you can intercept that, and you can start to make sense of what the code change happening in the Salesforce database workflow is actually doing, and whether it corresponds to what the downstream consumers actually need. There are other ways that you can piecemeal parts of this.

So, for example, let's say that you're using some event SDK and you're pushing data into Salesforce. Now, if you're doing relatively sophisticated data DevOps, you can identify that this SDK exists within the code base, that it is using the Salesforce SDK, and you could figure out what that is.

And here is what data is being pushed into that system. And you can then start to reason about what that data is, right? Like, where does it come from? Is it coming from a mobile application? Is it coming from a survey that a user has filled out on a website? Is it coming through some other mechanism?

And you can at least start to get an understanding of where all the data flowing into Salesforce is coming from. And then you can start to apply policies at that point. And then the last thing that I would say is, you can have a reactive component to managing Salesforce, which is not quite in the DevOps space.

But you can basically hit the Salesforce API. I think there's probably AI-related things that you could do as well, where you're just looking at the Salesforce interface, like you're looking at the actual UI, which I think in some cases you're gonna have to do when you have these crazy custom fields.

And the only way that you can figure out what they are is by looking at them and gaining some human-level context. And anytime you detect some change happening within Salesforce, you basically say, what is going on? I'm gonna use this event to understand the scope of the change and compare it to what the rest of the organization expects.

What's a little bit harder: I think the primary value of data DevOps is that you're able to communicate to the engineer in the workflow they understand, right? Like, a PR or an MR is a really critical change event, and you can insert yourself before that change event actually happens to communicate useful information.

Salesforce doesn't quite have an equivalent to that. That's why that CI/CD workflow is your best bet. Maybe there's something that gets developed in the future where, within the UI, you're telling the Salesforce developers or you're telling the salespeople that they're about to do something they probably shouldn't.

But I haven't seen that implementation quite yet. Well said. I gotta give a shout out to Gordon for asking this next question, because it is the first time he has used Gable as a verb: how mature does the organization have to be on data management capabilities to begin Gabling? I love that. Kidding. I wanna use that in our marketing materials now.

Yeah, so I think it's going to depend on what it is that the company wants to do with Gable, and on the tech stack. So like I said before, Gable's primary focus is on code. That's our sweet spot. And there are some companies that are using Gable that have pretty legacy code bases, like they will run us on Java, and they've got huge financial applications that have existed for a really long time.

Scala is another one. And what they just wanna understand is what's happening within that code base. Where is data going? Where is it moving around? And then gradually you start to shift left with the data contracts, so you're layering on the ownership and the accountability in pieces.

There are other companies that are focusing on more modern tech implementations. So they may have an iOS app, they may have an Android app, and they wanna run Gable there to collect, like, hey, what are all the events that we're publishing from our frontend systems, and do sort of a diffing of that. Like, hey, here's all the data that we're actually producing across these two different applications.

Like, Hey, here's all the data that we're actually producing across these two different applications. Um, let's sort of start there and then we can begin, you know, sort of adding in a change management process later. So I, I would say as long as the focus is code, uh, even if it's Legacy Code Gable does, uh, Gable does a really good job.

I would actually say that the more modern you are, like if you're using super modern technologies and tools and stuff like that, we probably don't do as good of a job, because we were built for more of the older companies that are trying to handle their spaghetti code. Hmm. All right.

Well, there are some incredible questions coming through here, and my hardest job is now having to choose the next ones, because we don't have infinite time. But I wanna pick out some really good ones. And this next one from Rafa is: what are the kinds of data quality issues that we're aiming at mitigating?

Is it more focused on semantics and the birth of the data, and the integration of similar data from distinct systems? Or is the focus somewhere else? Yeah, that's a great question. So I think it's a bit of all of the above. We actually have a PDF, which, you know, I'm sort of looking to Marcus on this, on how we can send it out, on whether we can send it out.

It contains a list of about a hundred data constraints that you could put into a data contract, and each one of those constraints you can map to some potential quality issue that could happen. And depending on the category of business you're in and the types of data issues that are most likely to occur, the gravity is going to be in a different category of problems.
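To make the constraint idea concrete, a contract constraint can be thought of as a named predicate over a field value, with a violation mapping to a specific quality issue. The two constraints below are invented examples, not entries from the PDF mentioned above:

```python
# Invented contract constraints: each is a named predicate over a value.
CONSTRAINTS = {
    "not_null": lambda v: v is not None,
    "non_negative": lambda v: v is not None and v >= 0,
}

def violated(field_value, constraint_names):
    """Return which of the named constraints a value fails."""
    return [name for name in constraint_names if not CONSTRAINTS[name](field_value)]

# A negative value passes not_null but fails non_negative.
failures = violated(-5, ["not_null", "non_negative"])
```

In practice each constraint name would map back to the category of quality issue it guards against, which is how a contract turns a list of checks into an explanation of what went wrong.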

So for example, there are some companies I've talked to that don't really care about schema changes; they've got that under control. Business logic changes, yeah, that's kind of important, but what's really, really important to them is how they handle GDPR-regulated data, PII data, things like that.

And the issue is: I have a process, so when someone is building up a new service that is pulling in sensitive customer data, I have a whole process and an approval workflow for that. But then anytime that gets spun up into an API, it could be consumed by anyone in the company, and that doesn't need to go through that process.

And that's really scary. So that is a type of quality issue that you could solve through a product like this. I think schema changes are obviously a really big one, both changes that are happening internally. And when I say schema change, that doesn't necessarily mean, hey, I'm removing a column; that doesn't usually occur.

It's more like there are new columns being added that contain updated versions of data that people probably care about, and that needs to be communicated to the downstream teams who depend on that data. You could also describe data being defined incorrectly at the source, in order to accomplish some business goal, as a data quality issue.
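The additive schema change being described can be surfaced with a simple diff between schema versions, where "added" is the interesting signal to broadcast downstream. A minimal sketch with invented column names:

```python
def schema_diff(old, new):
    """Report added and removed columns between two schema versions."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
    }

# Hypothetical change: a new email_v2 column appears alongside the old one,
# the "updated version of data people care about" case from the talk.
diff = schema_diff(
    old={"user_id", "email"},
    new={"user_id", "email", "email_v2"},
)
```

Nothing was removed here, so a naive breakage check would stay silent; the value is in noticing the addition and telling the teams still reading the old column.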

So I would say there's a really, really large category, but it all falls into what I would describe as change. Where data DevOps does a really great job is where something in the system has changed and you can attribute a problem to a change within the system.

Now, what data DevOps can't do is, like, if there's something that happens out in the world, right? Like, if there's a presidential election and the way that people use the app has changed in some way, that's not within the code base or within your own engineering teams, and that's not something data DevOps can handle.

But if you're looking at changes within code that people might make that affect downstream systems, data DevOps is a really great solution. Alright, Chad, I got one more for you. And this is super hard 'cause there's so many great questions coming through, but I know you are gonna be around, so maybe you can answer them in the chat.