Talk Description

Good Data, not Big Data, is becoming more important in today's ecosystem. Machine learning models rely on good-quality data to make training more efficient and effective. We have traditionally applied data quality checks and balances in a manual, centralized way, putting a lot of onus on our customers. Shifting data quality left brings the checks closer to where data is created, preventing bad data from flowing downstream. Auto-detecting, recommending, and auto-enforcing data quality rules will also make our customers' jobs easier, while creating a more mature and robust data ecosystem.

Additional Shift Left Data Conference Talks

Shifting Left with Data DevOps (recording link)

  • Chad Sanderson - Co-Founder & CEO - Gable.ai

Shifting From Reactive to Proactive at Glassdoor (recording link)

  • Zakariah Siyaji - Engineering Manager - Glassdoor

Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)

  • Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain

Panel: State of the Data And AI Market (recording link)

  • Apoorva Pandhi - Managing Director - Zetta Venture Partners
  • Matt Turck - Managing Director - FirstMark
  • Chris Riccomini - General Partner - Materialized View Capital
  • Chad Sanderson (Moderator)

Wayfair’s Multi-year Data Mesh Journey (recording link)

  • Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
  • Piyush Tiwari - Senior Manager of Engineering - Wayfair

Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)

  • Sarah McKenna - CEO - Sequentum

Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)

  • Barr Moses - Co-Founder & CEO - Monte Carlo
  • Tristan Handy - CEO & Founder - dbt Labs
  • Prukalpa Sankar - Co-Founder & CEO - Atlan
  • Chad Sanderson (Moderator)

Shift Left with Apache Iceberg Data Products to Power AI (recording link)

  • Andrew Madson - Founder - Insights x Design

The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)

  • Mark Freeman - Tech Lead - Gable.ai

Building a Scalable Data Foundation in Health Tech (recording link)

  • Anna Swigart - Director, Data Engineering - Helix

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)

  • Abhi Ghosh - Head of Data Observability - Capital One

Panel: How AI Is Shifting Data Infrastructure Left (recording link)

  • Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
  • Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
  • Carly Taylor - Field CTO, Gaming - Databricks
  • Chad Sanderson (Moderator)

Transcript

*Note: Video transcribed via AI voice-to-text; there may be inconsistencies.

" Abhi, what's going on man? Yeah, it has been fascinating. I've been listening to all this great speakers talking about different things. Seems, you know, very similar to some of the things we are working on. So I'm really excited to talk about what we are doing things in the banking industry. And I just talked about, uh, things they're doing in a very regulated industry like healthcare.

Mm-hmm. And banking is along very similar lines. It's very regulated, so there are certain things that you just cannot do, and you have to be very careful with your data. So, well, you know what I'm going to talk about. I'm really excited to be here. I have been working with Chad and Gable for a while now.

So thank you so much for inviting me to this conference; I'm really excited to be here. Of course, man. We're happy to have you here. And so feel free to share your screen. Sure. Thank you. To give a brief introduction to myself, I'm Abhi. I lead data observability and the external data pipeline at Capital One. "What's in your wallet?"

I think everyone has seen that, or heard it, in multiple places. I've been at Capital One for 13-plus years now. Really excited to be in this space. I'm a software engineer by trade and have solved a lot of software engineering problems at Capital One, but over the last three or four years I have been focused largely on data and how we can make our data ecosystem mature.

So with that, let me jump right into the topic, and I think it's very important. I've heard multiple speakers talk about data quality and data observability. So I would love to first just define: what is data quality? Because without that, we can't really cover the topic.

So at Capital One, and I think at any big company, as I'm hearing throughout this conference, trusting your data is core. To deliver real-time, automated, and intelligent experiences, you need to trust your data. And to me, here are a few key things: data has to be complete and timely

the moment it is published, data has to be accurate and consistent for machine learning, and data should be valid and unique to support automated decisions. Now, all these things lead to the question: is the data we are using fit for its purpose? We heard Andrew talk about how we are shifting left with Iceberg data products in AI.

We heard the panel with Barr and Tristan talking about data observability shifting left in the data ecosystem. So this is in the same vein, and we just had Anna talk about something very similar in a regulated industry. That being said, how do we get there? That leads us into data observability, and these are some of the common pillars of data observability.

One is freshness. Like I said, I lead not only data observability but also the external data pipeline, and that is about looking into data coming into Capital One. So these are the five main things we want to worry about. Data coming through, let's say, an external vendor: is it arriving on time?

Is the volume accurate? Say we were receiving terabytes of data and all of a sudden we start receiving gigabytes, or gigabytes become megabytes, or all of a sudden something increases. How are we keeping track of it? The schema that we have agreed upon: is that still valid? And quality of the data, which is one of the topics of this talk: is it accurate or not?

And also lineage: do we understand how data quality issues occur and what their impact is? So how does data quality impact us? Here is a good representation of how it impacts different personas. You could be a product manager, a data scientist, a developer, or a data analyst.
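The first pillars described above can be sketched as a small set of checks on an incoming vendor feed. This is a minimal illustration with hypothetical names and thresholds, not Capital One's actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class FeedDelivery:
    arrived_at: datetime
    row_count: int
    schema: dict  # column name -> declared type

def check_delivery(delivery, expected_schema, history,
                   sla_hours=24, volume_tolerance=0.5):
    """Run freshness, volume, and schema checks on one vendor delivery.

    `history` is a list of prior FeedDelivery objects for this feed.
    Returns a list of (pillar, message) findings; empty means all passed.
    """
    findings = []
    if history:
        # Freshness: did this file arrive within the SLA of the last one?
        gap = delivery.arrived_at - max(d.arrived_at for d in history)
        if gap > timedelta(hours=sla_hours):
            findings.append(("freshness", f"arrived {gap} after previous delivery"))
        # Volume: compare row count against the historical average.
        avg = mean(d.row_count for d in history)
        if abs(delivery.row_count - avg) > volume_tolerance * avg:
            findings.append(("volume",
                             f"row count {delivery.row_count} deviates from avg {avg:.0f}"))
    # Schema: every agreed column must be present with the agreed type.
    for col, typ in expected_schema.items():
        if delivery.schema.get(col) != typ:
            findings.append(("schema",
                             f"column {col!r}: expected {typ}, got {delivery.schema.get(col)}"))
    return findings
```

In practice each finding would feed a monitor or alert rather than a return value, but the shape of the checks is the same.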

And to me, everyone has a stake in this game: the maturity of the data ecosystem. Traditionally, I heard some talks around enterprise versus non-enterprise, and I think what enterprise meant was something pretty big in scale and very distributed, where the company wasn't born with the mindset that

data is a first-class citizen and we are going to build everything around data. Capital One was definitely an information-based service, and data has always been primary, but not to the extent that we think about it today. So depending on the persona, it can impact you in different ways.

So now let's look at shift left. A lot of speakers before me have talked about shift left, so I don't think I need to explain what it is; everyone understands that part. But let's look at a traditional way of deploying enterprise solutions. And when I say traditional, I mean big enterprises, and Capital One is definitely one of them.

Where we have petabytes of data coming through internally and externally, how do you really leverage data? The traditional way has always been: okay, we take the data, we put it into a data warehouse, and then we will do something on that data warehouse. So to me, the traditional way is: I'm a user, I come in, I have some user interface, I may have a config API, and I'm going to put some information into the user interface and say, okay, I want to run data quality checks on a dataset, and here are some of my data quality rules.

And that gets stored in a database. Now I have created some configuration. I have a deployer and a scheduler that runs on an execution engine. It could be anything, say Spark, or whatever data pipeline you use. It does some pre-processing, it executes, it does some post-processing.

It goes to the data warehouse; that's where you are going to run the checks, and it's going to store those results in a results database. Then there could be dashboards and monitors that people look at, there could be BI tools, and you are creating a bunch of reports. So to me, for a traditional enterprise solution, that's a very common architecture pattern.

People run it after the fact: data has already landed in the data warehouse, you don't know if it's good or bad data, you may have already consumed it, and then you're running some data quality checks. It has worked. It is one way of running data quality checks, but I think it's not cutting it anymore.

So here are some of the challenges, and again, this is not exhaustive. We have a centralized installation, so it only works within that centralized installation. What happens if you now buy another company? Capital One has been on a journey of buying a few companies. How do you assimilate that?

They may not all be connected to an enterprise solution. Also, the user has to come to this centrally managed solution instead of us going to the user. And the rules defined are very manual in nature: someone really needs to understand the data and then manually enter them.

And then, like I said, this is much later in the data lifecycle. The data has already gone through your pipeline, it has landed in your data warehouse, and then you are running checks. That has always caused challenges in big enterprises: hey, I've already started using the data, but the data is not of great quality.

So these are some of the challenges that the industry has faced and is still facing in big enterprises. How do you change it? There are multiple ways. We have talked about a lot of this conceptually in some of the talks, but I'm going to go a little bit into the nitty-gritty details of how we are making changes to shift left.

So one is composable services, and this is nothing new; composable services have been in the industry for a long time. These are SDKs, jars, packages, whatever you want to call them. If you are a Java developer, you work with jars; SDKs and packages have been around for a long time.

The basic principle is that these are federated, modular services. It's a smaller service, deployed agnostic of infrastructure, so you don't need a whole infrastructure to run it. It's integrated with other platforms; it doesn't just work in a silo.

It could be managed, it could be distributed, but it integrates with other platforms. It should have a scheduler and it should be configurable. We talked about data contracts and policy as code; a lot of this ties into how you want to build some of these composable services. So now that we have defined composable services, how does that fit into the new mold of how we are building architecture?

We talked about the traditional way, and again, all of this could be changed. You can now build multiple composable services and chain them together, and you'll see one example here. So again, the same thing: some data user is coming in through a user interface, they're putting in their configuration, and then you have a data pipeline, and that is running closer to you.

You are a producer, and you don't need to use a centralized system. You can have a data publishing SDK that publishes to a data warehouse as part of, let's say, your Spark data pipeline. In that, you now have another composable service, which is a data quality jar. So you are now chaining multiple composable services together.

And the data quality jar does something similar: it takes data from the warehouse, goes to the rules database, scans the data, runs the rules, and puts the results in result storage. The same thing as before, but now you have really shifted left just by going to a composable service. Now, that sounds simple, but it's not that simple.
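The chaining idea can be sketched in a few lines: each composable stage is a small callable, and the data quality stage sits in front of the publish stage, so bad records never land downstream. This is purely illustrative; the stage names and the in-memory "warehouse" are stand-ins, not an actual SDK:

```python
# Hypothetical composable services chained in a producer-side pipeline.
# Each stage takes a batch of records and either passes it along or raises,
# so bad data is stopped before it reaches the warehouse.

class DataQualityError(Exception):
    pass

def dq_check(rules):
    """Composable data quality stage: validates each record against `rules`
    (a dict of column -> predicate) before the batch moves downstream."""
    def stage(records):
        for i, rec in enumerate(records):
            for col, pred in rules.items():
                if not pred(rec.get(col)):
                    raise DataQualityError(f"record {i}: rule failed for {col!r}")
        return records
    return stage

def publish(sink):
    """Composable publishing stage: appends validated records to `sink`,
    which stands in for a warehouse write."""
    def stage(records):
        sink.extend(records)
        return records
    return stage

def run_pipeline(records, stages):
    # Chain the composable services in order.
    for stage in stages:
        records = stage(records)
    return records
```

Because the check runs inside the producer's own pipeline rather than in a central installation after landing, this is the "shift left" in miniature.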

There are multiple things you need to consider when you go to composable services, such as multi-tenancy versus single tenancy. If you want this completely independent of a centralized system, you want to make sure your composable services are up to date, and then you are deploying the whole component in a completely different location.

If you're using multi-tenancy, you have to make all the data you have multi-tenant. You want to segregate the data through access and control how the data is accessed. So all of those need to be considered; depending on what your architecture is, you can build all of those.

Now, that's one way of using composable services: you're shifting left, moving things closer to where the data is being produced. But there are other ways. Like we said, we need good-quality data for building better models.

We want to reduce time and make it more efficient, and data quality is definitely a big component of that. But composable services are just one approach, and I'm going to talk about a few other techniques that could be used to shift left and create a better customer experience.

One is dynamic thresholding, and there are tools out there, open-source things like AWS Deequ, that provide dynamic thresholding. Now, what is dynamic thresholding? You say, hey, I'm going to turn on dynamic thresholding, and what it will do, as the data is going through your data pipeline, is collect the data in a storage layer, let's say AWS S3 or any other storage.

Then it's going to do a historical calculation based on that data to derive thresholds. So if you have a column of integers that ranges from zero to 30, it is automatically going to create a threshold of zero to 30. And then if it finds data outside that range, it's going to start telling you: hey, the data that came in is not correct.
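As a rough sketch of that idea, a band can be derived from the history of observed values (here mean plus or minus k standard deviations, one common choice) and new values checked against it. The function names are illustrative, and production tools such as Deequ do considerably more:

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Derive a (low, high) band from historical values of a numeric column:
    mean +/- k standard deviations, a common default for anomaly bands."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def out_of_band(values, band):
    """Return the incoming values that fall outside the learned band."""
    low, high = band
    return [v for v in values if not (low <= v <= high)]
```

The point of making the band dynamic is that nobody has to hand-enter "between 0 and 30": the history itself defines what normal looks like, and the band moves as the data does.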

The next technique is rules recommendation, and this is another very interesting one. You can take the data, profile it, create metrics based on the profiling, and from the metrics figure out rule recommendations. This is very similar to dynamic thresholding, but you have a lot more flexibility in how you create the rules, and then you can present them to the user.

A big problem we have seen is that users have to really understand the data and what the data quality rules are, and they may be missing out on a lot of things. So this is really helping your producers: here are some rules, you can accept or reject them, but we are really helping you, using machine learning, to come up with those rule recommendations and propose them to you.
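A toy version of that profile-then-recommend flow: compute a few column statistics and turn them into candidate rules for the producer to accept or reject. The rule strings and heuristics here are illustrative only, not a real rules engine:

```python
def profile_column(values):
    """Compute simple profile metrics for one column of data."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct_ratio": len(set(non_null)) / len(non_null) if non_null else 0,
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
    }

def recommend_rules(column_name, values):
    """Turn a column profile into candidate rules a producer can accept or reject."""
    p = profile_column(values)
    rules = []
    if p["null_rate"] == 0:
        rules.append(f"{column_name} IS NOT NULL")
    if p["distinct_ratio"] == 1:
        rules.append(f"{column_name} IS UNIQUE")
    if isinstance(p["min"], (int, float)):
        rules.append(f"{column_name} BETWEEN {p['min']} AND {p['max']}")
    return rules
```

The user-facing part is the important bit: the recommendations are proposals, and the accept/reject feedback can itself train better recommendations over time.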

And with normal machine learning, you have anomaly detection and change point detection, things you can do after you've created all these rules and thresholds. Or maybe you didn't use automated ways and just created them manually. Then you use anomaly detection and change point detection to really figure out the quality of your data.
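Change point detection can be as simple as a CUSUM-style accumulator over deviations from an initial baseline. This toy sketch, which reports the first index where cumulative drift exceeds a threshold, is a stand-in for the more robust detectors a production system would use:

```python
def detect_change_point(series, threshold):
    """CUSUM-style change point detector (toy version).

    Treats the first observation as the baseline, accumulates deviations
    from it, and returns the first index where the cumulative drift
    exceeds `threshold`; returns None if no change point is found.
    """
    baseline = series[0]
    cusum = 0.0
    for i, x in enumerate(series[1:], start=1):
        cusum += x - baseline
        if abs(cusum) > threshold:
            return i
    return None
```

Run over a daily quality metric (null rate, row count, match rate), a detected change point is exactly the "your data shifted" signal that static rules miss.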

So we are moving from a static way of doing data quality to a much more dynamic way. These are some of the techniques the industry is using to really shift left some of the ways we do things. And this is my last slide.

So what's next? And I want to stay honest with the time I was given. We continue to advance this with innovation and experimentation, and Gable has been a great partner to us. We talk about predictive lineage. We have talked about lineage: it's almost a supply chain; you create a visual view of how the data is moving through your data ecosystem.

But it's still manual. You can create some auto-collectors; you have OpenLineage, and people who are in the space will understand. You can go and write collectors for Spark, you can write SQL collectors for Snowflake, and things like that.

Similarly for databases. But still, at the end of the day, you cannot connect everything, especially if you are in a very federated data ecosystem. So how do you use technologies like LLMs and static analysis to combine it all and try to bridge some of those lineage gaps that normally exist in a very federated data ecosystem?

And I don't need to talk about data contracts; there have been a lot of talks around data contracts, and Mark just talked about them in detail. But again, having data contracts, and how you enforce them and shift things to the left, is something companies are looking at to see how it really helps shift the momentum to the left and catch data quality issues much earlier than we do today.
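A minimal sketch of build-time contract enforcement: the producer declares the fields and types it promises, and a check (for example in CI) fails before any data is published if a record no longer conforms. The contract fields here are hypothetical:

```python
# A minimal, hypothetical data contract: the producer declares the schema it
# promises to emit; a build-time check fails the pipeline if output records
# no longer match, catching the break before any bad data flows downstream.

CONTRACT = {
    "transaction_id": str,
    "amount_cents": int,
    "currency": str,
}

def validate_against_contract(record, contract=CONTRACT):
    """Return a list of contract violations for one record (empty = valid)."""
    errors = []
    for fld, typ in contract.items():
        if fld not in record:
            errors.append(f"missing field {fld!r}")
        elif not isinstance(record[fld], typ):
            errors.append(f"{fld!r}: expected {typ.__name__}, "
                          f"got {type(record[fld]).__name__}")
    return errors
```

Wiring a check like this into the producer's build is the "even at build time" enforcement the talk describes: the contract breaks the build, not the consumer.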

So going from catching data quality issues very late in the cycle to catching them much earlier in the lifecycle, where data is being created, even at build time, is really going to help build a mature data ecosystem, and take us to a system where AI models are much more efficient in how they're trained and how their data is being fed.

So that's the future. We talked about LLMs and ML models, and all of them will benefit from shifting left. That concludes my presentation; I wanted to be aware of the time we have, and I think, overall, we made your job easy. Thank you. That is excellent. And there are some good questions coming through in the chat, so I'm going to fire away.

First one is: when should organizations invest in centralized installations versus federated, customized, composable builds? So it depends on the maturity of your organization, I'll say. The moment you start shifting things to the left, you are expecting every team to have a good understanding of all these things.

And that may not be easy in a big organization. So centralized is: hey, you have a core set of people who really understand the technology, they're managing all of that, and they're making it efficient. Because if you start running data pipelines without really understanding data pipelines, you'll run into issues of performance.

You'll run into issues about how to segregate the data, how to run data quality, how to really comprehend it. So again, it depends, and different companies are in different stages of maturity. I will say, if you are very much at the beginning, then go for a centralized approach; it just helps with managing and doing the right thing.

And then as you mature, get more people trained, and feel confident in your overall organization or your team, you start shifting left. Mm-hmm. What is your experience using OpenLineage on Spark? OpenLineage has worked well. We started looking into OpenLineage for Spark and for Flink.

There is OpenLineage for Airflow; when it started, I think it was flaky, to be honest. And there are alternatives, like the open-source Spline, which has worked well. But it's catching up. OpenLineage is becoming more standard, and I think people are moving towards it.

So different adapters can be written, one for OpenLineage, one for non-OpenLineage, and it is becoming more mature. I think OpenLineage has some challenges getting column-level lineage, so that has to mature further. Oh, okay. Alright. So there's a good one coming through here about dynamic thresholding, which actually I was curious about when you brought it up too.

How successful are you with dynamic thresholding, and what are the challenges and mitigation paths you have seen or addressed? Yes. Dynamic thresholding is still experimental, I would say. We're still figuring out how well it works, and it takes time, right? When you introduce something new like this... AWS Deequ, like I said, has this capability in open source, but it is always about trial and error.

So the way to do it is: you introduce the feature, maybe you go through some beta users to see how it's working, check the validity and correctness of it, and then introduce it to a larger population. I think that's how enterprises traditionally do it. They don't go all out; they go for a set group of people.

But it has worked well. And you provide an option of opting out if you think it's not working well, and you collect the data points and adjust accordingly. So, okay. The last one, which is a little bit more general: how can we navigate banks' and legacy organizations' attachment to old enterprise software, as opposed to adopting new technologies or open source? Which I would argue banks tend to use a lot of open source, but that's a whole other story.

Basically, how do you play with that juxtaposition of working with legacy and trying to do the most cutting-edge work? I'll say that it has to come from the top. There's a huge investment that needs to happen, and the culture has to change. I think I'm incredibly lucky to be at Capital One, where we have a CEO who is constantly pushing the boundaries, looking into the future, and building technologies that are much more forward-looking.

So there has to be a huge investment made to go from a legacy mindset to a much more forward-looking technology landscape. And this is what you're seeing in some of the technologies we talked about. This is general architecture, but we are very much aware of it and how we are implementing it, and it's only possible because of how we have gone from an old banking mindset to a modern, technology-driven banking industry.

Yeah, 100%. Well, dude. Abhi, thank you so much for coming on here. This has been brilliant. You make my life easy 'cause you finished on time and we still had plenty of time for questions. I appreciate this so much, and yeah, I'll be seeing you on LinkedIn. Thank you. Thank you so much."