Talk Description
As Glassdoor scaled to petabytes of data, ensuring data quality became critical for maintaining trust and supporting strategic decisions. Glassdoor implemented a proactive, “shift left” strategy focused on embedding data quality practices directly into the development process. This talk will detail how Glassdoor leveraged data contracts, static code analysis integrated into the CI/CD pipeline, and automated anomaly detection to empower software engineers and prevent data issues at the source. Attendees will learn how proactive data quality management reduces risk, promotes stronger collaboration across teams, enhances operational efficiency, and fosters a culture of trust in data at scale.
Additional Shift Left Data Conference Talks
Shifting Left with Data DevOps (recording link)
- Chad Sanderson - Co-Founder & CEO - Gable.ai
Shifting From Reactive to Proactive at Glassdoor (recording link)
- Zakariah Siyaji - Engineering Manager - Glassdoor
Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)
- Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain
Panel: State of the Data And AI Market (recording link)
- Apoorva Pandhi - Managing Director - Zetta Venture Partners
- Matt Turck - Managing Director - FirstMark
- Chris Riccomini - General Partner - Materialized View Capital
- Chad Sanderson (Moderator)
Wayfair’s Multi-year Data Mesh Journey (recording link)
- Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
- Piyush Tiwari - Senior Manager of Engineering - Wayfair
Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)
- Sarah McKenna - CEO - Sequentum
Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)
- Barr Moses - Co-Founder & CEO - Monte Carlo
- Tristan Handy - CEO & Founder - dbt Labs
- Prukalpa Sankar - Co-Founder & CEO - Atlan
- Chad Sanderson (Moderator)
Shift Left with Apache Iceberg Data Products to Power AI (recording link)
- Andrew Madson - Founder - Insights x Design
The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)
- Mark Freeman - Tech Lead - Gable.ai
Building a Scalable Data Foundation in Health Tech (recording link)
- Anna Swigart - Director, Data Engineering - Helix
Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)
- Abhi Ghosh - Head of Data Observability - Capital One
Panel: How AI Is Shifting Data Infrastructure Left (recording link)
- Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
- Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
- Carly Taylor - Field CTO, Gaming - Databricks
- Chad Sanderson (Moderator)
Transcript
*Note: Video transcribed via AI voice-to-text; there may be inconsistencies.
"So Zaki, I am honored to bring you onto the stage now. Thank you so much, Chad.
We will let you exit. And Zaki, you've got a talk for us. You've got some awesome stuff talking about going from reactive to proactive, which is always useful. I'll let you take it over, man. Yep. Thank you, Demetris. And you know, firstly, I love the camera quality, Demetris. It's like watching a movie over there, dude.
Just wait till I break out the guitar again. Uh, hopefully we don't have any technical difficulties. Yeah, yeah, y'all, um, I'm gonna share my screen, but before I do that, uh, just curious, are any folks familiar with Glassdoor? Hopefully so. It's a well-known name in the jobs world. Very much so. I love Glassdoor.
I do too. I bleed green. I bet you do. Yes.
All right. So it's a very transformational time at Glassdoor. As many folks know, Glassdoor is, uh, well known for being a jobs and salary site, and we've shifted to becoming a community for workplace conversations. And that means that the insights that we gather from our data and the quality that's associated with those insights are more important than ever.
And so we realized about a year ago, as we were making this transformational shift, that we had to make a really strong, concerted effort to make sure that the data that we provide to our consumers is of the highest quality. Otherwise we won't be able to effectively measure the impact of the work that we're doing.
In order to do that, we introduced the shift left paradigm at Glassdoor. So I'm gonna go over what steps we took and the shifts that had to be made internally in order for us to adopt this paradigm.
Firstly, there were three key things that were pretty beneficial for us in driving the adoption of the shift left paradigm. Number one was culture change. The second was a shift in the tools and technologies that we had to leverage. And then I'll go over what the key outcomes and learnings were, and some of the characteristics of the culture changes and the tools and technologies that drove the adoption of the shift left paradigm at Glassdoor.
And then after that, we'll wrap up with Q&A. So, going into culture change: one really interesting thing that we found at Glassdoor was that folks naturally gravitated towards the shift left paradigm. And so oftentimes when I preach about shift left at Glassdoor, the question that I often get from data folks is: how are you going to be able to drive adoption of the shift left paradigm within product engineering and QA orgs?
And so what we found was that product engineering teams are intrinsically motivated to build reliable systems, and QA teams, by their very nature, are focused on defect prevention. And so ultimately, these teams taking responsibility for data production means that teams are better informed and data outages can be prevented instead of reactively addressed.
This all goes into seamless collaboration, 'cause when you have your data producers talking to your data consumers, you're able to effectively ensure that the data steward isn't stuck in the middle trying to figure out what it is that folks actually want.
There were a couple tools and technologies that were quite critical for the shift left paradigm, but the three characteristics that I want to talk about that enabled the shift left were, firstly, static code analysis; secondly, data contracts; and thirdly, the write-audit-publish paradigm. And you'll notice that these three things are all very proactive as opposed to reactive.
Going to static code analysis: we leveraged Gable.ai, and it enabled us to statically analyze the data-producing applications to determine if there were any violations of data contracts, which then leads us into the data contracts. Data contracts are a way for us to formalize the agreement between data producers and data consumers.
When you pair static code analysis with data contracts, you're able to effectively identify if a change that you're going to make upstream will have some impact on downstream systems, whether that might be ETL or ML models or dashboards or anything else. And then thirdly, once you make it into the data world, we follow the write-audit-publish paradigm.
And I'll talk more about the similarities between the write-audit-publish paradigm and blue-green deployments. But effectively, what the write-audit-publish paradigm allows you to do is check that the characteristics of the data, once it's actually live in some sort of staging location, conform to the characteristics of the data contract.
You're able to audit them to ensure that all of those characteristics and those expectations are met before it's published to the consumer.
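A data contract like the ones described here can be sketched as code. The following is a minimal, hypothetical example; the field names, contract shape, and validation rules are made up for illustration and are not Glassdoor's actual Gable/Schema Registry format.

```python
# Minimal sketch of a data contract as code (illustrative only).
from dataclasses import dataclass, field

@dataclass
class DataContract:
    name: str
    schema: dict = field(default_factory=dict)   # field name -> expected type
    required: set = field(default_factory=set)   # fields that must be present

    def violations(self, record: dict) -> list:
        """Return a list of human-readable contract violations for one record."""
        problems = []
        for col in self.required:
            if col not in record:
                problems.append(f"missing required field: {col}")
        for col, value in record.items():
            expected = self.schema.get(col)
            if expected is not None and not isinstance(value, expected):
                problems.append(
                    f"{col}: expected {expected.__name__}, got {type(value).__name__}"
                )
        return problems

# Usage: a producer-side check before data ever leaves the application.
impressions_contract = DataContract(
    name="brand_impressions",
    schema={"employer_id": int, "user_id": str, "ts": float},
    required={"employer_id", "ts"},
)

good = {"employer_id": 42, "user_id": "u-1", "ts": 1700000000.0}
bad = {"user_id": 7}

assert impressions_contract.violations(good) == []
print(impressions_contract.violations(bad))  # flags missing and mistyped fields
```

The point of the sketch is that a contract is just a checkable artifact: once producers and consumers agree on it, both a static analyzer and a runtime gate can validate against the same definition.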
So what does this look like in practice? We embedded static code analysis with Gable.ai within our CI/CD workflow. So we have our data-producing applications on the left, and we statically analyze that code to ensure that it conforms to the data contract. Now, there are also certain characteristics that aren't easily identifiable when it comes to ensuring a high bar for data quality.
An example of this might be if you swap the first and last names within your data-producing application. This will still conform to the contract, but really this is a business logic violation. And so we also leverage an LLM through Gable.ai in order to identify those types of characteristics. And if there are violations, we kick off a data quality alert, and there are a variety of options that we have here: we could block the pull request.
We could also raise this alert with the data producers. We could do a soft block and request that they make whatever alterations are needed before they merge to production. But ultimately, all of these mechanisms help us to prevent bad data from making its way into production in the first place.
And if everything looks good, then the pull request is promoted into production.
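The CI gate described above, comparing the schema a producer emits against the registered contract and then hard-blocking or soft-blocking the pull request, might look roughly like this. This is a sketch under stated assumptions: the schema-diff logic and function names are hypothetical, and Gable's actual CLI/API is not shown.

```python
# Hypothetical CI gate: classify drift between contracted and produced
# schemas, then decide whether to fail the CI job (hard block) or warn.
def diff_schemas(contract: dict, produced: dict) -> dict:
    """Classify drift between the contracted and produced schemas."""
    return {
        "removed": sorted(set(contract) - set(produced)),   # breaking
        "retyped": sorted(c for c in contract
                          if c in produced and produced[c] != contract[c]),  # breaking
        "added": sorted(set(produced) - set(contract)),     # usually safe
    }

def ci_gate(contract: dict, produced: dict, hard_block: bool = True) -> int:
    drift = diff_schemas(contract, produced)
    breaking = drift["removed"] or drift["retyped"]
    if breaking and hard_block:
        print(f"BLOCK: breaking contract change {drift}")
        return 1   # non-zero exit fails the CI job, so the PR is blocked
    if breaking:
        print(f"WARN (soft block): {drift}, please coordinate with consumers")
    return 0

contract = {"employer_id": "int", "ts": "double"}
produced = {"employer_id": "string", "ts": "double", "page": "string"}
rc = ci_gate(contract, produced, hard_block=False)  # retyped field -> warning
```

The soft-block path mirrors what Zaki describes: the change is surfaced to the producers for discussion rather than silently merged.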
Well, now the question is, what does data contract enforcement look like? Once the data's actually being produced and emitted from these data-producing applications, generally there's some sort of gateway to the data world. At Glassdoor, it happens to be the Confluent platform. And so we leverage Kafka, and the Confluent platform also supports data contracts.
And so those data contracts reside within the Schema Registry. You can see here that you have your data-producing applications; the code goes through the CI/CD pipeline, it passes the checks, and then it lands within the realm of the Confluent platform. And if the data that's being produced conforms to the Schema Registry, then that data is promoted to a topic.
And if it doesn't, then we kick off alerting. And we're also sending that bad data to a dead letter queue so that it's not infiltrating production systems downstream.
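The gateway behavior described here, conforming records go to the topic, non-conforming records go to a dead letter queue plus an alert, can be sketched with plain Python stand-ins. The real system would use Kafka producer and Schema Registry clients; those are deliberately not mocked up here.

```python
# Sketch: route records that match the registered schema to the main topic,
# and everything else to a dead-letter queue instead of production.
EXPECTED = {"employer_id": int, "ts": float}   # hypothetical registered schema

main_topic, dead_letter = [], []

def conforms(record: dict) -> bool:
    return (set(record) == set(EXPECTED)
            and all(isinstance(record[k], t) for k, t in EXPECTED.items()))

def produce(record: dict) -> None:
    if conforms(record):
        main_topic.append(record)    # promoted to the topic
    else:
        dead_letter.append(record)   # quarantined, plus an alert
        print(f"ALERT: contract violation routed to DLQ: {record}")

produce({"employer_id": 1, "ts": 1700000000.0})       # conforms -> main topic
produce({"employer_id": "oops", "ts": 1700000000.0})  # mistyped -> DLQ
```

The design choice worth noting is that bad data is never dropped silently: the DLQ preserves it for investigation while keeping it out of downstream systems.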
The third method that we employ is the write-audit-publish paradigm. Write-audit-publish is very similar to blue-green deployments in the software world. It first stages the data in some sort of staging location, and it allows you to audit that data to make sure that it conforms to certain expectations.
And if everything goes well, then you're able to swap the location from the staging location to the production location.
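The write-audit-publish flow just described can be sketched in a few lines. This is a rough illustration under assumptions: the staging and production locations are plain lists, and the audit rule is a made-up expectation, not Glassdoor's actual checks.

```python
# Rough write-audit-publish sketch: write to staging, audit against
# expectations, and only then swap staging into the production location.
staging, production = [], []

def audit(rows) -> bool:
    """Audit stage: every row must meet the contract's expectations."""
    return all(
        row.get("employer_id") is not None and row.get("ts", 0) > 0
        for row in rows
    )

def write_audit_publish(rows) -> bool:
    staging.clear()
    staging.extend(rows)           # 1. write: land data in staging only
    if not audit(staging):         # 2. audit: check expectations pre-publish
        print("audit failed: production untouched, alerting instead")
        return False
    production.clear()             # 3. publish: swap staging into production
    production.extend(staging)
    return True

assert write_audit_publish([{"employer_id": 1, "ts": 1.0}]) is True
assert write_audit_publish([{"employer_id": None, "ts": 1.0}]) is False
assert production == [{"employer_id": 1, "ts": 1.0}]  # bad batch never published
```

As with blue-green deployments, the key property is that consumers only ever see data that has already passed the audit.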
So now, what does this full picture look like? You can see your data-producing applications. We statically analyze that code; it then gets merged into production. That production application is now producing data, and the data that's being produced is being checked and audited within our streaming systems.
It is then checked and audited again within our batch systems. And then ultimately, there are certain characteristics that are simply outside the realm of what we can do proactively. As Chad was mentioning towards the end of his talk, there are certain things that are just outside the realm of what we can do within the code base.
And so if there are events out in the world that are happening, like political events and things like that, that may change the nature of the data, you still need anomaly detection and reactivity in that regard. So what were our outcomes and learnings from this? What we found is that obviously there exists some sort of data life cycle, but the data life cycle doesn't start from the bronze layer of your medallion architecture.
The data life cycle actually starts with the data-producing applications, and that is the crux of the shift left methodology. You can't start with data quality within the realm of the data stewards. You have to have a conversation between the data producers and the consumers so that they align on what the contract should be.
And that contract can then be handed off to a data engineer that can then make sure that it's implemented within the realm of the data world.
So the top three things that we found were beneficial from shifting left: the first was that there was alignment between data producers and consumers. I can't tell you the number of times that I've seen misalignment between producers and consumers and the frustration that consumers had because of that misalignment.
Similarly, that frustration then goes on to the data stewards, because when the consumer's unhappy, well, who are they gonna look to? They're gonna shift left and they're gonna look at the data stewards. And the data stewards and data engineers are then bogged down by all these requests and questions.
And they're reactive because they're getting all these ad hoc requests, which makes planning for them very difficult. And it's a very expensive process. I've even seen cases of data quality issues taking up to a month for us to investigate, 'cause we have to get all these folks in a room to understand what the actual flow of data is and what all the points are at which this data, and the nature of that data, is potentially changing.
And then a lot of times what we realize is that it's the data producers that need to take ownership of what's going on upstream and have a conversation with the data stewards and consumers to align on what the nature of the data should look like. So when we implement shift left, this results in an increase in trust in the data.
And ultimately that is what we want for our consumers. You want the consumer to trust that the data that's being produced upstream of them is reliable: something they can consume, that they can build models on top of, that they can have reporting on top of, that they can build dashboards on, that they can achieve insights from.
And if that isn't there, if your trust is not there, then effectively your business will fail.
So now I thought it'd be great for me to share, uh, not just the learnings, but also things that you folks can do in order to enable the shift left paradigm in your own workflows. The first part is that this isn't simply a technological solution that you can apply and it'll solve all of your problems.
There must be a conversation between the data producers and the consumers, so collaboration needs to happen. And once you have collaboration, then you need to protect your data assets. And protecting the data assets doesn't just happen within the warehouse. It starts from the data-producing applications.
But now it's a question of how you enforce that. The way that we did it was by leveraging Gable.ai within our CI/CD workflow, where we were able to statically analyze that code and make sure that it doesn't violate data contracts. And then ultimately, you need to continue to monitor and have alerting and observability in place so that you can tend to these issues as soon as they happen.
And in the case of proactivity, before they happen.
Are there any questions for me?
There most definitely are, always. Let's see, I'm sure the folks are... they're coming in. I have one that I see right now in the Q&A, but while we're waiting for more to trickle in, let me just start with this one.
One of the key components is speed and responsiveness. Do you have benchmarks that can be used? I'm not sure if that is specifically for you, or if that's just in general, like for data contracts. But I'll let you take it as you will. Yeah. So the amazing thing about being proactive is this: when you're in a reactive stance, the question is how fast you can tend to the issue so that it doesn't propagate too much into downstream systems. But when you're proactive, you're not as worried about how fast you can run towards this problem. That's the benefit of being proactive: when you stop that data from being produced, it doesn't mean that you have to run to it and take it up as an ad hoc.
You're able to have a conversation, and you're not time-bound and you're not worrying about burning down the clock. So proactivity enables us to have time to have conversations. So, can you give an example of an actual data quality improvement after shifting left? Mm-hmm. Yeah. So the way Glassdoor makes money is by selling ads for jobs.
And so for us, brand impressions are really important. The way the flow of brand impressions works at Glassdoor is that it goes through many hands. It goes from data-producing applications through another application, then through an API gateway and a variety of systems within AWS, and then it goes through ETL.
And so when there are so many handoffs, the question that we have is, well, what's the flow of data, and what are all the points at which there could be data failure? And so with data contracts, we have greater reliability, because we can ensure that the data that's coming in is of high quality, 'cause it conforms to the contract.
And also, if there are changes upstream, we can ensure that those changes are discussed with consumers downstream and data stewards downstream before there are any breakages to ETL pipelines or dashboards and things like that. And so, because brand impressions are so critical to revenue at Glassdoor, that was one of the first things that we tested out when we wanted to adopt the shift left paradigm.
And so we benefit from fewer and fewer outages when it comes to brand impressions, and we also benefit from greater reliability of that data and greater trust in that data. So we benefited primarily from a revenue perspective, but also, as more and more data assets adopt the shift left paradigm, this will extend not just to revenue, but also to ML models and dashboards and other pieces as well.
Excellent. Now, a few more coming through here. What does the initial collaboration discussion look like? And actually, sorry, there was a little more to that: can you walk us through an end-to-end example? And then to piggyback on that, another question that came through, which is in the same vein: who owns this data contract implementation in the organization?
Mm-hmm. Those are some great questions. So I'll start off with the first one, which is essentially: what does the contract authoring flow look like? Well, we use Gable in order to solidify the contract. And generally what'll happen is we'll have a request for, like, a new field within the contract, or a new field within a data asset, or a new asset entirely.
And so the team that is requesting that asset will have a conversation with the data producer, meaning whatever software team is building the application that'll produce the data. And within Gable, there's a UI that can be leveraged that allows you to essentially build your asset and align on the characteristics of what the asset should look like.
So the column names, naming conventions, the expectations, the schema, the data types associated with those columns: all of those characteristics are then used to build a data asset, and the data producers and consumers then align on it. And it's essentially like a checklist. Okay, did the producer sign off?
Did the consumer sign off? Are the stewards on the same page? Once everyone's on the same page, then implementation can begin.
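The sign-off checklist described here is simple enough to sketch directly. The role names below are illustrative; Gable tracks this inside its UI rather than in ad hoc code like this.

```python
# Sketch of the contract sign-off gate: implementation on a contract only
# begins once producer, consumer, and steward have all agreed.
SIGNOFFS_REQUIRED = {"producer", "consumer", "steward"}

def ready_to_implement(signoffs: set) -> bool:
    # Subset check: every required party must have signed off.
    return SIGNOFFS_REQUIRED <= signoffs

assert not ready_to_implement({"producer", "consumer"})   # steward missing
assert ready_to_implement({"producer", "consumer", "steward"})
```

Trivial as it is, this gate is the whole point of the authoring flow: no party can be skipped, so misalignment surfaces before any pipeline code is written.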
I think this is one of the best question-and-answer chats that I have ever witnessed. There are so many coming through here, and it is really hard for me to choose which ones to ask, but I'm gonna do the difficult task. All right, Zaki, so bear with me now. This may not be super relevant to Glassdoor, but I would be interested if they came across diversity issues on their landscape.
When you have several applications that theoretically have the same capabilities but may generate different schemas, like two types of ERPs or POS systems that then feed the same consumer. If so, how did that get tackled? System by system, and if so, how to prioritize? Can you repeat the question? So: maybe not super relevant to Glassdoor, but I would be interested if they came across diversity issues on their landscape.
For example, when you have several applications that theoretically have the same capabilities but may generate different schemas, like two types of ERPs or POS systems that then feed the same consumer. Mm-hmm. I see. Yeah, so we've come across some cases where the nature of the data should be the same across various surfaces.
So, like, the schema should be the same across web and mobile, and within mobile, across Android and iOS. And so what we did was we turned on automated contracting through Gable, which discovered all the data assets and then allowed us to identify whether there's schema conformity across various systems.
And so, does the nature of the data asset that's being produced within iOS match that of Android? Does it match that of web? Things like that. So that's how we were able to ensure that, you know, schemas are consistent across various systems. Did you run into significant pushback when implementing these process changes, either upstream or downstream?
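The cross-surface conformity check just described, does the same logical asset carry the same schema on iOS, Android, and web, can be sketched like this. The discovered schemas are hypothetical stand-ins for what automated contracting would surface.

```python
# Sketch: group surfaces by the schema shape they emit; more than one
# group means the "same" asset has drifted across platforms.
discovered = {
    "ios":     {"employer_id": "int", "ts": "double"},
    "android": {"employer_id": "int", "ts": "double"},
    "web":     {"employer_id": "int", "ts": "string"},   # drifted on 'ts'
}

def conformity_report(schemas: dict) -> dict:
    """Map each distinct schema shape to the surfaces that emit it."""
    groups = {}
    for surface, schema in schemas.items():
        shape = tuple(sorted(schema.items()))   # hashable schema fingerprint
        groups.setdefault(shape, []).append(surface)
    return groups

groups = conformity_report(discovered)
if len(groups) > 1:
    print("schema drift across surfaces:", list(groups.values()))
# here iOS and Android agree, while web has drifted
```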
If so, what was the pushback, and what were the solutions or workarounds? Yeah, so that was the thing that I was actually pleasantly surprised by. I was anticipating pushback, and it was so interesting to me, 'cause I was like, is it gonna come today? Is it gonna come next month? Is it gonna come this week?
You had your shield out and everything. You were ready. Yeah, I was ready. And no, folks naturally gravitated towards it, and when they heard about this product, they wanted to learn more. And with this paradigm in general, they were just like, oh, this makes so much sense.
And so, yeah, that was one thing that I noted in the presentation: product engineers naturally gravitate towards developing systems that are resilient, that are of high quality, and that have some sort of function that's usable downstream. And it's similar with the QA engineering teams; it's in every cell of their body.
They want to produce, you know, applications; they wanna validate applications to make sure that the experience for those that are using the application is great, right? And so that naturally extends to data. So, how do you handle schema evolution? That's a good question. That's something that we're working on right now, but essentially, just like the schema evolves, the contract also needs to evolve, right?
So that's an area where there's still some improvement to be made. Yeah. Okay. How do you handle unannounced or unexpected changes to source data? I think this might be very similar: data formats or data content from data sources that are outside of your control, like external APIs or SCADA systems.
Hmm, good question. So we haven't actually come across any changes to the actual data formats. Mm-hmm. That's not very common at Glassdoor. I'd actually be pretty surprised if, you know, some data producer just randomly switched from, like, ORC format to Parquet overnight without any announcement.
What is most likely to happen are actual changes within the data itself. So, like, changes to the schema: adding a column, removing a column or multiple columns, things like that. Okay. Now Siva asked this one, and I gotta get to it. Does the shift left approach mandate the strategy of zero-copy clone?
Or does it still promote the data platforms to bring data from producers to host on the dedicated platforms, which in turn creates copies?
That's a good question. I think I'd have to follow up on that. Yeah, let me think on that some more. So I think we're getting to the end of this, but we've got one more. By the way, for the person who asked that question: if you could reach out to me on LinkedIn, I wanna get back to you on that. Yeah, my LinkedIn.
Awesome. Awesome. Siva, you know, we'll be looking for that LinkedIn connection. Zaki's gonna be waiting for you. Yep. So I'm trying to sift through. I'm telling you, man, this is great. I love all the questions coming through; keep 'em coming. And I did not do the thing of answering them all in order, or asking them all in order.
So now I'm going back, making sure I didn't miss any. For multi- and hybrid-cloud organizations with aspirations to move towards establishing open table formats in source systems, do we need to consider products like Confluent to establish a common schema registry? Hmm, interesting. Well, I don't think you necessarily need to be tied to a specific product, but when you shift left to the application code layer, you're not as worried about the downstream systems.
But we still want quality checks every step of the way. So, you know, whether it's a single schema registry or multiple schema registries, you just have to manage those contracts someplace. And the way that we manage all of our contracts across the board is through Gable. Okay. One for Zaki:
How did you track the performance and business value of the changes you made? Did you define and measure any specific KPIs or OKRs to show value to the business? The suits? Yeah, that's a good question. So, interestingly, just like how product and QA engineering teams naturally gravitated towards the shift left paradigm, folks naturally saw the value in it.
You know, the suits, the higher-ups, naturally saw value in the shift left paradigm as well, especially when we were able to protect things like brand impressions, like I had mentioned earlier, right? Mm-hmm. Because if we see fewer outages and fewer issues there and fewer drops in those impressions, that means that there's less revenue loss.
And so, you know, obviously when you talk about money, that piques everyone's interest. Now, it's tough to actually validate what doesn't happen; it's very easy to test what does happen. But we were able to see, from my perspective, a lower frequency in the number of issues that we saw tied to brand impressions.
Oh, brilliant. Okay. There is another one that I'm gonna try and sneak in. Can you implement data contracts if you don't have great product requirements from consumers, or domain governance and ownership? Like, no one to define the contract. Yeah. The contract has to be defined in order for this to work, right?
Like, that's kind of the basis of the shift left method. Without that conversation happening between the producer and consumer, there's no way that you're gonna have a successful end-to-end pipeline that is going to produce some sort of trusted data. And I'd be interested to hear more about how it is that you're developing a product that you wanna gather insights from without knowing what it is that you want to gain insights about.
So yeah, you gotta have the requirements gathered. I'll tell you how: it's called vibe coding. I don't know if you've heard of it. I have. I have heard some horror stories about it too, though. Yeah. So, let's see, I wanna make sure we get all of them. Have you thought about how AI can leverage metadata to improve data processing, analysis, or decision making?
Hmm, that is a good question. Internally within Glassdoor, like within our platform teams, we are using some methods that I can't discuss yet. But that sounds exciting. Yeah, there are some interesting projects underway. But more broadly, we leverage tools like Gable that, you know, leverage AI and LLMs.
And, you know, one of them that I discussed was reading through the application code layer to identify semantic challenges, like swapping first name and last name, stuff like that. So there are pieces like that that we have active and live within our platform and systems.
Does this way of working impact data request workflows, and how, if at all, does Gable help mitigate any negative impacts this could cause? Mm-hmm. Yeah. I'd say the change to the workflow is beneficial. It gets everyone on the same page. Everyone's discussing what happens to the data: how do we produce the asset, what should the asset look like?
All the folks that need to be involved in that conversation are involved, which previously was not the case. And so we didn't see hindrances to the workflow; we saw improvements to the workflow. Oh yeah. Fascinating. And when you say improvements to the workflow, is that just because... 'cause I think one thing that folks like myself have thought about a bunch is, isn't this just gonna add one more step and be a pain in my ass?
Yeah, that's a great question. What the real pain is, is you implement something, you spent two months, you built out this full pipeline, it's beautiful, you love the code, best for loops, best if statements, right? And then you wanna ship it, and now it's producing data that the consumer doesn't actually want, right?
Mm-hmm. And that's actually happened to me before. And then the question that I got was, well, what did the consumer actually want? And I was like, well, we went back and forth, but it turns out the producer wasn't on the same page. And that was definitely painful for me, 'cause I was literally asking for reviews on my PR.
I was like, I have my ETL job ready, I have my Airflow DAG, right? It's right here. I wanna ship it. And then I got that brutal question, and the consumer ultimately was not happy. So yeah, data contracts have shifted our workflow for building data products and data assets in a very positive way.
Excellent. Well, Zaki, thank you so much, man. I'm gonna keep us moving, 'cause you can just call me Ringo. I am the timekeeper; I gotta keep us on the beat today. That is what my job is. There are a few more questions that I didn't get to ask you coming through in the Q&A and the chat. So if you have time to stick around, I'm sure folks would love that.
Otherwise, connect with Zaki on LinkedIn. Especially, I think, was it Siva? Yeah, Siva, we're looking for you on LinkedIn. All right, that's it for now. See you later, Zaki. See y'all."