Talk Description

This talk covers Adevinta Spain's transition from a best-effort governance model to a governed-by-design data integration system. By creating source-aligned data products, the shift aims to improve data quality and reliability from the moment data is ingested.

Additional Shift Left Data Conference Talks

Shifting Left with Data DevOps (recording link)

  • Chad Sanderson - Co-Founder & CEO - Gable.ai

Shifting From Reactive to Proactive at Glassdoor (recording link)

  • Zakariah Siyaji - Engineering Manager - Glassdoor

Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)

  • Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain

Panel: State of the Data And AI Market (recording link)

  • Apoorva Pandhi - Managing Director - Zetta Venture Partners
  • Matt Turck - Managing Director - FirstMark
  • Chris Riccomini - General Partner - Materialized View Capital
  • Chad Sanderson (Moderator)

Wayfair’s Multi-year Data Mesh Journey (recording link)

  • Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
  • Piyush Tiwari - Senior Manager of Engineering - Wayfair

Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)

  • Sarah McKenna - CEO - Sequentum

Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)

  • Barr Moses - Co-Founder & CEO - Monte Carlo
  • Tristan Handy - CEO & Founder - dbt Labs
  • Prukalpa Sankar - Co-Founder & CEO - Atlan
  • Chad Sanderson (Moderator)

Shift Left with Apache Iceberg Data Products to Power AI (recording link)

  • Andrew Madson - Founder - Insights x Design

The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)

  • Mark Freeman - Tech Lead - Gable.ai

Building a Scalable Data Foundation in Health Tech (recording link)

  • Anna Swigart - Director, Data Engineering - Helix

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)

  • Abhi Ghosh - Head of Data Observability - Capital One

Panel: How AI Is Shifting Data Infrastructure Left (recording link)

  • Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
  • Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
  • Carly Taylor - Field CTO, Gaming - Databricks
  • Chad Sanderson (Moderator)

Transcript

Note: Video transcribed via AI voice-to-text; there may be inconsistencies.

" All right, so we're gonna transition this train over. Oh, mark, you're back. Look at that. I'm back. I'm back. And I'm super excited to introduce Sergio and you gimme some quick context.

I was creating my data quality course, and I was researching who's doing data contracts in the real world outside of Gable, who's actually taking these best practices and pushing them forward. And I came across Sergio's article about their homegrown implementation of data contracts.

I immediately found him on LinkedIn and said, I have to have you talk at our conference, please share what you're doing, because I've read their article. I'll go find the article and try to link it in the chat as well; it's a great read. But you'll hear from him right now. So welcome, Sergio.

Thank you very much. I'm really glad you enjoyed the article. As I said, this is our humble implementation, and I'm going to talk about it. I hope you'll interact. I'm going to share my screen. Okay. All right. I can see it. Excellent.

Yes, that's it. I'm going to talk about our implementation. It may not be the best one, but as Mark said, it's the one that is working for us, and we are achieving some success with it. So, just a little bit about me. My name is Sergio, as you said. I am originally from aia, which is, well, for European standards, a small to medium-sized town in the top left corner of Spain.

I have two kids. I've worked for several companies, and now I work at Adevinta. What is Adevinta? An online classifieds company. The focus is on Europe; we also have some marketplaces in Canada and Brazil, but the main focus is Europe. My team and I work for Adevinta Spain.

And in Spain we have these six sites. We manage different kinds of data, from real estate to job searching to car selling, plus a generalist site as well. To give you a glimpse of the amount of data we are processing: we have 600 terabytes in the data lake right now.

And we are ingesting around four terabytes of data daily. So let's go to the challenge we faced when we started to implement data contracts. First, what we call an event: the kind of data we are processing in this company for analytical purposes is the evidence of client behavior on the sites.

Every time a client clicks a button on a site, publishes an ad, deletes an ad, saves an ad as a favorite, or something like that, an event is generated and sent through a microservice to Kafka. These microservices also work in the operational world, but we need to gather that data from the operational world and carry it to the analytical world, right?
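To make that concrete, here is a rough sketch of what one of these behavioural events and the microservice-side send to Kafka might look like. The field names, topic name, and the kafka-python client are illustrative assumptions, not Adevinta's actual implementation.

    import json
    from kafka import KafkaProducer  # kafka-python; illustrative client choice

    # Hypothetical behavioural event: a client published an ad on one of the sites.
    event = {
        "event_name": "ad_published",
        "event_id": "evt-001",               # later usable as a relation key
        "timestamp": "2024-03-01T10:15:00Z",
        "site": "fotocasa",
        "user_id": "u-12345",                # client identification
        "payload": {
            "ad_id": "a-98765",
            "name": "Jane Doe",              # personal data
            "email": "jane@example.com",     # personal data
        },
    }

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("behavioural-events.fotocasa.ad_published", event)
    producer.flush()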

So this was the only ingest pipeline we had. The first issue we found here is that we had a regular expression to match topic names, so every topic in the company was being ingested if it matched a prefix. We found ourselves ingesting data without any use cases, and this way the costs are higher, both in storage and in processing.
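A tiny sketch of the problem just described, assuming a made-up topic prefix: any topic matching the prefix regex got ingested, whether or not anyone consumed it.

    import re

    TOPIC_PATTERN = re.compile(r"^behavioural-events\..*")  # assumed prefix

    all_topics = [
        "behavioural-events.fotocasa.ad_published",   # has consumers
        "behavioural-events.infojobs.debug_clicks",   # no analytical use case
        "internal.billing.invoices",                  # different prefix, ignored
    ]
    ingested = [t for t in all_topics if TOPIC_PATTERN.match(t)]
    # Both behavioural topics are ingested, useful or not, which drives up
    # storage and processing costs.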

And we needed to fix that. Another issue is that there was no schema validation. The producers in the operational world define a schema in a repository, and that schema is used by them to operate the website through microservices and so on. But in the data world, we only used that schema once.

We used the schema only to create the table: the first time an event arrived, we used the schema to create the output table. We have schema evolution, but in this whole ingestion pipeline we did not validate the schema for every event that arrived.

And so we had unexpected errors: someone changed the data without changing the schema, or the other way around. We also had silent errors, because some schemas were somehow compatible but not fully compatible, so we would find ourselves with null values in the output table.
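As an illustration of that kind of silent error, consider a producer renaming a field without updating the schema: nothing fails loudly, the old column simply starts arriving empty. Field names here are made up.

    # Schema as registered by the producer.
    schema_fields = {"ad_id": "string", "price": "int"}

    # Event after the producer silently renamed "price" to "amount".
    event = {"ad_id": "a-1", "amount": 1200}

    row = {field: event.get(field) for field in schema_fields}
    # row == {"ad_id": "a-1", "price": None}
    # The pipeline keeps running, but the output table fills with null prices.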

Also, this schema definition is just that, a schema definition: a list of fields with their types. There is no information about personal data, the PII that was mentioned in the first talk. So the identification of personal data relied completely on each of the producer teams.

So there was no central management of personal data in the company. This made GDPR compliance, which is a huge deal in Europe, a pain; we spent a lot of time working on that. And in the end, with this whole old pipeline, we found ourselves in the middle.

We were always handling issues with the data, from the producer side or the consumer side, schema mismanagement or something like that. We calculated that we spent around two days per week fixing issues with the ingestion of the data. So at times we considered ourselves firefighters instead of data engineers, right?

Instead of focusing on creating value for the company, we focused on fixing issues. Well, this is a summary; I've already gone through all of this. So let's go to the solution. The first thing we needed to change with the solution was our position in the data flow, right?

We want the producer and the consumer to speak among themselves and reach a data agreement, and we as a data platform to be facilitators instead of firefighters. We want to provide them with tools and procedures and everything needed so that the data flows from the producer to the consumer without us being in the middle fixing issues on a daily basis, right?

So let's start with the contract definition, which is no more and no less than a simple, serialized definition. This is it on the left, and on the right you have the detail. The first field is the contract name, which is the event name; in the sample, "ad published": a new ad was published on any of the sites.

So this is the contract for it. Then a contract version, obviously, to keep track of it. And then a start date and an end date. This is really important because, as the operational world evolves and the sites evolve, the data, the frontends and so on, the events evolve as well.

So what is valid now may not be valid tomorrow, right? We can deprecate a contract and create a new one with the proper version for the new event. Then there is the schema definition, which is just a URL into the existing schema repository where the operational world defines its schemas.

This is because we didn't want to add overhead to their process. The adoption of data contracts is always an issue, as someone said before, so we use the schema as it was. Then the Kafka topic: just the name of the topic, because the Kafka configuration is the same for all the events per site.

That configuration lives somewhere else, in the code, in a configuration file, and neither the producer nor the consumer needs to be aware of it. And then here is one of the most important sections: the personal data section. It's optional, because not all of the events carry personal information.

In this section, we list each and every one of the fields that carry personal information, with their names, payload.name and payload.email in this example. And then we have a boolean, analytical, which represents whether this field has an analytical purpose or not. If it does not have an analytical purpose, we delete it completely during ingestion.

This way we minimize the amount of personal information in the data platform. The other way around, if the field does have an analytical purpose, we add an optional mapping so we can store it under a well-known name, because payload.name may be a bit weird to a consumer, and name is more easily identified.

We also want to separate personal data from non-personal data in the data platform, so we can apply access control to the personal data in a more secure way. Then we need something to relate the personal data to its equivalent non-personal data, and that is the relation key.

The relation key is normally the ID of the event, so we can join the personal data with the non-personal data if needed; the consumer can do that join. And then we have the user ID, which is just the client identification. So if a client exercises their right to be forgotten, for instance, we can delete all of their data in a single execution.
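A hedged sketch of how that single-execution deletion could look, assuming the personal data sits in Delta tables keyed by the client identifier; table and column names are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def forget_user(user_id: str, personal_tables: list[str]) -> None:
        """Delete every personal-data row for one client across all personal tables.
        DELETE FROM requires a table format that supports it, e.g. Delta."""
        for table in personal_tables:
            spark.sql(f"DELETE FROM {table} WHERE user_id = '{user_id}'")

    forget_user("u-12345", ["pii.ad_published", "pii.ad_saved"])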

Then we have the SLA section, which is some metadata. First is the source, which is just the microservice that produces the data; this allows us to group events by source. Then we have the owner and a contact channel for them, so if there is some issue with the data, the consumer can check the contract and ask the owners what happened, or maybe they forgot to update the contract or whatever.

The contact is normally a Slack channel or an email. Then we have a periodicity, hourly or daily, those are the two options, and if it's daily, an execution hour, say 7:00 AM or 8:00 PM. Then a time to recover, which is by default seven days. And retention is something that is in the contract but that we still need to implement.

We want to delete the data once we reach the retention time; in this sample, 90 days, so after 90 days this data will be deleted. And then the provider IDs. This is just something internal to the company: the same event may apply to several websites, for example fotocasa.com and habitaclia.com.

So the same event can come from those two sites. Okay. Now we have the contract, but we want to ensure its adoption, right? First, we don't want the producer to have to create the contract manually. That worked for the MVP, but as the process matured, we wanted it to happen in a more automated way.
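Pulling the sections described above together, here is a rough sketch of the contract's shape, written as a Python dict for readability. Every key name is an approximation of what was described in the talk, not Adevinta's actual format.

    contract = {
        "name": "ad_published",                 # the event name
        "version": "1.0.0",
        "start_date": "2024-01-01",
        "end_date": None,                       # open until the contract is deprecated
        "schema": "https://schemas.example.com/ad_published.json",  # URL into the producers' schema repo
        "kafka_topic": "behavioural-events.ad_published",
        "personal_data": {                      # optional section
            "fields": [
                {"name": "payload.name", "analytical": True, "mapping": "name"},
                {"name": "payload.email", "analytical": False},  # dropped at ingestion
            ],
            "relation_key": "event_id",         # joins personal and non-personal tables
            "user_id": "user_id",               # enables right-to-be-forgotten deletes
        },
        "sla": {
            "source": "ads-service",            # producing microservice
            "owner": "ads-team",
            "contact": "#ads-team",             # normally a Slack channel or email
            "periodicity": "daily",             # or "hourly"
            "execution_hour": "07:00",
            "time_to_recover": "7d",            # default
            "retention": "90d",                 # in the contract, not enforced yet
        },
        "provider_ids": ["fotocasa", "habitaclia"],
    }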

So this is the lifecycle, right? The producer from the operational world and the data consumer, a data scientist or data analyst from the analytical world, reach a data agreement. Then the data producer defines the schema and stores it in the schema repository.

And here comes the automation that we implemented. The merge of the schema definition into the schema repository triggers a GitHub workflow that creates a contract proposal. We use generative AI mainly to infer the personal fields, and we want to keep improving it.

But so far, with generative AI using AWS Bedrock, we infer and list the personal fields in the contract, and we notify the user once the contract proposal is created, so the producer can iterate over it and finally reach a well-formed data contract. We also have automatic validations, so the producer can check by themselves whether it's okay or not.
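A hedged sketch of that proposal step: asking a model on Amazon Bedrock which schema fields look like personal data. The model id, prompt, and response handling are assumptions; the talk only says they use Bedrock for this.

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

    schema_fields = ["ad_id", "price", "payload.name", "payload.email", "user_id"]
    prompt = (
        "Which of these fields are likely to contain personal data (PII)? "
        "Answer with a JSON list of field names only: " + ", ".join(schema_fields)
    )

    # Converse API; the model id below is a hypothetical choice.
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"]
    pii_fields = json.loads(answer)  # e.g. ["payload.name", "payload.email", "user_id"]
    # These become the pre-filled personal_data section of the contract proposal.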

We as a data platform are just providing oversight; we don't need to be involved here in anything more than being aware if some error comes up. Once the contract is okay and has passed the automatic validations, we trigger several GitHub workflows to schedule it, right?

The first will create the DAG in Airflow to schedule our daily work, using the information from the data contract. Then we will create the data table, depending on whether the schema change is a major or a minor version: if it's a major version, a new table will be created; if it's a minor version, the existing table will be updated.

It will also create the personal data table. And we send metrics, so the producer and us as facilitators can check the creation of the tables and see if everything is right. Then finally, we upload the data contract to S3 so the ingestion process can read it later, and the data consumer merges and activates the data, right?
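A small sketch of the major-versus-minor decision in that table-creation step. The naming convention and DDL are assumptions; on their stack this would presumably be Spark SQL against the lake.

    def table_ddl(event_name: str, old_version: str, new_version: str, columns: dict) -> str:
        """Return the DDL for a schema change: new table on a major bump,
        in-place evolution on a minor bump. Purely illustrative."""
        old_major = old_version.split(".")[0]
        new_major = new_version.split(".")[0]
        cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
        if new_major != old_major:
            return f"CREATE TABLE {event_name}_v{new_major} ({cols})"
        return f"ALTER TABLE {event_name}_v{new_major} ADD COLUMNS ({cols})"

    print(table_ddl("ad_published", "1.2.0", "2.0.0", {"ad_id": "STRING", "price": "INT"}))
    # CREATE TABLE ad_published_v2 (ad_id STRING, price INT)
    print(table_ddl("ad_published", "2.0.0", "2.1.0", {"discount": "INT"}))
    # ALTER TABLE ad_published_v2 ADD COLUMNS (discount INT)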

That merge-and-activate step happens on the data side, because Airflow is in the data platform. So we have the contract proposal; this was the lifecycle. But what is a contract proposal? It's nothing more and nothing less than the same contract, but with some fields left for the producer to fill in: the contact, the periodicity, the execution hour, the time to recover and the retention.

Some of these fields, the PII fields for instance, we infer through generative AI. Others, such as the owner, the source, and the Kafka topic, we gather from the company repositories: we look for the event name, then find the microservice that creates it, and the owner of that microservice is taken as the owner of the data, and so on.

This allows us to have a fairly complete contract proposal; almost everything is complete other than the SLAs. We also recommend that the producer checks the personal fields because, you know, it's a proposal, and they know the data better, right? And then we reach run time. We already have the contract, and Airflow will trigger one execution of the Spark engine per event.

First, the process gets the contracts and filters the live ones based on start and end date. Then we filter by site, event and version, because some topics are multi-event. And then we apply the schema: for every event we are ingesting, we apply the schema, and if something is wrong, no matter what it is, we send the data to the quarantine database.

If the schema is right, we delete the non-analytical personal data right away, based on the information from the contract. And then, again based on the information from the contract, we separate analytical personal data from non-personal data: we generate two data frames and store each one of them in a different table.
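Putting those runtime steps together, here is a condensed PySpark-flavoured sketch: filter live contracts, validate events (quarantining failures), drop non-analytical personal fields, and split what remains into personal and non-personal tables. All paths, table names, and the two placeholder helpers are assumptions.

    from datetime import date
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    today = date.today().isoformat()

    def load_contracts():
        """Placeholder: the real pipeline reads contract files from object storage."""
        return []

    def validate_against_schema(df, schema_url):
        """Placeholder: split df into (valid, invalid) against the schema the
        contract points at. Here everything is treated as valid."""
        return df, df.limit(0)

    for contract in load_contracts():
        # Keep only contracts whose validity window covers today.
        if not (contract["start_date"] <= today <= (contract["end_date"] or "9999-12-31")):
            continue

        raw = spark.read.json(f"s3://landing/{contract['kafka_topic']}/")  # assumed layout
        events = raw.filter(
            F.col("site").isin(contract["provider_ids"])
            & (F.col("event_name") == contract["name"])
            & (F.col("schema_version") == contract["version"])  # topics can be multi-event
        )

        valid, invalid = validate_against_schema(events, contract["schema"])
        invalid.write.mode("append").saveAsTable("quarantine.errors")  # checked by producers

        fields = contract["personal_data"]["fields"]
        drop_cols = [f["name"] for f in fields if not f["analytical"]]  # never lands in the lake
        pii_cols = [f["name"] for f in fields if f["analytical"]]
        relation_key = contract["personal_data"]["relation_key"]
        user_id_col = contract["personal_data"]["user_id"]

        cleaned = valid.drop(*drop_cols)
        personal = cleaned.select(relation_key, user_id_col, *pii_cols)  # restricted access
        non_personal = cleaned.drop(*pii_cols)

        personal.write.mode("append").saveAsTable(f"pii.{contract['name']}")
        non_personal.write.mode("append").saveAsTable(f"analytics.{contract['name']}")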

The non-personal data goes to the non-personal table, and the personal data to our personal data table, where we can apply more secure access control. And obviously there are metrics and alerts and so on; we'll talk about them now, in fact. So, what about observability and alerting? As I said, the process sends metrics, mainly about volumetry, because it's really important for us to detect when there is a sudden spike or a low point in volume, but also error detection: if there is an error of any kind, normally it's due to a schema mismatch.

But it can be any other issue; for instance, if we introduce a bug or something, an alert will be raised and sent to a Slack channel, so anyone in the company, consumers, producers, facilitators, whoever wants to, can check it. Once an error is sent to the alert channel, the producer can check the quarantine database and fix it upstream, or fix the contract, or whatever issue it is.
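A minimal sketch of that alerting hook: when events land in quarantine, post to a shared Slack channel so producers, consumers, and the platform team all see it. The webhook URL and message format are placeholders.

    import requests

    SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

    def alert_quarantine(event_name: str, error_count: int, reason: str) -> None:
        """Notify the shared channel that events were quarantined."""
        text = (
            f"{error_count} '{event_name}' events sent to quarantine ({reason}). "
            "Check the quarantine table and fix upstream, or update the contract."
        )
        requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)

    alert_quarantine("ad_published", 42, "schema mismatch")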

The quarantine database is not for reprocessing; it's just to check the error. The data is always reprocessed from the source. And then we also have alerts about volume: if an unexpected number of events is ingested for a given event, a given contract, we raise an alert, and the producers should check upstream whether it's right or not.

Because it may be right; it may be a special day when there is less business. But we also want observability about cost and value. For cost, we have a pretty good dashboard, I think, where the producer is able to check all of their costs, both in storage and in processing, in Databricks, in AWS, everything.

And there is also an alert if the cost exceeds a given threshold. There is also a dashboard about value. Here we don't have monetary value, as it's still difficult for us to gather that, but we have a pretty good approximation, we think, which is the number of consumers: the processes and people and models and dashboards, everything that is consuming a given table.
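A toy sketch of that cost-versus-value comparison: join per-table cost with the number of distinct consumers and flag tables that cost money but have nobody using them. The numbers and threshold are invented.

    monthly_cost_usd = {"analytics.ad_published": 1000.0, "analytics.ad_deleted": 120.0}
    consumer_count = {"analytics.ad_published": 0, "analytics.ad_deleted": 14}

    for table, cost in monthly_cost_usd.items():
        users = consumer_count.get(table, 0)
        if users == 0 and cost > 500:
            print(f"{table}: ${cost:.0f}/month with no consumers, candidate for review")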

So the producer can check that usage against the cost: this is costing me a thousand dollars and no one is using it, so that's not good, right? Just coming to an end: it has been a two-year journey building this data-contract-driven source ingestion, but we still have some things to do.

For instance, we want to improve the proposal and give the producer enough confidence that they can skip reviewing the contract. We also want to automate merging the contract, because right now the producer checks the proposal, fixes it, and then merges the contract. And we also want to automate deploying and activating the data so it can be consumed.

If we achieve all of that, we will be removing every human interaction from the lifecycle of a contract, and we think that will be a great achievement for the company, for the adoption of data contracts and everything. It's not because we hate humans; it's because humans are prone to error.

They may have other issues to work on that are more important, or something like that. So if we can avoid human interaction in this lifecycle, we think the user experience will be really improved. As a conclusion, what have we achieved with this implementation? We achieved centralized, unified management of personal data, which is a great advance for us.

We separate private data from non-private data, so we can apply different access controls to the private data. We have schema validation, both at definition time and at run time. That may look pretty obvious, but we didn't have it before, so it's an achievement for us. We are ingesting only the events that have an analytical purpose.

This is where we are cutting costs for the company: if an event has absolutely no analytical purpose, then simply by not creating a contract, it won't be ingested. And the producer can check the cost and who is using the data, our approximation of value, to be able to compare cost with value.

And just some numbers here. We went from an unknown percentage of personal data identified in the data platform, because that identification fell on each of the producer teams and there was no centralized way of knowing it, to 40% of personal data identified in the company right now, and 65% of producers are aware of the value of their data.

This is not 100% because there are still events in the old pipeline; we are still migrating them to the new one. We are migrating them at a good speed, but there are still a lot left. And, checking Jira, we went from two days per week fighting issues with data to half a day per week, which is a huge improvement for us.

There is also less time spent by the analysts cleaning and preparing the data, and we hope this time drops further once the migration of all the events is complete. And that's it. Thank you very much. I hope you enjoyed it, and if there are any questions, please let me know. All right.

So let's get into these questions, man. We've got some hot ones coming through. To quantify the benefits of data contracts, roughly what percentage of data quality issues were minimized after applying the data contracts? I think you may have put that on the last slide with those percentage numbers.

Yeah. I didn't put a percentage of data issues, I think, but going from two days per week fighting issues to half a day per week is our metric for that. Yeah. Okay. And as you're looking at that, I see all these numbers and I instantly think to myself: are there other key numbers where you feel like, wow, this is a huge success, that you didn't put on this slide because you probably didn't want to keep going and make the slide too big?

Any other number? Mm, I don't know. I think the main number for us is the time spent fixing issues, both for us and for each of the teams on every site, which is not here, but every engineering team, every data team on any of the sites, in real estate or jobs or whatever.

They also spend less time now fixing issues with the data than before. Alright, let's go to the next question. Moed is asking: is implementing the data contract a trade-off of data quality against data latency? For a batch pipeline, more time executing the pipeline since we're validating the quality of the data; for a real-time pipeline, latency and data availability at the destination, since we're validating the quality. It's an amazing question.

I don't think there's really much more latency from validating the schema at run time. But there is a danger of latency at definition time, because before, we ingested every topic: as soon as a topic was created and had data, it would be ingested.

But right now we need to create the contract, which is somewhat automated, not everything, but there is a bit of automation; we need to create the contract and validate it, then create the DAG, validate the DAG, deploy the DAG. So at definition time, there is absolutely a risk of higher latencies.

That's why we want to avoid human interaction in the loop, so that as soon as the schema is defined, the contract is available in a couple of minutes and deployed to production in, I don't know, 15 minutes maybe, or half an hour. There's another one coming through about how data contract versioning should be handled.

The end date for the contract has not elapsed yet, but there is some change that the producer wants to make to the contract. Is it typical to have two valid contracts, perhaps producing two different data sets for some overlapping period? That seems like a high-cost solution, but ignoring it, or editing the contract after creation, seems equally bad.

It is absolutely valid for us to have several versions of the same contract, for the same event, alive at the same time. Because when they change an event in the operational world, when they evolve an event, it is not, how do you say, an immediate change. They need to perform some tests, work through some issues; they evolve the frontend and the backend, whatever. Yeah.

They evolve the front and the back, whatever. Yeah. So, uh, we have, uh, for some events, several contracts, well, two of two contracts, maybe two version of the company alive at the same time. Hmm. Okay. So it's okay. You just said something that I haven't heard before. It's okay to have various contracts on one event.

Yes, at least for a transition period. Mm-hmm. In our case, we have it. In the end, it is true that the old version of the event will be deprecated and the contract will need to be updated. But for some of the events there is a transition period, and we will be ingesting a couple of versions of the event.

Yes. Awesome, man. Well, let me keep this going. How would you differentiate between master data management and data contracts? Whoa, I didn't get that. I'm not sure I fully understand that either. Let me see here: how would you differentiate between master data management and data contracts? If that doesn't make sense to you,

we can keep moving, because I've got more questions for you. Can we keep moving? Yeah, we'll let Olay clarify what they're saying in the chat. I'll ask you this next one, which is: who drove the change for data contracts within the organization? Well, I think in our organization the change was driven by both the producers and the data platform team.

I think we were the ones suffering the issues with the old model, and we were the ones pushing for data contracts. But each of the teams needs to be involved, so we had a lot of meetings about it, a lot of gathering of ideas. It's not something that we as a data platform imposed on the teams, because that never works across teams.

So we met with them, gathered ideas, and started pushing it. Yes. Do you think that you would have different metrics and things that the project cares about if it wasn't you all spearheading this, like if the software engineers decided to do it, or if someone else in the business decided to do it?

Yeah, yeah. I can't offer any other metric right now, but I'm pretty sure that, in the end, every team knows their own issues and needs, and we adapted the approach, maybe without realizing it, mainly to our needs. That's why we spent a lot of time talking with producers, consumers, and so on, to gather their needs and requirements.

Yes. Excellent. Thank you so much for coming on here, chatting with us.

This has been absolutely brilliant, man."