Talk Description

High-quality, governed, and performant data from the outset is vital for agile, trustworthy enterprise AI systems. Traditional approaches delay addressing data quality and governance, causing inefficiencies and rework. Apache Iceberg, a modern table format for data lakes, empowers organizations to "Shift Left" by integrating data management best practices earlier in the pipeline to enable successful AI systems.

This session covers how Iceberg's schema evolution, time travel, ACID transactions, and Git-like data branching allow teams to validate, version, and optimize data at its source. Attendees will learn to create resilient, reusable data assets, streamline engineering workflows, enforce governance efficiently, and reduce late-stage transformations—accelerating analytics, machine learning, and AI initiatives.

Additional Shift Left Data Conference Talks

Shifting Left with Data DevOps (recording link)

  • Chad Sanderson - Co-Founder & CEO - Gable.ai

Shifting From Reactive to Proactive at Glassdoor (recording link)

  • Zakariah Siyaji - Engineering Manager - Glassdoor

Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)

  • Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain

Panel: State of the Data And AI Market (recording link)

  • Apoorva Pandhi - Managing Director - Zetta Venture Partners
  • Matt Turck - Managing Director - FirstMark
  • Chris Riccomini - General Partner - Materialized View Capital
  • Chad Sanderson (Moderator)

Wayfair’s Multi-year Data Mesh Journey (recording link)

  • Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
  • Piyush Tiwari - Senior Manager of Engineering - Wayfair

Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)

  • Sarah McKenna - CEO - Sequentum

Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)

  • Barr Moses - Co-Founder & CEO - Monte Carlo
  • Tristan Handy - CEO & Founder - dbt Labs
  • Prukalpa Sankar - Co-Founder & CEO - Atlan
  • Chad Sanderson (Moderator)

Shift Left with Apache Iceberg Data Products to Power AI (recording link)

  • Andrew Madson - Founder - Insights x Design

The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)

  • Mark Freeman - Tech Lead - Gable.ai

Building a Scalable Data Foundation in Health Tech (recording link)

  • Anna Swigart - Director, Data Engineering - Helix

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)

  • Abhi Ghosh - Head of Data Observability - Capital One

Panel: How AI Is Shifting Data Infrastructure Left (recording link)

  • Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
  • Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
  • Carly Taylor - Field CTO, Gaming - Databricks
  • Chad Sanderson (Moderator)

Transcript

*Note: Video transcribed via AI voice-to-text; there may be inconsistencies.

" We are coming up on our next talk, so I'm gonna bring Andrew to the stage now. He's going to be talking as you can see on this nice little card all about shifting left with Iceberg. How are you doing Andrew Demetrios? I'm doing very well, thank you. Well, I can appreciate a good mic. I see also some practical lights back there.

It looks like it's not your first time on camera. I'll let you share your screen and get rocking with this talk. Awesome. Thank you very much. And I'll keep it quick 'cause I know we're on a schedule here. I don't wanna mess you up. That's it, I have one job. Well, thank you so much for having me. Um, so I'm Andrew Madson.

I'm head of Evangelism and Education at Topeka, uh, which sponsors open source projects. I'm also the founder of Insights x Design. And one thing that we can't get rid of is AI. It's everywhere, right? And so I'd love to just explain a little bit more about that and how shifting left really does power AI.

Just as a little bit of context, I've been in data for quite a while as a data leader, um, leading data teams of all different sizes. I'm a professor of, uh, graduate data science programs at a bunch of different places. If you want an A, take my course, I'm a super easy grader. Uh, and feel free to connect with me on, uh, social media to continue the conversations there.

Got an open source Slack, and if you want a free book about the technology that we'll be talking about, you can get that there. So what's the opportunity like? We see AI all over the place. We can't really get rid of it. AI is constantly in our face. And now we have vibe coding, which I love. I love that they came up with the term vibe coding. But why is AI really everywhere?

And Jeff Bezos talks about how solving problems with this new technology is really helping us envision new ways of working, new problems that we can solve, uh, things that we hadn't even thought of before. And, you know, Gartner is reporting that at least 77% of organizations, uh, and they just updated that, that's why my numbers don't match, it used to be 75, but now they've said 77, rank AI as one of their top five investments.

So there's really money being poured into AI infrastructure. And why? When it's a top-down initiative, similar to moving everything to the cloud or the digital transformation, uh, which was a top-down, board-driven initiative, there's always something behind it.

What are we trying to get out of it? Well, innovation. 78% of executives believe AI will have a very high impact on innovation, productivity, and of course, money, right? When things are coming top-down, we expect that revenue is somewhere behind it. But when we say AI, what do we actually mean? Uh, you know, I'm old enough to remember when AI really was, uh, an all-encompassing, uh, category that included machine learning.

It was traditionally deep learning for a while, but now often when, uh, we hear AI, folks are talking about generative AI. And we see in the news all these models, all these new models coming out, DeepSeek's coming out. Uh, and when I was at different organizations leading AI teams, we were, you know, I'm guilty of this with being a data scientist myself, really focused on like, what's the best model we can choose for this situation?

How can we optimize the model? How can we track model drift? How can we improve it? Should we do this one? Should we do that one? Oh, now here's a new one. Let's do that. Super model focused. And the model's incredibly important, but just as important is the data. The data and the model are together your AI system.

And you need to have the exact same amount of rigor and process in place for the data as you do with your model. And the data often gets left behind. And we just hope, you know, we have hopes and prayers and dreams that our model will be so awesome that it'll overcome our crappy data. But that doesn't work.

It doesn't happen. So this is why shifting left really will help drive AI, and really analytics and data science as a whole, forward.

And just as an example of this, this comes from a, a paper, which I can link later. I'm embarrassed I didn't put the link here, I apologize for not attributing it. It comes from Rice University, just looking at how the models from GPT-1 through ChatGPT and GPT-4 progressed. If you remember GPT-1, if you even had a chance to play with it: not very good, uh, not super intuitive.

It could do something, great. You know, transformer models were fairly new. Uh, awesome. Trained on less than five gigabytes of raw data. Then, to get to the second iteration of the model, 40 gigabytes of human-filtered data. It comes out, and what can it do now? It can do a little more, it can do some math. It's a little more useful.

We get to GPT-3, which is heavily curated: 570 gigabytes, so a wide scope, a wide amount of data, heavily curated down from 45 terabytes of raw data. And GPT-3.5, I would argue, is really the turning point where folks, the public as a whole, really started saying, oh, I can see a huge benefit to this generative AI thing.

Then we get to four. Now we've got human labeling and annotations, and it's incredibly helpful. But between one and four, what changed? All of the models are fairly similar in that they're transformer models. What really got us from one to four? Well, we increased the data size and the data quality, which are incredibly important.

And I believe that's the same for most organizations: if we focus not only on the model, which is insanely important, getting that model right, but also on improving the size and quality of our data, we'll be able to then have a robust AI system that will achieve those goals that we looked at earlier.

Innovation, improving efficiency, driving revenue. You have to have both the data and the model. And often, you know, we're trying to optimize a system that's not quite mature yet. Uh, I love Donald Knuth, the father of the analysis of algorithms: premature optimization is the root of all evil. I feel like we often prematurely try to optimize our models or our systems before we really have those robust practices in place.

The data contracts, the shifting left, and understanding our organizational needs to have a robust AI system on the back end. So Gartner, just recently, a month ago, came out and said that they're predicting that 60% of all generative AI projects started in 2025 will fail by 2026, specifically because of data quality.

And, you know, anytime I see a statistic that ends in a five or a zero, I'm immediately skeptical, but we'll take it. And we've heard the oft-quoted but never attributed statistic that, you know, 90% of data projects fail. And why do they fail? Well, if you've worked in data longer than five minutes, you know that there's data silos, there's poor data quality, insufficient metadata.

Finding data and getting access to it is a problem. You know, if you're looking for a specific field, if you're looking for customer SSN or customer address, you type an address into DBeaver or whatever you're using, maybe you're using a data catalog, and then a thousand tables pop up and they have different information.

Well, shoot, which one's accurate? Which one's the right address? Uh, governance can either be too strong, um, in that it's rigid and you can't get access to the data, or it's insanely slow, or it's non-existent. Maybe we've, uh, gone too far the other way and there just is no governance. So there's challenges with making our data quality high, and shifting left solves a lot of those problems, both organizationally and technologically.

Why shift left? Well, there's a lot of benefits to it. So, shifting left. If you think about what a traditional data pipeline looks like now, in general, uh, let's take an ETL pipeline. So we're doing extracts from source systems. We're modifying the data, uh, via Spark or some other engine, we're landing it somewhere, and we're calling that our bronze layer.

Then we're transforming it again and we're landing it somewhere else, and we're calling that our silver layer. Then we're probably transforming it again and maybe even landing it somewhere else, calling that our gold layer. And often we find the issues either in the silver or gold layer. Well, by then, you know, it's at the very end of the pipeline that we're finding these issues.

Well, that's not very good. Or even worse, maybe it gets all the way to the stakeholder and then we discover, oh shoot, this data's wrong, we've got a problem. Uh, and then we have to work that all the way backwards to the left to find where our data quality went wrong. So, in a shift-left model, we can implement early data quality checks.

Um, for example, you know, think of a column missing from sensor data. If we have a data contract in place and/or some tests, we can identify that super early, and the folks who are working with it and responsible for it, often data engineers closely associated with the business, can fix it right away.
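
The early check described here can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular tool's API; the column names and contract rules are hypothetical, and in practice this logic would live in a data contract spec or a testing framework at the ingestion (bronze) step.

```python
# Hypothetical sensor-data contract: catch violations at ingestion,
# not after the data has flowed downstream to the gold layer.
REQUIRED_COLUMNS = {"sensor_id", "timestamp", "temperature"}

def validate_record(record: dict) -> list:
    """Return a list of contract violations for one incoming record."""
    errors = []
    missing = REQUIRED_COLUMNS - record.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if "temperature" in record and not isinstance(record["temperature"], (int, float)):
        errors.append("temperature must be numeric")
    return errors

def ingest(records: list) -> tuple:
    """Split records into accepted rows and rejected rows (with reasons),
    so bad data is quarantined at the source instead of discovered late."""
    accepted, rejected = [], []
    for rec in records:
        errs = validate_record(rec)
        if errs:
            rejected.append((rec, errs))
        else:
            accepted.append(rec)
    return accepted, rejected
```

Wiring a check like this into the ingestion job means the owning team sees the failure immediately, rather than tracing a broken dashboard back through silver and bronze after the fact.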

They can have continuous testing and validation. If you think back to, you know, what expanded ChatGPT from one to four: quality and quantity, right? So that continuous and automated testing and validation is essential for a successful AI system. You can collaborate better in a shift-left model, um, 'cause you're involving data experts, data engineers, data stakeholders early, not at the very end.

Uh, when we discover, oh, shoot, our dashboard's wrong, our model's wrong, we've got drift everywhere, and we can't quite figure out why, 'cause something happened early in the pipeline. Uh, it helps you avoid last-minute surprises, uh, that nobody wants, and those could also impact your AI accuracy. We track model drift, but we also need to track data drift, and especially early in the pipeline.

It's super helpful for, especially in AI, faster model development, because if your data's consistently reliable, you can train and deploy AI models quickly and with fewer rounds of testing and revision. You don't have to pause because of unexpected data errors and discrepancies later in the pipeline. So shifting left lets you treat data with the same level of care that you give the AI model itself, and the result for you is a smoother path from raw information to insight, which really allows the AI projects to stay on schedule, in budget, and produce more accurate outcomes, everything people want, right?

And that's really the shift left, and that's why I think it's so important. But also, commonly, I'm seeing a lot of organizations consider how they're structured technologically and organizationally, and a lot of them are federating. Uh, analysts have often been federated for a while, meaning that they're more aligned with the business units.

And now data engineers are also expanding to where they are more aligned with business units, um, either as an analytics engineer, uh, or a data engineer themselves. And this really helps us shift left and create that smooth path. So you can do this with open source products, right? And I believe that the data product structure, meaning that you've got accountability, discoverability, ownership, lineage, uh, quality, just that structure of a data product, that idea of a data product, you know, building it organizationally and technologically, is the best way to power AI.

Why? You're gonna deduplicate data pipelines. You can build once, use across the whole enterprise. Add your data quality checks early, which we looked at, uh, just a second ago, about why that's important. So having a data product framework. But what's the best way to do a data product if you're a very distributed company, meaning your data estate is very distributed? You've got this warehouse and this warehouse and this database and this data lake. It's perhaps difficult to integrate that data and create a robust data product, especially as our deliverables on the back end require more data from across these different sources.

This is where Apache Iceberg comes in. Keeping an eye on the time, Demetrios, don't you worry, I'll wrap it up. So Apache Iceberg is an open source table format, meaning that on your data lake you can take your file formats, like Parquet, and organize them as tables. If you just told your data analyst, hey, all these JSON files sitting in this data lake are really a table, they can't know that, they can't see that, they can't work with them.

Iceberg allows you to organize your data on your data lake as a table, which helps you create that one central repository, which is super helpful for creating, uh, data products, which are super helpful for powering AI and data deliverables. You know, you can have an open lakehouse, meaning that there's lots of open source technologies: different engines, Spark, Flink, Trino; open table formats.

I'm highlighting Iceberg, but there's also Hudi, there's Delta Lake. And then open metadata catalogs: you've got Project Nessie, Apache Polaris, lots of catalogs out there. So you could have an open lakehouse architecture to free you up. It really helps prepare you for this AI-ready data because it gives you that unified data foundation and integration.
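
As a mental model of what a table format adds on top of raw files, here is a toy sketch in plain Python. This is emphatically not the real Iceberg API or metadata layout (Iceberg uses manifests, manifest lists, and a catalog); it only illustrates the two ideas above: the "table" is a metadata layer pointing at data files, and every commit is an immutable snapshot, which is what makes time travel and cheap schema evolution possible.

```python
class ToyTable:
    """Toy stand-in for a lakehouse table format. Data files are never
    rewritten; each commit appends an immutable (schema, files) snapshot."""

    def __init__(self, schema):
        self.snapshots = []
        self._commit(schema, ())

    def _commit(self, schema, files):
        self.snapshots.append((list(schema), tuple(files)))

    def append_files(self, *new_files):
        # New data arrives as files; the table just tracks them in metadata.
        schema, files = self.snapshots[-1]
        self._commit(schema, files + new_files)

    def add_column(self, name):
        # Schema evolution: a metadata-only change, no data files touched.
        schema, files = self.snapshots[-1]
        self._commit(schema + [name], files)

    def read(self, snapshot_id=-1):
        # "Time travel": read the table as of any earlier snapshot.
        return self.snapshots[snapshot_id]
```

So telling an analyst "these Parquet files are a table" becomes true once a metadata layer like this exists: an engine can ask the table which files and schema are current, or query an older snapshot of the same data.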

This really, I've seen in the wild, really helps for model training. But you may be asking yourself, what about vectors or graphs? Well, let's skip there. We gotta keep Demetrios happy. So I call this core and explore, and this is what I've seen work, um, for companies who are pursuing this type of approach: they make the lakehouse, an open lakehouse format, the center of their data architecture.

They really are trying to centralize their data in one spot so that they can do their transformations directly on the lake. We can see all the transformations right there. It's controlled. Uh, then we've got ownership. It's not moving around as much. We're avoiding egress fees, all those great things, right?

And then you can explore. If you need a time series database, for instance, then you need one. There's certain functions that a time series database does that other databases don't do. Okay, get a time series database for that use case. You may need vector or graph databases. Great, but those are periphery to your central core that's running your analytics and your AI operations.

And then you've got these one-off solutions as needed, but it helps you track it, uh, more closely, helps with ownership, and reduces the size of your data estate while remaining flexible for those edge use cases. So that's my presentation. I tried to keep it on time, Demetrios. But, you know, essentially AI requires both a robust model and robust data.

And shifting left will allow you to build data products and increase the quantity and quality of, well, increase the quality of your data, and then by moving towards an open source data lakehouse, that helps you increase the quantity of your data. So that open stack combined can really help you, uh, power AI, ML, and really analytics. Whew, I went fast, Demetrios.

All I just gotta say is, uh, thank you for keeping the time, and obviously awesome talk. Uh, this may be a little weird, but I'm going to say it anyway. I'm gonna go out on a limb. Have you thought about some data ASMR before? I haven't, but now that's all I can think about. Gimme your YouTube channel. That's all I need right now.

That is what I'm looking for. That's right. But I'm sure people got real questions coming through, uh, about this talk, and I thoroughly enjoyed it. I know that there were some, um, you know, you showed some stats up there where it was like, if you see a round number, right, it kind of makes the red flags go up in your head.

Definitely. And I'm, I'm the same way where I'm just like, where are we getting these stats from? Where is this data? And I think one of the hardest things with AI in general is that a lot of the quote unquote benefits that we're seeing out there are self-reported. Mm-hmm. And so you have these surveys that go around the company and they say, are you using AI?

And how much benefit are you getting? And it's like, uh, a lot, I guess. Yes. Yeah. The top three use cases, I think this is from, um, Forrester, so don't quote me on that, I think it's from Forrester though. But the top three use cases from their recent survey were, one, coding copilot. Yeah. Um, you know, that makes sense.

Within, at an enterprise level, so that's who they surveyed, enterprises. Coding copilot. Number two was, um, automation, so just automating rote, uh, processes and procedures. Great, we all love that. And then third was as a chatbot. Mm-hmm. So those are, right now, the top three use cases for AI within an organization.

But I think the rollout of agentic AI actually makes it much easier to find use cases to roll out. I think agentic AI is actually a lot easier to roll out than trying to do a full, uh, you know, fine-tuned or RAG model within your organization. One thing that I'm banging the drum about all the time is the chat and the prompting is not the way that I enjoy interacting with any type of tool, really.

I don't want to have to know if the AI didn't do what I wanted it to do because I am not a level-five sorcerer in prompting. I just want to know, did it not run because of something that I can change, or is it inherently not possible? The prompt engineering side. So you're not a vibe coder?

Demetrios, despite looking like it, I am not a vibe coder. You know, you bring up a good point, 'cause it's a skill gap. Um, you know, if you kind of know what you're doing with programming and coding, then AI can definitely be a very great efficiency for you. Um, it can help you, it can help you learn new languages, those types of things.

But it's because you're applying some of that foundational knowledge that you already know. Like, oh, how do I do this function? You know what that should look like; it's helping you. If you don't have that base foundation and you start relying on AI to do all of your coding, uh, that can be a big challenge.

So you really kind of want to know, step by step, what these different pieces do before plugging into AI and then having that as a copilot. Actually, that kind of ties into a bit of your talk on the data side of things, and I've got a friend who likes to talk about how we have to know what good looks like so that we can catch when it's not good.

And to know what good looks like, you almost have to be that expert. And if you're trying to shift left AI or get closer to the source, you gotta know what exactly you're trying to go for there. Yeah, that's true. And, you know, generic use cases AI really crushes at, but if you have a more nuanced or bespoke solution that you really need to accomplish, it becomes even more important that, to your point, you need to know what good looks like, because the AI is more likely to screw up on an edge case and you really need to know, did it or did it not screw up?

I treat it like a junior programmer and I still need to check the code. But, you know, if the programmer's doing awesome, it saves me the time; I don't have to write it, but I still need to kind of check the code and make sure it's all good. So what are your thoughts regarding data scientists and data engineers navigating so many new tools in the field?

For example, investing time with specific databases.

Can you gimme an example with databases? What kind? Yeah, so trying to invest time to get familiar with specific databases such as MongoDB, Iceberg, and many new ones coming out every day. How do you look at which ones to play around with and really go deep on, versus ones where you just say, oh, that's cool, and pass by?

Yeah, and I don't think there's a magic eight ball as to what you should be doing, but if you have engineer in your title, you've signed up for a lifetime of learning; we're gonna constantly be learning. Um, you know, there's some technologies that have been around for a long time that I've never had to use because of use cases. But I think if you see something and, um, you think you will likely use it in the future, you're exploring it, you're interested in it, then you really should.

Start digging into it. Apache Iceberg's an example. Um, really, if you're already in the field, um, looking at new use cases and trying to improve operational efficiency and effectiveness, that will include this kind of learning. Hmm. Incredible, man. Well, we're gonna keep it moving on."