For the past two decades, data engineering has evolved at breakneck speed. We went from rigid, structured relational databases to vast, schema-less lakes of unstructured data. We praised this newfound flexibility, hailing “schema on read” as the ultimate answer to handling diverse and fast-growing data volumes.

In hindsight, we were wrong.

Schema on read was a necessary evil given the explosion of data sources, formats, and use cases, but implementing it before evolving our approach to data management was a strategic mistake. It created an industry-wide problem where developers were freed from the constraints of a predefined schema but left with no guardrails to ensure their data was actually usable.

The result? A brittle, costly, and inefficient data landscape where downstream consumers are constantly firefighting broken pipelines, missing fields, and incompatible changes.

This isn’t just an academic debate. It’s the modern equivalent of “it compiled” before unit tests and CI/CD. The same problem that plagued software engineering in the pre-testing era—code that “worked” in that it ran but failed spectacularly in production—is now rampant in data engineering.

The Database Used to Be a Data Contract

Before schema on read, we lived in a world of strict, enforced schemas. If you were writing to a relational database and attempted to insert a row with ten fields into a table with only four columns, it wouldn’t work. If a field had a NOT NULL constraint, you couldn’t insert a record without that field. If a foreign key constraint was violated, the database stopped you cold.
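To make that concrete, here is a minimal sketch of that immediate feedback (SQLite is used purely for brevity; the table and column names are illustrative, and any relational database behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement

conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total   REAL NOT NULL
    )
""")

# NOT NULL violation: the write is rejected at the point of insertion.
try:
    conn.execute("INSERT INTO users (id, email) VALUES (1, NULL)")
except sqlite3.IntegrityError as exc:
    print(f"Rejected: {exc}")  # NOT NULL constraint failed: users.email

# Foreign key violation: you cannot reference a user that does not exist.
try:
    conn.execute("INSERT INTO orders (id, user_id, total) VALUES (1, 42, 9.99)")
except sqlite3.IntegrityError as exc:
    print(f"Rejected: {exc}")  # FOREIGN KEY constraint failed
```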

In other words, the database was a poor man’s data contract and unit test, baked directly into the data-writing process. Developers had immediate feedback—their code was either compatible with the schema or it wasn’t. They didn’t get to just write data in whatever format they wanted and let someone else figure it out later.

The move to schema on read changed this dynamic entirely. Developers could now write whatever they wanted into a data lake or NoSQL store. No constraints, no validation, and no immediate feedback. As long as the system accepted the write, everything appeared to work.

The problem? This was only true for the writer.

“It Works Because It Wrote the Data” Is Not Good Enough

Just because a system allows you to write data doesn’t mean that data is usable. But this fundamental principle was lost in the transition to schema on read.

Developers no longer had to think about downstream consumers when writing data. They didn’t need to worry about whether a field they renamed would break a dozen reports or whether a change in data structure would crash a production job. Data teams were left holding the bag, expected to somehow clean up the mess later.

This is the “works on my machine” problem, but for data.

In software development, we solved this problem with automated testing, CI/CD pipelines, and deployment validation. We realized that just because code runs doesn’t mean it’s correct. Yet, in data, we’re still stuck in a world where developers write whatever they want, and data teams have to hope it all magically works out downstream.

Spoiler: It doesn’t.

The Cost of Post-Hoc Data Validation

Schema on read put the burden of validation after the fact—not at the point where data is generated, but when it is read. This was a fundamental inversion of responsibility, and it came at an enormous cost:

  • Endless firefighting – Data engineers spend their time reacting to broken pipelines, missing fields, or unexpected data formats rather than proactively ensuring quality.
  • Slow time to insight – Analysts and data consumers have to reverse-engineer the meaning of poorly documented or inconsistently formatted data.
  • Brittle data infrastructure – A simple upstream change, like renaming a column, can silently break a dozen downstream consumers because there was no enforcement at write time.
  • Hidden data debt – Every ad hoc transformation, undocumented schema change, or patched-up pipeline adds to the complexity of the data ecosystem, making it harder to maintain over time.

According to Gartner, poor data quality costs organizations an average of $12.9 million per year in operational inefficiencies, lost revenue, and regulatory risks.

This is the equivalent of hoping that a small team of QA engineers can magically fix all the bugs in an application after it’s been written rather than preventing them at the source.

A Real-World Example: Glassdoor’s Data Crisis

This isn’t just a theoretical issue—it has real consequences for large organizations. Take Glassdoor, for example.

For years, Glassdoor’s data platform relied on fragmented, ad hoc solutions to manage data quality across various departments. Their data engineering teams, split between B2B, B2C, and GTM functions, struggled with inconsistent data ownership and poor visibility into lineage.

The result? Critical executive dashboards surfaced inaccurate data, and the CEO himself started questioning pipeline issues.

One of Glassdoor’s attempts to fix the problem involved anomaly detection—but it was built as a weekend hack project by an intern and never evolved. Without a systematic, proactive approach to data quality, the company was stuck in a cycle of reactive firefighting.

The turning point came when the engineering team realized they needed a structured, enforceable handshake between data producers and consumers. That’s when they discovered data contracts.

By shifting validation left and enforcing data contracts at the producer level, Glassdoor eliminated the guesswork, reduced broken pipelines, and established a scalable framework for reliable data.

Data Contracts: The Missing Piece

What’s the solution? Shift left.

Just like software engineering had to shift from “test after deployment” to unit tests and CI/CD before deployment, data engineering must move validation upstream to the data-generating code itself.

This is where data contracts come in.

A data contract is essentially a distributed unit test for data. It enforces:

  • Schemas – Defining what fields exist, their types, and any constraints.
  • Compatibility checks – Ensuring that changes won’t break downstream consumers.
  • Validation rules – Catching anomalies before they propagate into production.

With a data contract in place, you can’t break downstream consumers any more than you could violate a SQL constraint in a traditional relational database.
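Here is a minimal sketch of what producer-side enforcement could look like. It uses pydantic as the validation layer, but that is only one of many libraries that could play this role; the event name, fields, and rules are hypothetical, and in practice the contract would live in a shared registry rather than inside the producer’s code:

```python
from datetime import datetime, timezone
from pydantic import BaseModel, Field, ValidationError


# Hypothetical contract for an "order_created" event: the schema, types, and
# validation rules are declared once and enforced before anything is written.
class OrderCreated(BaseModel):
    order_id: str = Field(min_length=1)
    user_id: int = Field(gt=0)
    total_usd: float = Field(ge=0)   # validation rule: no negative totals
    created_at: datetime             # must parse as a timestamp


def publish(event: dict) -> None:
    # Validate against the contract *before* the write, not after the read.
    record = OrderCreated(**event)   # raises ValidationError on bad data
    # ... write the validated record to the lake / topic / table ...
    print("accepted:", record.order_id)


# A well-formed event passes.
publish({"order_id": "A-1001", "user_id": 7, "total_usd": 19.99,
         "created_at": datetime.now(timezone.utc)})

# A malformed event is rejected at the producer, with immediate feedback.
try:
    publish({"order_id": "A-1002", "user_id": 7, "total_usd": -5})
except ValidationError as exc:
    print("rejected:", len(exc.errors()), "contract violation(s)")
```

The key property is the same one the relational database gave us: the producer gets the error at write time, instead of a consumer discovering it weeks later.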

Shifting Left: Fixing the Mistake of Schema on Read

The industry went about this all wrong. We embraced schema on read without first building the equivalent of CI/CD for data. We removed the enforcement of schemas without replacing them with something better. Now we’re drowning in brittle, unmanageable data.

It’s time to correct that mistake by making data contracts a first-class citizen in data engineering:

  • Define schemas at the point of data generation – Before data is written, it should be validated against a predefined contract.
  • Enforce compatibility – Changes should be versioned, tested, and verified before they go live (see the sketch after this list).
  • Automate data validation – Just as we have automated tests for code, we need automated checks to ensure data meets expectations.
  • Integrate contracts into the development workflow – Data validation should be as fundamental as compiling code or running unit tests.
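As a sketch of what the compatibility check could look like in CI, the snippet below compares a proposed contract against the currently published one and fails the build on changes that would break existing consumers. The schema representation and rules are illustrative assumptions, not any particular tool’s format:

```python
# Hypothetical CI check: block a contract change that is not backward
# compatible for existing readers.

PUBLISHED = {"order_id": "str", "user_id": "int", "total_usd": "float"}
PROPOSED  = {"order_id": "str", "user_id": "int", "amount_usd": "float"}  # renamed field


def breaking_changes(published: dict[str, str], proposed: dict[str, str]) -> list[str]:
    problems = []
    for field, ftype in published.items():
        if field not in proposed:
            problems.append(f"removed/renamed field: {field}")
        elif proposed[field] != ftype:
            problems.append(f"type change on {field}: {ftype} -> {proposed[field]}")
    # Adding new fields is allowed: that is backward compatible for readers.
    return problems


if __name__ == "__main__":
    issues = breaking_changes(PUBLISHED, PROPOSED)
    if issues:
        raise SystemExit("Incompatible contract change:\n  " + "\n  ".join(issues))
    print("Contract change is backward compatible.")
```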

Conclusion

Schema on read wasn’t inherently bad—it was necessary given the explosion of data sources and formats. But the way we implemented it—removing constraints before establishing a replacement—was a mistake that has cost us dearly.

Glassdoor’s story shows this in action: by enforcing data contracts at the producer level, they eliminated broken pipelines and data ambiguity.

Gable makes this process seamless, automating contract enforcement at the application layer—so data teams never have to firefight broken pipelines again.

📘 Get the step-by-step framework for implementing data contracts at scale.
👉 Download The Data Leader’s Guide to Implementing Data Contracts as Code