Talk Description

In healthcare technology, protecting patient privacy while scaling data operations requires reimagining where quality and governance live. This presentation explores Helix's journey of shifting critical processes left in its precision medicine business, from implementing automated data classification and privacy workflows to enlisting cross-functional expertise to refine operational processes. For clinical data management, we've partnered with healthcare systems to implement OMOP standards and data contracts at the source, creating a robust foundation for research and commercial opportunities. Through practical examples, we'll demonstrate how this upstream approach has transformed our data operations, encouraged internal alignment, and strengthened partner relationships.

Additional Shift Left Data Conference Talks

Shifting Left with Data DevOps (recording link)

  • Chad Sanderson - Co-Founder & CEO - Gable.ai

Shifting From Reactive to Proactive at Glassdoor (recording link)

  • Zakariah Siyaji - Engineering Manager - Glassdoor

Data Contracts in the Real World, the Adevinta Spain Implementation (recording link)

  • Sergio Couto Catoira - Senior Data Engineer - Adevinta Spain

Panel: State of the Data And AI Market (recording link)

  • Apoorva Pandhi - Managing Director - Zetta Venture Partners
  • Matt Turck - Managing Director - FirstMark
  • Chris Riccomini - General Partner - Materialized View Capital
  • Chad Sanderson (Moderator)

Wayfair’s Multi-year Data Mesh Journey (recording link)

  • Nachiket Mehta - Former Head of Data and Analytics Eng - Wayfair
  • Piyush Tiwari - Senior Manager of Engineering - Wayfair

Automating Data Quality via Shift Left for Real-Time Web Data Feeds at Industrial Scale (recording link)

  • Sarah McKenna - CEO - Sequentum

Panel: Shift Left Across the Data Lifecycle—Data Contracts, Transformations, Observability, and Catalogs (recording link)

  • Barr Moses - Co-Founder & CEO - Monte Carlo
  • Tristan Handy - CEO & Founder - dbt Labs
  • Prukalpa Sankar - Co-Founder & CEO - Atlan
  • Chad Sanderson (Moderator)

Shift Left with Apache Iceberg Data Products to Power AI (recording link)

  • Andrew Madson - Founder - Insights x Design

The Rise of the Data-Conscious Software Engineer: Bridging the Data-Software Gap (recording link)

  • Mark Freeman - Tech Lead - Gable.ai

Building a Scalable Data Foundation in Health Tech (recording link)

  • Anna Swigart - Director, Data Engineering - Helix

Shifting Left in Banking: Enhancing Machine Learning Models through Proactive Data Quality (recording link)

  • Abhi Ghosh - Head of Data Observability - Capital One

Panel: How AI Is Shifting Data Infrastructure Left (recording link)

  • Joe Reis - Founder - Nerd Herd Education (Co-author of Fundamentals of Data Engineering)
  • Vin Vashishta - CEO - V Squared AI (Author of From Data to Profit)
  • Carly Taylor - Field CTO, Gaming - Databricks
  • Chad Sanderson (Moderator)

Transcript

*Note: Video transcribed via AI voice-to-text; there may be inconsistencies.

" Awesome. Okay. Hi everybody, I'm Anna Swigart.

I'm Director of Data Engineering at Helix, and today I'm going to talk about how we're approaching building a scalable data foundation in health tech. We're a health tech startup, so I'll give a little background on what Helix does and then dive into lessons learned about three different aspects of our business. We'll get into the details there.

So, backing up a little bit and thinking about why we're talking about genomics: it's recognized that there are many contributors to an individual's health, including behavioral, environmental, and social factors. Genetic factors are also a known major contributor.

But access to this data is very limited in today's US healthcare system. Despite a wealth of research, genomics hasn't quite made it into the standard of care yet, and this leaves physicians and patients with a missing critical link that could aid in their decision making. Patient care journeys often feel like a bumpy ride, or broken altogether, without that information.

What Helix does is partner with major health systems across the United States to offer an innovative new approach to integrating genomics into patient care. This kind of program is often described as a key component of a health system's precision medicine strategy. Our innovative "sequence once, query often" approach, as we call it, allows us to digitally reuse genetic data for many kinds of tests over a patient's lifetime.

We often start with a broadly recommended set of tests that the CDC, the Centers for Disease Control and Prevention, has designated as tier one for having strong evidence and notable potential to improve public health. Additional diagnostic panels, for early detection, diagnosis, and management of things like hereditary cancer, cardiovascular, and other conditions, are other kinds of tests that might be performed, as well as pharmacogenomics, which basically allows us to understand drug efficacy and risk of side effects based on an individual's unique genetic profile, in order to determine optimal prescriptions for them.

Okay, so now let's get into some of the data. Data is really at the core of the Helix business, in the nucleus, if you will. Broadly speaking, there are three parts of the business where data has the most impact. We have operations: this includes the umbrella of enrolling patients into research programs, tracking samples through our lab and other software systems, and tracking different kinds of laboratory measures, in addition to the clinical workflows that return results to patients and providers. Genomic data is prepared for research use along with clinical data, for participants who've consented to research. And on the commercial side, clinical and genomic data are used in life science partnerships for applications like drug discovery, for clinical trials, and to inform broader analyses of health outcomes.

So now you might be thinking: wow, that's a lot of exciting work happening in a very regulated space. How do you navigate all the compliance requirements? The answer is: you embrace them. Maintaining trust is incredibly important in an industry like healthcare.

And if you've lost trust, you've already lost the business. Here are a few of the rules we have to follow, as an example. In addition to being the right thing to do for patients, these arguably also have a positive impact on the overall data management quality and security posture that a regulated company might have, as compared to a non-regulated company.

Under HIPAA, we have to be really intentional about limiting the collection, storage, and sharing of PHI, or protected health information, to what's essential for the business. There's also a regulatory component to data quality for services that are in the critical path of delivering clinical results.

The test is considered a medical device, so this is great news for all of us as patients in the healthcare system: there's a high bar for quality on those clinical results. In some cases, patients opt to have elective medical procedures performed as an outcome of their genetic test result that might be invasive in nature, including things like prophylactic surgery. And sensitive data like PHI and genomic data really require very rigorous protection.

So how do you build a foundation that is secure, compliant, and also nimble? We are a startup, after all. I think of this work in three categories: table stakes capabilities, removing reliance on manual work (a.k.a. automation), and enablement of broader applications.

Starting at the bottom, you first have to decide as an organization how you break up the data space. What domains exist in your space? How do you map access levels to least-privilege and data-minimization principles? You also need a catalog and a method of understanding what data you actually have.

Importantly, this needs to include not only the analytics data layer, but also the data sources owned across teams. Then, in the automation bucket, you streamline tasks that operate on data categories and roles: to provision access, to assess data quality, and to carry out important privacy workflows like DSARs, or data subject access requests.

At the top of the pyramid is where significant investment in processes like fine-grained access control and robust de-identification workflows can really unlock much more value from the data in a safe way. In practice, here's an example of how we've put some of these components together.

At Helix, to drive value, we leverage a tool called Cyera to scan and classify data fields across our cloud environment. The tool identifies data subject context and data classes, and maps those to our internal sensitivity levels. We then take this metadata and ingest it into our data catalog.

In this case, we're using DataHub. Automation is then used to generate fine-grained access control policies in our data warehouse to dynamically mask sensitive fields based on access level. The metadata can also be used to enable comprehensive DSAR processes, helping upstream teams identify data elements that are in scope for carrying out privacy requests.
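
To make the automation step concrete, here's a minimal sketch of what metadata-driven masking can look like. The field names, sensitivity labels, and Snowflake-style DDL are illustrative assumptions for this writeup, not Helix's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ClassifiedField:
    table: str
    column: str
    sensitivity: str  # e.g. "PHI", "INTERNAL", "PUBLIC" (hypothetical labels)

# Metadata as it might arrive from a classification scan via the catalog.
fields = [
    ClassifiedField("clinical.results", "patient_name", "PHI"),
    ClassifiedField("clinical.results", "test_status", "INTERNAL"),
]

def masking_policy_ddl(field: ClassifiedField) -> str | None:
    """Emit dynamic-masking DDL for sensitive fields; None for the rest."""
    if field.sensitivity != "PHI":
        return None  # non-sensitive fields stay readable
    policy = f"mask_{field.column}"
    return (
        f"CREATE MASKING POLICY {policy} AS (val STRING) RETURNS STRING ->\n"
        f"  CASE WHEN CURRENT_ROLE() IN ('PHI_READER') THEN val\n"
        f"       ELSE '***MASKED***' END;\n"
        f"ALTER TABLE {field.table} MODIFY COLUMN {field.column}\n"
        f"  SET MASKING POLICY {policy};"
    )

for f in fields:
    if (ddl := masking_policy_ddl(f)) is not None:
        print(ddl)  # in practice, applied to the warehouse by the automation
```

Generating policies from catalog metadata, rather than hand-writing them, keeps masking in sync as the scanner discovers new fields.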

So you're shifting the responsibility to them, but you're giving them tools to make them successful. Now that we have a good handle on the compliance considerations, let's talk about the data quality aspect: data quality in operational data in particular, and how to bring people along as true partners in that process.

Before we get into the details of how to pick your contracts and what to implement, I'd like to paint a bit of a picture for you of something I like to call the data health maturity curve. It happens to be a healthcare analogy, but imagine it in the context of a data problem that you're facing, and let me know if it resonates.

Most of us start paying attention to data quality because we find ourselves in a reactive triage state: there's a problem that's gotten big enough for a human to notice, cause alarm, and become an emergency. This is basically being in the ER; you're under the line of the maturity curve.

Maybe then you graduate to having detection. You implement data contracts on the things that you know might be problems, you turn on some alerting, and you pat yourself on the back that you're no longer being quite so reactive. But then contracts start failing.

They're actually doing their job; that's great. So you focus on categorizing the issues so you can get to the root cause really effectively. Understanding the cause of the issue is helpful, but it doesn't really provide business value until you're able to identify and apply the appropriate treatment, maybe even through automation if you're really advanced.

But then you're treating symptoms all over the place, and you realize that what you really want is for the alert volume to be lower in the first place: how can you prevent bad data and code from wreaking havoc in the system? Of course, this is the space that Gable is working in, and an important problem to solve.

And then finally, there's the idea of investing in proactive wellness, which maybe you normally try to do on a best-effort basis. But sometimes a much larger investment and commitment is really needed to truly create an effective data model and data enrichment strategy that can keep up with your business.

So how do you navigate this maturity curve? To understand where data expectations, or contracts, should be established, first you want to identify and codify the metrics most important to your business operations. For our company, turnaround time, or TAT, is our North Star operational metric.

Did the workflow finish in the time it was expected to? That's really important for trust with partners and for being established as a reliable lab. Okay, so you have your metric; next, you want to understand what can hurt this KPI. At Helix, we analyzed on-call issues across engineering teams to understand what kinds of problems recur, or can be caught earlier in the operational flow.
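
As a toy illustration of codifying TAT as a checkable expectation, the sketch below flags workflows that missed their expected window. The field names and the 72-hour window are hypothetical:

```python
from datetime import datetime, timedelta

SLA = timedelta(hours=72)  # hypothetical expected turnaround window

orders = [
    {"order_id": "O1", "received": datetime(2024, 1, 1, 9, 0),
     "resulted": datetime(2024, 1, 3, 17, 0)},   # within window
    {"order_id": "O2", "received": datetime(2024, 1, 1, 9, 0),
     "resulted": datetime(2024, 1, 5, 9, 0)},    # 96 hours: breach
]

# Did each workflow finish in the time it was expected to?
breaches = [o["order_id"] for o in orders
            if o["resulted"] - o["received"] > SLA]
print(breaches)  # ['O2'] -- surface this before a partner notices
```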

If there's any level of confusion about the workflow, and how it's represented in the software and interacted with by people, a great strategy is to build a shared understanding of the domain events and business workflows through cross-functional sessions. What we did was use a framework called event storming, which is often leveraged in domain-driven design.

Event storming can be used to understand and document what the expected workflows are and what the current ones are, and to create cross-functional alignment and artifacts that can be referenced later by the team. The idea is to bring together software developers and domain experts to learn from each other. At Helix, these sessions span different teams, including engineering, product, analytics, solutions, and operations people.

And it can actually make a lot of sense for a data person to lead this kind of session, too.

As you're thinking about data contracts, it can be a good practice to start with a core set of checks that I like to call vital signs. Again with the healthcare analogies, but bear with me. Think about it like this: when you have any kind of interaction with the healthcare system, there's a reason they always start with the same few measurements first, right?

These can be reliably indicative of other major problems. So just like a provider would take your temperature and blood pressure before ordering an MRI, you should check to make sure the data's there, in the right shape, before running any complicated, expensive kind of validation on it.

To set the thresholds for these kinds of checks, consider questions like: which fields are needed to support critical workflows? What happens in the business if the data's unavailable? How does one record relate to others? And for anything that you would consider a breaking or critical contract, run those checks upon ingestion or, ideally, upstream in production.
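
Here's a minimal sketch of what such a "vital signs" suite can look like. The column names and checks are hypothetical examples of the cheap, structural tier that runs before anything expensive:

```python
import pandas as pd

def vital_signs(df: pd.DataFrame) -> list[str]:
    """Cheap structural checks to run before any expensive validation."""
    failures = []
    # Is the data there at all?
    if len(df) == 0:
        failures.append("no rows received")
    # Are the fields critical workflows depend on present and populated?
    for col in ("sample_id", "order_id", "collected_at"):
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif df[col].isna().any():
            failures.append(f"nulls in required column: {col}")
    # Do records relate to each other as expected (e.g. unique keys)?
    if "sample_id" in df.columns and df["sample_id"].duplicated().any():
        failures.append("duplicate sample_id values")
    return failures

df = pd.DataFrame({"sample_id": ["S1", "S2"], "order_id": ["O1", "O2"],
                   "collected_at": ["2024-01-01", "2024-01-02"]})
assert vital_signs(df) == []  # a breaking contract: fail ingestion otherwise
```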

Beyond the vital signs, you want data contracts that address your most critical business logic. It's worth noting that not all data validation makes sense on write: source system values may be optional for some use cases but required for others, and conditional contracts can be a really good way of capturing that nuance.

You can draw on the workflow expectations that you came up with in your event storming sessions to check for things like valid states under particular conditions and the order of events that are occurring. You can also make great strides toward your data wellness goals by working with product and software teams to optimize how much of the complexity can be handled in upstream data production and service design.
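
A small sketch of what such conditional contracts can look like; the states, fields, and transitions are hypothetical stand-ins for the kind of workflow an event-storming session would map out:

```python
# Allowed order of events, as an event-storming session might document it.
VALID_TRANSITIONS = {
    "ordered": {"sample_received"},
    "sample_received": {"sequenced", "sample_failed"},
    "sequenced": {"result_released"},
}

def check_record(record: dict) -> list[str]:
    """Conditional contract: a field optional in general, required in one state."""
    errors = []
    if record["status"] == "sample_failed" and not record.get("failure_reason"):
        errors.append("failure_reason is required when status is sample_failed")
    return errors

def check_event_order(events: list[str]) -> list[str]:
    """Flag transitions the documented workflow doesn't allow."""
    return [f"invalid transition: {prev} -> {nxt}"
            for prev, nxt in zip(events, events[1:])
            if nxt not in VALID_TRANSITIONS.get(prev, set())]

print(check_record({"status": "sample_failed"}))    # missing failure_reason
print(check_event_order(["ordered", "sequenced"]))  # skipped a state
```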

All right. Finally, I want to talk about our research data pipeline a little bit, and touch on how Helix has been able to shift a lot of data quality and interoperability effort upstream to our health system partners by leveraging data standards within our industry.

Helix currently partners with 16 health systems; these are collections of hospitals. The health systems send us EHR data as part of HRN, the Helix Research Network program. This data is then leveraged in novel scientific research, both by our research team and by researchers at our partner sites.

It's used to give insight about health factors and outcomes back to the health systems where these folks are patients, and as part of life science partnerships. We enlist each partner to take on the data normalization work for their own data, while providing insights into quality metrics back to them with each submission they make.

On the tech side, the EHR records usually come from a system called Epic, which in most cases is the common software platform that manages medical records and other administrative processes in hospitals. We then have serverless Spark workflows through AWS Glue that allow us to process these large datasets in a lakehouse architecture.
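
For orientation, here's a minimal sketch of the general shape of such a job, written as an AWS Glue script. The bucket paths, table name, and transforms are placeholders, not Helix's actual pipeline:

```python
# Shaped as an AWS Glue job script, so it runs in Glue's serverless Spark
# environment rather than as a standalone local program.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read a raw EHR extract landed in the lake (placeholder path).
raw = spark.read.parquet("s3://example-bucket/raw/ehr/condition_occurrence/")

# Light normalization ahead of OMOP mapping: fix types, drop duplicates.
cleaned = (
    raw.withColumn("condition_start_date", F.to_date("condition_start_date"))
       .dropDuplicates(["condition_occurrence_id"])
)

# Write into the lakehouse layer for quality checks and enrichment downstream.
cleaned.write.mode("overwrite").parquet(
    "s3://example-bucket/lakehouse/omop/condition_occurrence/"
)
```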

Source data from health records can be incredibly heterogeneous, even within the same EHR system, so there's quite a need for a standard model. There's an interdisciplinary community called OHDSI, pronounced "Odyssey": the Observational Health Data Sciences and Informatics group.

They've developed an open standard called the OMOP CDM, or Common Data Model. OMOP is basically a person-centric relational data model designed for analysis purposes. It has standardized vocabularies for things like medical terms, it's technology agnostic, and it preserves provenance from the source data.

We're really grateful to have this standard, and we're active participants in the OHDSI community. To help health system partners contribute the cleanest and most usable data they can, we run a large suite of data quality checks on ingestion, and then after enrichment as well.

We follow a data quality framework, described here, that's also been leveraged in some of the data quality tooling OHDSI maintains: the Kahn framework. It defines key data quality check categories: conformance, which spans values, relationships, and derived-value computations; completeness, meaning are particular elements present at an expected frequency; and plausibility, as in, are the values actually believable? That's been super useful, and something we've been using as we develop our own tests. So here's a fun, real-world example to bring this to life a bit.

A type of plausibility check that you might be interested in could relate to the relative frequency of a certain kind of record. At one point, with an EHR dataset we received, we saw a surprising summary statistic: it appeared that 30% of our cohort had a diagnosis ID representing an ICD-9 code saying they were the driver of a bus injured in a collision with a two- or three-wheeled motor vehicle in a non-traffic accident.

This seems unlikely, right? ICD is the International Classification of Diseases, another controlled vocabulary, for the win. But in this case, it turned out to be a vocabulary version mapping issue: there was an ICD-10 code, in the next version, with the same ID mapped to an examination of eyes and vision, which is a much more common occurrence than the previously described one, and a lot more likely to be the right thing. It turned out the ICD version had been switched over in 2015, and the partner had just switched the timeframe and vocabulary mapping in their logic.

So we were able to resolve that one. But all kinds of things come up.
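
Here's a sketch of the kind of relative-frequency plausibility check that surfaces an issue like this one; the threshold, column names, and codes are illustrative assumptions:

```python
import pandas as pd

def implausible_frequencies(diagnoses: pd.DataFrame, cohort_size: int,
                            max_share: float = 0.10) -> pd.Series:
    """Return diagnosis codes whose share of the cohort exceeds max_share."""
    # Count distinct patients per code, not raw row counts.
    per_code = (diagnoses.drop_duplicates(["person_id", "condition_code"])
                         .groupby("condition_code")["person_id"].nunique())
    shares = per_code / cohort_size
    return shares[shares > max_share]

# Placeholder codes: CODE_A stands in for the mis-mapped bus-driver diagnosis.
diagnoses = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "condition_code": ["CODE_A", "CODE_A", "CODE_A", "CODE_B"],
})
print(implausible_frequencies(diagnoses, cohort_size=10))
# CODE_A shows up for 30% of a 10-person cohort -> flagged for review
```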

So, in summary, to bring it all back together, these are the lessons I want to leave you with. One: invest in the right processes and systems to be secure, compliant, and nimble with your data; maintaining trust really is everything, especially in a regulated space. Two: seek to reach a proactive wellness state with your operational data through deep business process alignment and practical data contracts. Three: consider getting on board any relevant data standards bus in your industry to drive interoperability, not only within your organization, but also between organizations at scale.

So that's all I've got. Thanks again for having me. You can find me on LinkedIn, or find out more about Helix at helix.com. Awesome.

Thank you so much, Anna. I really enjoyed your talk, especially given my background: before I got super into data contracts, I was really into healthcare. Maybe to add some color for the audience on why healthcare data is so hard: a lot of the data collected comes through those EHR systems, such as Epic, and as you know, all the doctors hate them.

So the very people who have the domain knowledge, who are collecting the data, do not like using the tool for collection. And I was really impressed with you talking about that Kahn framework; I hadn't heard of that before. How do you change the behaviors of these upstream people who are collecting data, who historically have had a bad relationship with these tools?

Yeah, often the folks that we're working with are the IT departments of the health systems, and they're really the data champions for their organizations. They're incentivized to make the data useful for leadership at their locations, too.

So that's part of what we draw on. And then really making it easy for them is the other carrot: how do we identify exactly what the issues are that we find, and then show them where those are and, in some cases, even how to fix them. Awesome. Awesome.

I'm going to go check the Q&A real quick and see if there are any other questions here. Wait, wait, wait, I've got you. You can't see my screen right now. I don't see any specific questions coming through on this talk; I'm just looking right now through the chat, and... ooh.

Hmm. There are general questions, not necessarily data questions; general kind of health tech questions, we could say. So I'm not seeing anything specific. I can jump in; we've got to come up with questions. So: a big part of shift left is the change management component. How do you balance moving quickly and taking these kinds of new, innovative approaches to data while being in a highly regulated space such as healthcare?

Yeah, that's a good question. I think you want to be clear about the goals of what you're trying to achieve with the data products that you're building. Staying focused on how you bring value to the users who are trying to use the data is really, I would say, the key.

Then you can prioritize different aspects accordingly. It helps to have great partnerships with the compliance, security, and privacy teams, to get any questions answered quickly. Hmm. Awesome. And then we have another question in the chat: how to find the bad data.

Maybe to add more context for that: healthcare data is vast. I imagine you're working with massive datasets, and you talked about the heterogeneity. It's almost like being overwhelmed with data and data quality issues. How do you prioritize that and know which bad data to focus on first?

So first, again, start with the structural kinds of things and make sure that the structure of your data, the relationships between things, and the frequencies are as expected. If you've got all of that down, then you really want to focus on the content: the content relevant to the use cases that you know are going to drive value, or that you have stakeholders interested in.

We might not be able to fix every issue in this vast EHR record collection that we have. But by looking at the common things that we know should be there, and the things that are research areas of interest for the groups we work with, we can do more granular looks into that.

Nice. There's another one coming through, which is fascinating to me, because it's talking about how a lot of shift left seems to depend on the alignment between data producers and consumers. Is that harder to implement when you depend on source systems

that you don't own, like EHRs?

You know, in some ways it's harder, because you may not have a direct line of communication to somebody who could solve the problem. But even in these operational kinds of workflows, or cases where you have engineering teams producing data and you're consuming that data, the amount of context that has to be captured and tracked between a large set of people is usually also non-trivial.

Which is why it's hard; it's why we're having this conference. So in some ways, when we have the data structure standard, and quite a lot of expectations defined about what we expect to be true in the data, that goes a long way toward being able to know which data's good and how to handle the things that don't meet the expectations.

Hmm. Excellent. All right, well, guess what time it is. I've got to wrap this up, because we have more speakers on the way."