Gable.ai Editorial: “ChatGPT: Regarding LLMs and their ability to hallucinate, what famous and compelling quote would you recommend that highlights the importance of LLM data quality?”
ChatGPT: “Garbage in, garbage out.”
Well, bravo to that.
As tempting as it was to leave it at this—that answer will be a tough act to follow—the topic of large language model (LLM) data quality deserves its share of unpacking.
Large language models and related technologies are currently a high-visibility topic for business leaders in the data-driven world. At the same time, the tech industry is incorporating piles of LLM-related… stuff into our personal and working lives, often without our even knowing it.
(Photo illustration by Gable editorial / Midjourney)
What puts the “LL” in LLMs?
We’re going to keep things as simple as possible here, as our overall focus is on LLM data quality—not tumbling down the LLM rabbit hole.
To appreciate the vital role data quality plays in the LLM world, let’s build a basic, shared understanding of how these models are designed, built, and trained.
LLMs, like OpenAI’s GPT models, are advanced artificial intelligence (AI) systems—a subset of generative AI (GenAI)—that engineers design to generate new, human-like text based on written or spoken prompts. Their “largeness” comes from both their architecture and the immense data and parameters they use.
Transformer architectures, particularly the now-popular decoder-only models, enable efficient parallel processing of text sequences. This architecture, combined with billions (or trillions) of parameters, gives LLMs the computational gunpowder they need to handle vast and complex inputs and outputs.
The language aspect of LLMs stems from how these models interpret and generate text. Instead of seeing words as we do, these models represent them as multi-dimensional vectors called word embeddings, which capture relationships between words based on patterns in the data. Building an LLM involves feeding it enormous, diverse text datasets (e.g., books, articles, forum chats) that professionals clean, format, and tokenize to optimize learning.
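To make the idea of embeddings a little more concrete, here is a minimal Python sketch. The three-dimensional vectors are hand-picked, hypothetical values used purely for illustration; real models learn their embeddings during training, and those vectors have hundreds or thousands of dimensions. The sketch only shows how "relatedness" can be measured as the cosine of the angle between two vectors.

```python
import math

# Hypothetical, hand-picked 3-dimensional vectors for illustration only.
# Real embeddings are learned during training and are far larger.
EMBEDDINGS = {
    "salsa":  [0.9, 0.1, 0.0],
    "tomato": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

food_pair = cosine_similarity(EMBEDDINGS["salsa"], EMBEDDINGS["tomato"])
mixed_pair = cosine_similarity(EMBEDDINGS["salsa"], EMBEDDINGS["laptop"])

# Related words should score closer to 1.0 than unrelated ones.
print(f"salsa vs tomato: {food_pair:.2f}")
print(f"salsa vs laptop: {mixed_pair:.2f}")
```

With these made-up vectors, the two food words score far higher than the food-and-hardware pair, which is the kind of relationship learned embeddings are meant to capture.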
Training starts with unsupervised pre-training, where models predict missing or next words to build foundational language skills, followed by fine-tuning for specific tasks. Iteration, evaluation, and reinforcement learning further refine these skills.
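The next-word prediction idea behind pre-training can be illustrated with a toy model. The sketch below is a deliberately simplified stand-in: it counts which word follows which instead of training a neural network, and the function names and tiny corpora are made up for this example. It also shows "garbage in, garbage out" in miniature, since a few bad sentences are enough to change what the model predicts.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequently observed next word, or None if unseen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# A small, clean corpus (hypothetical).
clean_corpus = ["the sky is blue", "the sky is clear", "the sky is blue today"]
clean_model = train_bigram_counts(clean_corpus)

# The same corpus polluted with repeated low-quality text.
noisy_corpus = clean_corpus + ["the sky is green"] * 3
noisy_model = train_bigram_counts(noisy_corpus)

print(predict_next(clean_model, "is"))  # most common follower in clean data
print(predict_next(noisy_model, "is"))  # most common follower in noisy data
```

Real LLMs predict over tokens with billions of learned parameters rather than a lookup table, but the dependence on what the training text actually says is the same.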
In short, LLMs are “large language” models because of their vast scale, both in data and computational complexity, and their ability to generate human-like text through nuanced understanding of language patterns.
The importance of data quality in LLMs
It’s clear now why data is oxygen to the LLM. Just as with humans, the quality of an LLM’s oxygen plays a vital role in both its development and its performance.
Data quality issues, when and if they occur, can leave an LLM speechless. Some of the reasons why are obvious—but others, not so much.
Accuracy and reliability
At the most fundamental level, LLMs need high-quality data to learn accurately from their training datasets. In this way, optimal data quality directly correlates to LLMs that produce more reliable and precise outputs.
Increasingly, this accuracy is what LLM developers seek to showcase, especially when pitting one model against another. Accuracy and reliability matter most, however, in industries like healthcare and finance, where poor data can prove costly.
Bias mitigation
Poor-quality data is often inherently biased. Engineers and developers who choose to use it risk infecting the subsequent performance of their LLM, no matter how much care and attention they pay to the model itself.
Alternatively, engineers can actively combat this outcome by thoroughly vetting all potential sources of LLM data, working to ensure that they are balanced, complete, diversely sourced, and equally representative.
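One small, concrete piece of that vetting is simply measuring how much of a dataset each source contributes. The Python sketch below is an illustration under assumptions: the `source` field, the record layout, and the 50% threshold are all hypothetical, and a balanced source mix is only one of many signals a real bias review would consider.

```python
from collections import Counter

def source_shares(records, key="source"):
    """Return each category's share of the dataset as a fraction of all records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

def flag_dominant(shares, max_share=0.5):
    """Return the categories whose share exceeds max_share, sorted by name."""
    return sorted(name for name, share in shares.items() if share > max_share)

# Hypothetical dataset: 7 forum posts, 2 news articles, 1 book excerpt.
records = (
    [{"source": "forum"}] * 7
    + [{"source": "news"}] * 2
    + [{"source": "books"}] * 1
)

shares = source_shares(records)
print(shares)
print(flag_dominant(shares))  # sources contributing more than half the data
```

The same counting approach can be pointed at any categorical field, such as language or region, to spot a dataset that leans too heavily on one kind of input.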
Generalization capabilities
When engineered using high-quality data, LLMs are better at generalizing across various use cases and domains. Today, this agility is essential in demanding scenarios like sentiment analysis, language translation, and content generation.
In the near future, generalization capabilities like these will be essential as AI engineers chase promising initiatives like artificial general intelligence (AGI). Such machines, unbounded by how they are designed and created, would be able to learn and grow without human intervention or guidance.
Efficiency and cost-effectiveness
Efficiency and low costs may seem pedestrian compared to the prospects of AGI, but the exciting future many in both sci-fi and science envision isn’t achievable without these meat-and-potatoes factors. This directly applies to LLMs: training on high-quality data often demands less of the training process overall, which saves time, resources, and valuable compute.
7 Key LLM data quality challenges
Whether you're building a modest LLM to help hard-working parents perfect their favorite salsa recipe or creating a groundbreaking machine learning model aimed at surpassing GPT-4, access to high-quality data is essential for success.
Unfortunately, when the need for optimal data quality is high, challenges tend to arise. Key challenges range from the sheer scale of the data involved to the need to balance access to information with consumer expectations of privacy.
1. Unfathomable datasets
The sheer size of modern pre-training datasets makes it nearly impossible to manually assess and ensure data quality. For example, Meta reported using roughly 2 trillion tokens to train its Llama 2 model, released in July 2023.
Over the following year, Meta scaled up, training its Llama 3.1 models with 15.6 trillion tokens. Similarly, Alibaba Cloud used an estimated 7 to 12 trillion tokens to train its Qwen 2 LLM, released in June 2024.
Because of these data management issues at scale, AI engineering teams often rely on heuristics and automated filtering to manage these datasets. Such shortcuts are imperfect, and problems like near-duplicates and benchmark data contamination can slip through.
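To show what such a filtering heuristic can look like, here is a minimal Python sketch that flags near-duplicates and estimates benchmark contamination using overlapping word n-grams ("shingles") and Jaccard similarity. The function names, the thresholds, and the sample documents are assumptions made for illustration. Comparing every pair of documents, as this does, only works on small samples; at trillion-token scale, teams typically turn to approximate techniques such as MinHash with locality-sensitive hashing.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_near_duplicates(docs, threshold=0.6):
    """Return index pairs whose shingle sets overlap at or above the threshold.

    Compares every pair, so it is O(n^2) and only suitable for small samples.
    """
    sets = [shingles(d) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

def contamination_ratio(train_doc, benchmark_item, n=3):
    """Fraction of the benchmark item's n-grams that also appear in train_doc."""
    bench = shingles(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & shingles(train_doc, n)) / len(bench)

# Hypothetical sample documents.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "data contracts define quality standards at the source",
]

print(find_near_duplicates(docs))             # pairs of near-identical docs
print(contamination_ratio(docs[1], docs[0]))  # docs[0] treated as a benchmark item
```

Word-level shingles are a design choice here; character-level n-grams or token-level n-grams are common alternatives, and the right threshold depends on the data.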
2. Bias and ethical concerns
Based on the trajectory of LLM advancement, the need for good information to train and refine ever more advanced learning models is theoretically infinite. In the real world, however, data availability most certainly is not.
Factors like budgets, privacy laws, and the theoretical use cases of a given LLM all affect which sets of information are available, let alone viable. For these reasons, companies that develop AI tools face difficult decisions when determining what information they are willing or even able to use for training their models.
It can be tempting to cut corners. But stakeholders or teams who make bad compromises with their data quality during the initial phases of a project can unintentionally introduce biases or irrelevant content into the deep tissue of their LLM.
In turn, this can spur miscalibrated responses or hallucinatory outputs. Any reliance on publicly available data sources can exacerbate these issues, as the information they contain may not be representative or ethically sound.
3. Data scarcity and accessibility
The availability of high-quality data varies greatly around the world. Teams operating in regions with strict data privacy laws or where access to open-source information is limited face additional challenges.
Even for datasets that are accurate, reliable, and complete, a lack of diverse information to draw from can cripple LLM utility.
4. Optimal data quality maintenance
Acquiring the data quality needed to create a GenAI model is one thing; maintaining that quality is an entirely different challenge. Engineers and developers rely heavily on unstructured data—such as video transcripts, books, and blog articles—to train AI models like LLMs. These resources are ideal for teaching models to converse like humans because they reflect natural communication.
However, preparing unstructured data requires significant effort. Teams must ensure datasets are relevant, accurate, and complete before use, a process that demands both time and energy. To make the data usable across systems, they also rely on consistent and precise labeling, which is labor-intensive and time-consuming given the vast amounts of data involved.
As teams manually prepare and label data, they may inadvertently introduce errors and biases that weren’t present initially. This underscores the complexity of balancing data preparation with the need for high-quality, unbiased training materials.
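One common way to catch labeling inconsistencies early is to have two annotators label the same sample and measure how much they agree beyond chance. The sketch below computes Cohen's kappa in plain Python; the annotator names and labels are hypothetical, and what counts as an acceptable score is a judgment call that varies by project.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of matching if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment labels from two annotators on the same six items.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

A score near 1 suggests the labeling guidelines are being applied consistently, while a score near 0 suggests agreement no better than chance and a need to revisit the guidelines before labeling more data.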
5. Data silos
Fragmented data across different systems makes it difficult to achieve the unified view that’s necessary for dialing in effective LLM performance. Overcoming this requires robust data quality management (DQM) and integration solutions. In the absence of DQM and related solutions, data siloing can occur, increasing training costs, hindering team collaboration and, ultimately, degrading the potential of the LLMs produced.
The mere existence of separate data silos within an organization increases IT costs due to duplicated infrastructure and storage systems. It also compounds inefficiencies as LLM teams may need to duplicate efforts when accessing or processing data for their model training.
Silos reduce opportunities for collaboration between teams, as the very nature of silos hinders the ease with which key information can be shared across departments. Poor communication in LLM companies can easily eat its way into the innovation and creativity needed for AI model development.
Ultimately, data silos contribute to fragmented and isolated data sources, which can lead to incomplete datasets. This fragmentation makes it difficult, if not impossible, for AI models to access the comprehensive data needed for accurate training and predictions.
6. Metadata issues
Poor metadata can lead to data governance and usage problems that affect the downstream quality of AI outputs.
Metadata plays a pivotal role in contextualizing and organizing training data, so incomplete or inconsistent metadata can hinder data discoverability and processing. Comprehensive data governance policies and continuous monitoring are essential to address these issues.
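As a rough illustration of what monitoring metadata can mean in practice, the Python sketch below reports which records are missing required metadata fields. The field names, the record layout, and the sample records are all hypothetical; a real governance policy would define its own required fields and would likely check formats and allowed values as well.

```python
# Hypothetical set of metadata fields a governance policy might require.
REQUIRED_FIELDS = ("source", "license", "language", "collected_at")

def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record's metadata."""
    meta = record.get("metadata", {})
    return [field for field in required if meta.get(field) in (None, "")]

def completeness_report(records, required=REQUIRED_FIELDS):
    """Map each record id to its missing fields, skipping complete records."""
    report = {}
    for record in records:
        missing = missing_metadata(record, required)
        if missing:
            report[record["id"]] = missing
    return report

# Hypothetical training records with varying metadata quality.
records = [
    {"id": "doc-1", "metadata": {"source": "blog", "license": "CC-BY-4.0",
                                 "language": "en", "collected_at": "2024-01-15"}},
    {"id": "doc-2", "metadata": {"source": "forum", "license": "", "language": "en"}},
    {"id": "doc-3"},
]

report = completeness_report(records)
for record_id, missing in report.items():
    print(record_id, "is missing:", ", ".join(missing))
```

Running a check like this on every incoming batch turns metadata quality from an occasional audit into something a team can track over time.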
7. Data privacy regulations
Compliance with data privacy laws can further limit the availability of certain datasets, making it challenging to acquire diverse and representative training data.
Stringent regulations, such as GDPR in Europe, can prevent organizations from accessing or sharing key datasets. While this isn’t a brick wall, per se, maintaining compliance while enriching LLM training data may require approaches to data anonymization and synthetic data generation that are beyond the means of many organizations.
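For a sense of what the simplest form of anonymization looks like, here is a minimal Python sketch that redacts email addresses and phone-like number sequences with regular expressions. The patterns and placeholder tags are assumptions chosen for illustration. Pattern matching like this misses many kinds of personal data (names, addresses, indirect identifiers) and is not, on its own, a way to achieve regulatory compliance.

```python
import re

# Simplified, illustrative patterns; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like sequences with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

# Hypothetical sample text containing made-up contact details.
sample = "Contact Ana at ana.lopez@example.com or +1 (555) 010-2030."
print(redact(sample))
```

Even a rough pass like this makes the trade-off visible: every redaction protects privacy but also removes information the model might otherwise have learned from.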
How data contracts can ensure top-tier data quality for LLMs
Embracing the fact that high-quality data is the foundation of effective LLM training means accepting the related challenges, such as the scale, bias, and privacy concerns that complicate data management.
For this reason, data leaders in the LLM world are yet another group that must embrace data contracts—proactive solutions that actively ensure data quality standards are upheld throughout any data lifecycle, including those of LLMs.
With proper drafting and implementation, data contracts can:
- Define data quality standards at the source
- Align stakeholders on data usage and purpose
- Streamline data governance and data validation efforts
- Mitigate bias through transparent agreements
- Promote accountability and transparency
- Facilitate compliance with privacy regulations
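To illustrate the first item in that list, defining quality standards at the source, here is a minimal Python sketch of a contract expressed as data plus a validator that checks records against it. This is a generic illustration, not Gable's product or any specific data contract format: the `CONTRACT` structure, its rule names, and the sample records are all hypothetical.

```python
# Hypothetical contract for records entering an LLM training pipeline.
CONTRACT = {
    "text":     {"type": str, "required": True, "min_length": 1},
    "source":   {"type": str, "required": True},
    "language": {"type": str, "required": True, "allowed": {"en", "es", "de"}},
    "license":  {"type": str, "required": False},
}

def validate(record, contract=CONTRACT):
    """Return a list of human-readable violations; empty means the record passes."""
    violations = []
    for field, rules in contract.items():
        if field not in record or record[field] is None:
            if rules.get("required"):
                violations.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            violations.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "min_length" in rules and len(value) < rules["min_length"]:
            violations.append(f"{field}: shorter than {rules['min_length']}")
        if "allowed" in rules and value not in rules["allowed"]:
            violations.append(f"{field}: {value!r} not in allowed values")
    return violations

# Hypothetical records: one that honors the contract and one that does not.
good = {"text": "Roast the tomatoes first.", "source": "blog", "language": "en"}
bad = {"text": "", "language": "fr"}

print(validate(good))
for problem in validate(bad):
    print(problem)
```

The point of running a check like this at the source, rather than after ingestion, is that a violating record can be rejected or fixed by its producer before it ever reaches a training dataset.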
Overall, data contracts act as bulwarks against “garbage in, garbage out,” protecting organizations from any compromise or degradation of the data stream. Whether they’re implemented on behalf of GenAI teams or not, these advantages benefit all data leaders. After all, ensuring that data strategies deliver measurable value is mission-critical in any data-driven business.
This is why it’s worth the time to sign up for our product waitlist at Gable.ai. In the world of big data, exceptional outcomes are in everyone’s best interests.