Data is at the core of AI systems, but here's the paradox: even as generative models grow to billions of parameters and train on massive datasets, your organization's specific data remains more crucial than ever. Successfully deploying real-world AI applications goes beyond leveraging powerful models and requires flexible pipelines that can seamlessly connect those models to your data where it lives, whether in databases, document stores, or cloud storage. While large models provide an impressive foundation of general knowledge, it's the ability to enhance them with your own data holdings that transforms prototypes into production systems. Modern approaches like retrieval augmented generation (RAG) complement model training and fine-tuning, and a thoughtful approach to integrating unique and proprietary data is critical to maximizing the potential of generative AI.
ML History: Data and Algorithms Collide
Not long ago, algorithms and data could be treated as two related but separate components of software. An algorithm could be designed by human minds to follow rules set with an understanding of the domain, context, and shape of the data it was intended to process or analyze, and it did not shift, evolve, or adjust away from those rules as data flowed through it. The resurgence of deep learning in the 2010s brought a fundamental shift. Old-school algorithms certainly haven’t gone away (nor are they likely to any time soon), but advances in hardware put computer science and statistics on a collision course that brought aspirational ideas from the 20th century into the spotlight: feed-forward neural nets leapt beyond the limits of the perceptron, and convolutional neural nets (CNNs) of previously intractable scale had a renaissance.
The era that followed, as these dormant machine learning (ML) techniques awakened from one of their AI winters, fundamentally changed the relationship between data and algorithms in production: the systems still took data in and produced data out, but data flowed through them in an entirely different way during training. As these ML systems processed training data to better predict and produce outputs, they also internalized the information it held; trends and generalizations about the data they encountered became encoded in their weights. In doing so, they arrived at algorithms that wouldn’t, or perhaps couldn’t reasonably, be defined by humans directly. Andrej Karpathy’s Software 2.0 (2017) chronicled the implications of this shift with great foresight and laid out ambitious predictions for the future.
Your Data Remains Indispensable
The road from there to modern generative AI models has been far from linear, and it will likely be some time before a coherent historical summary emerges. However, one trend has been abundantly clear along the way: the models just keep getting bigger. Current state-of-the-art architectures, like the attention-based transformers behind most Large Language Models (LLMs) today as well as diffusion-based image generators, have shown an uncanny ability to scale their performance with the amount of training data and the number of model parameters. This scaling is neither infinite nor necessarily the most efficient way forward, but the gains from training larger models have brought about the massive models we see today, such as Meta’s Llama 3 models with billions of parameters.
The very large parameter space of modern AI models gives them tremendous capacity to internalize training data. After all, the ability to draw on that information and produce generations relevant to the input is the very thing that makes them useful. This impressive capability has inspired applications and use cases that rely heavily on that internalized understanding, pulling from fragments of training data. Relying on internalized data is tempting as a first step for LLM experimenters because it treats the system as a self-contained computation machine. Does this mean the end of custom datasets? Absolutely not, and individuals and teams who have begun bringing generative AI models to production know this well.
This is one element of the phenomenon our previous blog post (The 80/20 Rule of Generative AI: From Prototype to Production) covers: the massive amount of information internalized by large models can provide tremendous inspiration and glimpses of possibility. However, industry- and business-specific use cases remain unique. Just as 80% of the implementation remains after the inspiring 20% up front, the vast amount of data modern generative AI models have been trained on is, somewhat paradoxically, only the starting point for the data a generative AI pipeline needs to be successful. Even if the scale is small in comparison, the last mile of use-case-specific data and information can be critical.
Making Your Data Available to AI Models
Two popular ways of introducing new data to an LLM are fine-tuning and retrieval augmented generation (RAG). Fine-tuning, the more computationally expensive and laborious of the two, involves further training a model on one’s own data to improve performance in relevant domains and tasks. As the new data runs through the training process, the model’s weights are adjusted and optimized for it. For organizations with vast, well-curated data, compute, and engineering resources, this can offer a way to customize a model with their specific data holdings and improve performance on related tasks. However, these very requirements make fine-tuning difficult for smaller teams to build with, and even larger organizations with more resources may find the cost-benefit analysis of training massive models unfavorable compared to the alternatives. The evolution of model architectures and more efficient approaches to fine-tuning are areas we’re tracking at Meibel that could shift this calculus in the future, but for the present these barriers to fine-tuning remain for many groups.
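To make the mechanics concrete, here is a minimal fine-tuning sketch using the Hugging Face Transformers and Datasets libraries. The base model name, dataset path, and training settings are illustrative assumptions, not a recommendation for any particular workflow.

```python
# Minimal causal-LM fine-tuning sketch (illustrative; not a production recipe).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"   # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your proprietary documents, one text record per line (hypothetical path).
dataset = load_dataset("json", data_files="company_docs.jsonl")["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # further optimizes the pretrained weights on your data
```

Even this toy setup hints at the requirements described above: a well-formed dataset, enough compute to hold and update billions of weights, and the engineering time to curate data and tune the training run.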
RAG, on the other hand, can be useful at nearly any scale. RAG systems index your data and, at generation time, make contextually appropriate data elements the LLM has never seen available to it. The advantage of this method is that it involves no expensive retraining of model weights and, depending on the implementation, can be highly efficient in selecting which new data elements are provided for which generations. This efficient selection of data boosts performance right at generation time with far lighter-weight preprocessing than fine-tuning.
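As a rough illustration, the sketch below shows the core RAG loop with a small in-memory corpus: documents are embedded once, the most similar ones are retrieved for a query, and the result is placed into the prompt sent to the model. The embedding model and documents here are illustrative assumptions; a production system would typically swap the in-memory search for a vector database.

```python
# Minimal RAG loop: embed documents, retrieve the most relevant, augment the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Q3 on-call rotation starts on October 7.",
    "The refund policy allows returns within 30 days.",
    "Production deploys are frozen during the holiday window.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # hypothetical embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Place retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is our refund policy?"))  # pass this prompt to any LLM
```

The model never sees most of the corpus; only the handful of elements relevant to each query are selected and provided at generation time.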
One additional benefit of RAG is the degree of specific data recall it allows. Documents made available to a model through RAG don’t go through the information compression that occurs when data is trained into model weights. As a result, relevant information can be made available to models intact, in a way that supplements even fine-tuned model workflows. In this sense, we find it extremely useful to give AI models an ingest pipeline to our users’ most relevant data exactly as it is, where it is.
Meeting Your Data Where It Lives
When a team first implements a basic RAG workflow with their own data, the boost in performance and the ability to work with hyper-specific information can be inspiring and encouraging. However, two major questions soon follow for those serious about scaling up their workflows:
- How do I bring more types of data into my AI pipeline?
- How do I make sure that my RAG system stays up to date with new, changing, and removed data elements?
The answers to these questions are crucial to taking a generative AI pipeline from successful prototype to full-fledged implementation. Although the relationship between data and the algorithms that process it looks very different than it did in past decades, building strong connectors and pipelines to get data from where it’s stored to where it’s processed is just as important as it has ever been. Data connectors that quickly and efficiently reach out to disparate data sources may not be where one’s imagination first heads when thinking of AI workflows, but the importance of building them, and building them right, is hard to overstate. Your data needs to be secure, accessible, up to date, and minimally duplicated. These are the areas of focus that have driven the design and implementation of Meibel’s data workflows. The key concept of meeting your data where it lives has helped us design and build a platform that maximizes compatibility with different data sources and minimizes configuration when adding and refreshing datasets.
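As a sketch of what "building them right" can involve, here is one hypothetical shape for a connector abstraction that supports incremental refresh. This is not Meibel’s API, just an illustration of how discovery, change detection, and removal of stale elements fit together to keep an index in sync with its sources.

```python
# Hypothetical connector interface supporting incremental refresh of an index.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable

@dataclass
class SourceRecord:
    uri: str                     # where the record lives (table row, object key, ...)
    version: str                 # etag / updated_at / checksum used to detect changes
    content: bytes | None = None

class DataConnector(ABC):
    @abstractmethod
    def discover(self) -> Iterable[SourceRecord]:
        """List records currently in the source, without loading their content."""

    @abstractmethod
    def fetch(self, record: SourceRecord) -> SourceRecord:
        """Load the content of a single record."""

def sync(connector: DataConnector, index: dict[str, str]) -> None:
    """Refresh the index: ingest new/changed records, drop removed ones."""
    seen: dict[str, str] = {}
    for rec in connector.discover():
        seen[rec.uri] = rec.version
        if index.get(rec.uri) != rec.version:
            connector.fetch(rec)            # ...then chunk, embed, and upsert
            index[rec.uri] = rec.version
    for uri in set(index) - set(seen):      # element removed at the source
        del index[uri]                      # ...and delete its indexed representation
```

Running a sync like this on a schedule is one way to keep a RAG index current with new, changed, and removed data elements without re-ingesting everything each time.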
Meibel’s Approach
When building Experiences in the Meibel platform, users can configure ingest connectors that point to their data sources, whether structured databases like PostgreSQL tables or bulk object storage like Amazon S3 buckets. Customizable discovery and ingest processes then periodically check, refresh, and load relevant data elements into indexed representations available for model use. This makes it as easy as possible for our users to configure new data sources that feed into the same workflows without having to worry about the implementation of data refreshes. We put a lot of care into the data engineering behind the ingest flows so that builders of new AI workflows and Experiences using Meibel can spend their time connecting to where their data already lives and is maintained, rather than replicating, re-modeling, and re-engineering their data pipelines.
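Purely for illustration, a configuration for two such sources might look something like the following; the field names and scheduling syntax are hypothetical and not Meibel’s actual format.

```python
# Illustrative sketch of configuring two ingest sources (hypothetical fields).
experience_config = {
    "experience": "support-assistant",
    "sources": [
        {
            "type": "postgresql",
            "connection": "postgresql://analytics-db/knowledge",
            "table": "support_articles",
            "updated_at_column": "modified",   # used to detect changed rows
            "refresh": "every 15 minutes",
        },
        {
            "type": "s3",
            "bucket": "acme-policy-documents",
            "prefix": "handbooks/",
            "refresh": "hourly",               # re-discover new or removed objects
        },
    ],
}
```

The point of a declarative setup like this is that the data stays where it is already maintained, and the platform handles discovery and refresh behind the scenes.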
An equally important result of this approach is the way it frees workflow builders to experiment and iterate rapidly with different combinations of data. Not all data context is equally valuable to the performance of AI workflows, and finding the right combination of data sources and representations for optimal outputs is a balance of performance, cost, and fitness for purpose. We want to enable our users to take their prototypes to production success and beyond. Meeting their data where it is and building powerful pipelines to make the most of it is a key part of our strategy for doing so.