Vector Search to Success

Exploring the role of RAG in developing sophisticated LLMs.

Scott Dykstra

Co-Founder & CTO

Before 2022, if you wanted to quickly recall a specific passage from your favorite book or a quote from a movie you just watched without the work itself in front of you, you’d probably turn to a search engine. You’d prompt it with a well-formulated search query, parse through the returned results, visit the SparkNotes or IMDb link that appears to contain your answer, and find the text you’re looking for on the page within a few minutes. Now, you simply open ChatGPT, type “what’s the most famous Terminator quote?” or “write out the opening passage of A Tale of Two Cities” and have your verbatim answer back in seconds.

One of the simplest uses for a large language model (LLM) is as a database of knowledge. LLMs have been trained on vast datasets of rich information, and interfaces like ChatGPT have made that information easy to retrieve. When you prompt ChatGPT to return content from a movie or book, for example, you’re simply leveraging the model’s ability to recall information it was exposed to during training. But what if it wasn’t trained on the Terminator script, or if its weights don’t give importance to Dickens’ works? In order to deliver the most accurate and relevant results for even the simplest of use cases, such as basic information retrieval, LLMs need sophisticated indexing and retrieval mechanisms that can access a broad spectrum of information with precision.

Understanding LLM content generation and training

LLM content is generated through a process known as next token prediction, which ensures that responses are contextually appropriate, varied, and somewhat reflective of human-like understanding. Here’s how next token prediction works, step by step (a brief code sketch follows the list):

  1. Input Processing: When you type a prompt or a question, that input is converted into tokens: words or pieces of words. 
  2. Context Understanding: The model looks at the tokens you’ve given it and, based on its training, tries to understand the context, which includes everything from the topic at hand to the tone you might be using.
  3. Next Token Prediction: Using the context it’s understood, the model then predicts what the most likely next token is. It’s not just guessing based on the immediate previous word; it’s considering the entire context of the conversation up to that point. 
  4. Token Selection: Once it has predicted a range of possible next tokens, it selects one. This selection is based on probability—the token that’s most likely to come next based on the data the model was trained on. It's worth noting, however, that there's some randomness here too, which helps generate more varied and natural-sounding responses.
  5. Output Generation: The selected token is then converted back into human-readable text. If the response isn’t complete (which it often isn’t after just one token), the process repeats. The new token is added to the sequence, and the model predicts the next token based on this updated context.
  6. Iterative Refinement: This process of predicting the next token and adding it to the sequence repeats until the model reaches a stopping point. This could be when the response reaches a certain length, the model predicts a token that signifies the end of a sentence or passage, or when it fulfills the instructions embedded in the prompt.
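
To make this loop concrete, here is a minimal sketch of steps 1 through 6 in Python. It assumes the Hugging Face transformers library and the small, publicly available GPT-2 model purely for illustration; any autoregressive LLM follows the same pattern.

```python
# A minimal sketch of the next-token-prediction loop, using GPT-2 via the
# Hugging Face transformers library purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What's the most famous Terminator quote?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # 1. input processing

for _ in range(40):                                            # 6. repeat until a stopping point
    with torch.no_grad():
        logits = model(input_ids).logits                       # 2-3. score every candidate next token
    probs = torch.softmax(logits[:, -1, :], dim=-1)            # probability of each candidate
    next_id = torch.multinomial(probs, num_samples=1)          # 4. sample one (adds some randomness)
    input_ids = torch.cat([input_ids, next_id], dim=-1)        # 5. append it and go around again
    if next_id.item() == tokenizer.eos_token_id:               # stop if the model signals the end
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

Production systems layer temperature, top-k/top-p filtering, and batching on top of this loop, but the core cycle of predict, select, append, and repeat is the same.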

Limitations of compression in LLM training

When an LLM predicts a token, it's effectively retrieving and utilizing the compressed knowledge embedded within its weights to produce contextually appropriate outputs. In this way, LLM training mirrors database compression. Just as a database is optimized to recall frequently accessed data quickly, an LLM is designed to retrieve information—specific interpolated memories—from its weights. This capability allows it to produce precise responses to queries about familiar material it encountered during training, much like querying a database for well-indexed information. However, limitations arise when the model encounters less familiar or obscure content. Ask an LLM for a well-known Bible passage, for example, and it can quote it word for word; but it cannot do the same for content it has not repeatedly “witnessed” during training, because the weights associated with that content are too insignificant. In that sense as well, the LLM is analogous to a database. Just as a database can only return data that has been explicitly stored within it, an LLM can struggle to generate content on topics it has not extensively seen during training.

Of course, the analogy only goes so far: LLMs maintain an internal world model that lets them "understand" concepts rather than merely look them up. Still, this oversimplification helps illustrate some key limitations in the way LLMs are trained to generate content.

Further limitations of LLM training

Beyond this compression-like behavior, next token prediction has inherent limitations that stem from its fundamental approach to generating text:

  • Context Window Size: One of the primary constraints is the model's context window size—the maximum amount of text (in tokens) the model can consider when making a prediction. For many models, including earlier versions of GPT, this window is not large enough to maintain context over long conversations or documents, which can lead to a loss of coherence in longer texts or complex discussions that require context beyond the immediately preceding tokens (see the sketch after this list).
  • Generalization vs. Specificity: While these models are trained on vast datasets, their ability to generalize from this training can sometimes lead them to produce generic or vaguely relevant content. They might miss the mark in generating highly specific or nuanced responses that require detailed understanding or up-to-date knowledge outside their training data.
  • Lack of External Knowledge Access: Next token prediction models are limited to the information contained within their training datasets. They cannot access or incorporate new information post-training, which means they can quickly become outdated or lack current context, such as recent events, discoveries, or trending topics.
  • Repetitiveness and Predictability: The algorithmic nature of next token prediction can sometimes result in repetitive or predictable text generation. Since the model often favors tokens that are statistically more likely to follow given the context, it can fall into loops or prefer common phrases, reducing the variability of the output.
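
To illustrate the first point concretely, the sketch below counts a conversation's tokens with the tiktoken library and truncates anything that no longer fits. The 4,096-token limit is an arbitrary example chosen for illustration, not a property of any particular model.

```python
# Illustration of the context window limitation: count a conversation's tokens
# and drop the oldest ones once a fixed limit is exceeded. Assumes the tiktoken
# library; the 4,096-token limit is an arbitrary example.
import tiktoken

CONTEXT_WINDOW = 4096
encoding = tiktoken.get_encoding("cl100k_base")  # byte-pair encoding used by recent GPT models

conversation = "User: Summarize our discussion so far...\n" * 500  # a long transcript
tokens = encoding.encode(conversation)
print(f"Conversation length: {len(tokens)} tokens")

if len(tokens) > CONTEXT_WINDOW:
    # Everything before the cutoff is silently dropped, which is exactly where
    # coherence can be lost in long conversations.
    tokens = tokens[-CONTEXT_WINDOW:]

truncated_conversation = encoding.decode(tokens)
```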

Retrieval augmented generation (RAG) explained

As mentioned above, LLMs generate responses based on the weights they've assigned to different aspects of their data during training. These weights reflect how important or significant the model perceives various elements of that data to be. If a user's prompt includes elements that were not significantly represented in the training data, the model may fail to generate an accurate or relevant response.

When a conversation exceeds an LLM’s context window, or when a prompt touches on material that carried little weight in the model’s training data (meaning the model cannot recall exactly the answer the user is looking for), the system typically relies on an external vector search database, which allows it to retrieve relevant context or fresh data that can be appended to the user’s prompt. This process is known as retrieval augmented generation (RAG).

“Vector search to success”

The RAG process is made possible through a vector search database: an advanced type of database that stores and manages data as vectors. These vectors represent the data in a high-dimensional space, where each dimension captures some aspect of the data's meaning, allowing for the representation of complex relationships and attributes. In the context of text and language, vector search databases use techniques such as embeddings to convert text into numerical vectors. This conversion enables the system to measure semantic similarities between different pieces of text by calculating the distances between their corresponding vectors in this multidimensional space. 

During RAG, both the query (i.e., a user's input to the LLM) and the stored data (such as articles, documents, or sentences) are converted into vectors using text embeddings. These embeddings transform the textual data into numerical vectors where similar meanings are mapped to proximate points in the vector space. The database then computes the distances between the query vector and the vectors of the stored data to determine how closely the meanings of the texts are related. The database retrieves the data points (textual content) whose vectors are closest to the query vector, i.e., those that are semantically most similar to the input. These data points are considered the "nearest neighbors" in terms of context and meaning.
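
A minimal sketch of this retrieval step is shown below, assuming the sentence-transformers library for embeddings and a small in-memory list standing in for a dedicated vector search database; the documents, model name, and two-neighbor cutoff are illustrative assumptions.

```python
# A sketch of embedding-based nearest-neighbor retrieval for RAG, using the
# sentence-transformers library and an in-memory list in place of a dedicated
# vector search database (both choices are assumptions for illustration).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to dense vectors

documents = [
    "The Terminator's most quoted line is 'I'll be back.'",
    "A Tale of Two Cities opens with 'It was the best of times...'",
    "Proof of SQL verifies that a SQL query was executed correctly.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "What is the most famous Terminator quote?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vectors @ query_vector
nearest = np.argsort(scores)[::-1][:2]           # indices of the two nearest neighbors
context = "\n".join(documents[i] for i in nearest)

# The retrieved context is appended to the user's prompt before inference.
augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(augmented_prompt)
```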

These nearest neighbors provide contextually relevant, additional information that the base LLM might not have had access to within its own training data, which can significantly improve the accuracy, relevance, richness, and variety of the LLM’s outputs. Sam Altman, among others, has advocated for the “vector search to success” approach—relying on RAG for developing agents, rather than model fine-tuning alone.

RAG as an alternative to fine-tuning

Fine-tuning an LLM involves adjusting a model's weights through additional training on a specific dataset to enhance performance for particular tasks or improve understanding in certain domains. Not only is this process slower than the pace of innovation, meaning that fine-tuned models become obsolete almost as quickly as they’re updated, but it also doesn’t address the need for fresh data.

In contrast, RAG enables the model to access external databases in real time to retrieve the most current information relevant to the query at hand. Even if the underlying model hasn’t been updated or fine-tuned recently, it can still generate responses that include the latest data. Models remain relevant longer because they can adapt to new data and changing contexts through the retrieval of external information sources.

RAG effectively bridges the gap between deep learning and traditional information retrieval techniques. By doing so, it leverages the strengths of both—deep learning's powerful contextual understanding and the precision of information retrieval. This hybrid approach allows LLMs to produce more accurate, detailed, and contextually rich responses.

Addressing the further limitations of LLMs

Beyond serving as an alternative to fine-tuning, RAG also addresses the challenges of standard LLMs noted earlier:

  • Expanding Contextual Understanding: RAG effectively extends the reach of a traditional LLM's context window by fetching up-to-date or detailed information that enhances the model's responses.
  • Enhancing Specificity and Accuracy: Instead of relying solely on patterns learned during training, RAG allows the model to inject specific details from retrieved documents into its responses, making them not only more accurate but also tailored to the specific query at hand.
  • Mitigating Repetitiveness and Predictability: By dynamically pulling different sets of information for each query, RAG can vary a model’s responses significantly. This variability helps in reducing the repetitiveness and predictability often seen in pure generative models, as the external data introduces new phrasing and details into the conversation.

Challenges and necessary evolution of RAG

RAG comes with its own challenges, however: namely, latency and limited intelligence. Think about a turn-based agent chatbot conversation where the user submits a prompt, the LLM spits out a few tokens indicating it needs more context, a vector search database retrieves nearest-neighbor context based on the user’s input prompt, and then both are finally sent to the LLM again for inference. Then it’s the user’s turn to reply, and so on.

In this system, each user prompt initiates a multi-step operation where each step adds to the total processing time. The speed of the entire process is also contingent upon how quickly the vector search database can retrieve the necessary context. If the database query is complex or the database itself is large and not optimally indexed, this retrieval can introduce significant delays. Additionally, especially in more complex dialogues, this sequence of generation and retrieval may need to be repeated multiple times to refine the response adequately. This iterative cycle can compound the latency, leading to slower interactions than might be feasible with a purely generative model that relies solely on internal data.
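
The sketch below simulates a single such turn, with sleep() calls standing in for network and inference delays so the compounding latency is visible; llm_generate and vector_search are hypothetical stand-ins, not real APIs.

```python
# Simulated turn of the turn-based RAG flow described above. Both helper
# functions are hypothetical stand-ins, with sleep() simulating latency.
import time

def llm_generate(prompt: str) -> str:
    time.sleep(0.8)                       # simulated LLM inference round trip
    return "NEEDS_CONTEXT" if "Context:" not in prompt else "final answer"

def vector_search(query: str, top_k: int = 3) -> list[str]:
    time.sleep(0.3)                       # simulated vector database retrieval
    return [f"retrieved passage {i}" for i in range(top_k)]

def answer_turn(user_prompt: str) -> str:
    start = time.perf_counter()
    draft = llm_generate(user_prompt)                 # step 1: model signals it needs context
    if draft == "NEEDS_CONTEXT":
        passages = vector_search(user_prompt)         # step 2: nearest-neighbor retrieval
        augmented = f"Context: {' '.join(passages)}\n\n{user_prompt}"
        draft = llm_generate(augmented)               # step 3: second inference pass
    print(f"turn latency: {time.perf_counter() - start:.1f}s")
    return draft

answer_turn("What is the most famous Terminator quote?")  # roughly 1.9s for one turn
```

Even with optimistic per-step latencies, every turn pays for at least two model calls plus a retrieval, and more complex dialogues repeat the cycle.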

Furthermore, the intelligence of a RAG-enriched LLM is significantly dependent on the quality and relevance of the information retrieved from the vector search database. If the database content is not comprehensive, up-to-date, or well-maintained, the utility of the retrieved information might be limited, impacting the overall intelligence of the responses.

Even when high-quality external data is retrieved, the challenge remains in how effectively this information can be integrated into the existing response framework of the LLM. The model must not only incorporate this external data but do so in a manner that is contextually appropriate and coherent. Misalignment between the model’s training and the nature of the external data can lead to responses that are technically accurate but contextually disjointed.

The next generation of LLMs

The next generation of LLMs will likely blend vector search-based RAG with traditional training and fine-tuning methods, along with structured data processing (e.g., SQL databases of TradFi market data and associated financial news). The current pattern of an LLM provider ‘over here’ and a separate vector search database ‘over there’ will converge into new models that natively extend their indexed working memory to local SSDs holding terabytes of vectorized context.

Space and Time has already delivered Proof of SQL—a ZK proof that verifies that SQL database processing is accurate and tamperproof—to clients, and more recently shipped Proof of Vector Search, which does the same for vector search retrieval. These novel proofs open the way for a future where LLMs can integrate fresh context, access a broader and more nuanced spectrum of data in real time, and incorporate structured data processing to yield more insightful analytics, all in a traceable, verifiable way. These advancements will ultimately broaden the scope of applications for LLMs, extending their utility in sectors that rely heavily on up-to-the-minute data, such as financial services, news aggregation, and risk assessment, thus driving forward the next wave of AI-driven innovation.

Scott Dykstra

Co-Founder & CTO

Scott Dykstra is Co-Founder and Chief Technology Officer for Space and Time, as well as Strategic Advisor to a number of database and Web3 technology startups including Sotero. Scott has a tenured history of building and scaling large engineering teams around complex greenfield challenges and research-driven development. With a specialization in enterprise-scale analytics, Scott previously served as a VP of Cloud Solutions at Teradata, where he spent nearly eight years bringing Teradata from on-premise deployments to next-gen, cloud-based SaaS. Scott is a visionary product leader, with product development expertise across Web3, data warehousing, cloud, and derivatives trading. Scott obsesses over beautiful UX/UI. An entrepreneur to the core, Scott counts Space and Time as his second successful endeavor as an executive at a research-driven startup.