State Of The Art Retrieval Augmented Generation - Retriever

Fokke Dekker
#RAG #LLM #SOTA

SOTA RAG Series
This is the fourth blog in our series on building a state-of-the-art retrieval augmented generation (SOTA RAG) pipeline.

In this edition, we explore the concept of retrieving information for RAG.

The Demo Application
LostMinute Travel is our demo application. It offers users a convenient chat interface to plan their ideal vacation. LostMinute Travel uses a combination of input data to provide the best possible travel advice, including data mined from travel brochures, Wikipedia pages, travel blogs, and more.

Architecture Overview

The goal of the Retriever is to find the most relevant documents for a user’s input query. If you’ve been following along with our series, you’ll know that in a state-of-the-art retrieval-augmented generation (SOTA RAG) application, data is stored across various data stores to capture its semantic value and improve retrieval performance. Unlike a “simple demo on your laptop,” we’re dealing with large datasets, requiring a more sophisticated approach to retrieve the right information.

At a high level, the retriever infrastructure runs within a single Cloudflare worker and consists of the key elements described in the sections below.

Query Input

The query input component serves as the entry point for the cognitive architecture to query the retriever. It provides a simple-to-use HTTP POST endpoint that accepts a JSON payload with the user query. This component sequentially calls all the necessary functions and returns a JSON object with ranked results. The query input runs on a Cloudflare worker and ultimately returns a structured JSON object containing the semantic chunk (text), source (URL), tables (tabular data), and keywords.
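To make the shape of this endpoint concrete, here is a minimal sketch of what the query input worker could look like in TypeScript. The `runRetriever` helper and the exact field names are illustrative assumptions; the post only specifies that the response contains the semantic chunk, source, tables, and keywords.

```typescript
// Minimal sketch of the query input worker (assumed names, not the production code).
export interface Env {
  AI: Ai; // Workers AI binding used by the downstream functions
}

interface RetrieverResult {
  chunk: string;      // semantic chunk (text)
  source: string;     // source URL
  tables: unknown[];  // tabular data
  keywords: string[];
}

// Placeholder for the sequential pipeline (user query -> hallucination ->
// semantic chunking -> pull -> rank) described in the rest of this post.
async function runRetriever(query: string, env: Env): Promise<RetrieverResult[]> {
  return [];
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("Send a POST request with a JSON body", { status: 405 });
    }
    const { query } = (await request.json()) as { query: string };
    const results = await runRetriever(query, env);
    return Response.json({ results });
  },
};
```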

User Query Function

Think about how you typically interact with chatbots. User input is often not the most clear and concise, to put it mildly. However, user expectations are high. To address this, we assume by default that the input might be less than perfect, and we use the user query pipeline to process it into better-formulated queries for search.

The provided input query is run through an LLM (llama-2-13b-chat-awq, hosted on Cloudflare), which we use to extract well-formed questions and keywords from the user query.

The output of the user query function is then sent to the semantic chunk function, where it’s broken into chunks suitable for running search queries on the data stores.
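As a rough illustration, the user query function might look something like the sketch below, using the Workers AI binding. The prompt wording and the JSON output shape are assumptions, not the exact prompt used in the pipeline.

```typescript
// Sketch of the user query function. The prompt and the returned JSON shape
// are assumptions; the post only states that questions and keywords are extracted.
async function refineUserQuery(env: { AI: Ai }, rawQuery: string) {
  const result = (await env.AI.run("@hf/thebloke/llama-2-13b-chat-awq", {
    messages: [
      {
        role: "system",
        content:
          "Rewrite the user's message into well-formed search questions and keywords. " +
          'Reply with JSON: {"questions": string[], "keywords": string[]}',
      },
      { role: "user", content: rawQuery },
    ],
  })) as { response: string };

  return JSON.parse(result.response) as { questions: string[]; keywords: string[] };
}
```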

Hallucination Function

The hallucination function serves a similar purpose as the user query function, but instead of refining the user query, it enhances the input with additional content to improve search results. We accomplish this by using an LLM (@hf/thebloke/llama-2-13b-chat-awq run through the Cloudflare AI gateway) to “hallucinate” an answer to the user query. We use the term hallucinate because the answer generated isn’t guaranteed to be correct but provides useful additional context for the query.

For example, if the user query is “What is the capital of France?”, the hallucination pipeline might return, “The capital of France is Paris. Paris is a beautiful city with many attractions such as the Eiffel Tower and the Louvre Museum.” This extra context can improve the relevance of search results.

The output of the hallucination function is sent to the semantic chunk function, where it is broken into chunks suitable for running search queries on the data stores, just like in the user query pipeline.
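A sketch of the hallucination step might look like the following. The prompt is again an assumption, and in the actual pipeline the call is routed through the Cloudflare AI Gateway.

```typescript
// Sketch of the hallucination function: generate a plausible (not necessarily
// correct) answer that adds context for retrieval. Prompt wording is assumed.
async function hallucinateAnswer(env: { AI: Ai }, userQuery: string): Promise<string> {
  const result = (await env.AI.run("@hf/thebloke/llama-2-13b-chat-awq", {
    messages: [
      {
        role: "system",
        content:
          "Answer the question as if you knew the answer. It does not have to be " +
          "accurate; the goal is plausible, context-rich text to aid retrieval.",
      },
      { role: "user", content: userQuery },
    ],
  })) as { response: string };
  return result.response;
}
```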

Semantic Chunk Function

The semantic chunk pipeline is responsible for breaking search queries into semantic chunks that can be used to run searches on the data stores. The input to the semantic chunk pipeline is a list of questions and keywords extracted by both the user query pipeline and the hallucination pipeline.

Unlike regular chunks, semantic chunks break down the user query based on meaning rather than token count, sentence boundaries, or arbitrary rules. This allows the search engine to better interpret the user’s intent and retrieve more relevant results.

To create these semantic chunks, we first split the text into sentences and build chunks of three sentences with a two-sentence overlap (i.e., sentences 1–3, 2–4, 3–5, and so on). Each chunk is embedded using @cf/baai/bge-large-en-v1.5 on Cloudflare. We then loop through the chunks recursively, calculating the cosine similarity between neighboring chunks based on their vector embeddings.
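In code, the windowing and embedding step could look roughly like this; the regex-based sentence splitter is a simplification assumed for the sketch.

```typescript
// Build three-sentence windows with a two-sentence overlap and embed them with
// bge-large on Workers AI. The sentence splitter here is a naive placeholder.
async function embedWindows(env: { AI: Ai }, text: string) {
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.trim().length > 0);

  const windows: string[] = [];
  for (let i = 0; i + 3 <= sentences.length; i++) {
    windows.push(sentences.slice(i, i + 3).join(" ")); // (1,2,3), (2,3,4), ...
  }
  if (windows.length === 0 && sentences.length > 0) {
    windows.push(sentences.join(" ")); // short inputs become a single window
  }

  const { data } = (await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: windows,
  })) as { data: number[][] };

  return windows.map((chunk, i) => ({ chunk, vector: data[i] }));
}
```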

Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. In simpler terms, cosine similarity helps us understand how similar two vectors are. Chunks with a cosine similarity of 0.9 or higher are merged, unless their total length exceeds 500 tokens.
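The similarity computation and the merge rule can be sketched as follows. The token count is only approximated here, and the merge loop is a simplified, iterative version of the recursive pass described above.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rough token estimate; the real pipeline would use a proper tokenizer.
const approxTokens = (text: string) => Math.ceil(text.length / 4);

// Merge neighboring windows with similarity >= 0.9 while staying under 500 tokens.
function mergeChunks(windows: { chunk: string; vector: number[] }[]): string[] {
  if (windows.length === 0) return [];
  const merged: string[] = [];
  let current = windows[0];
  for (let i = 1; i < windows.length; i++) {
    const next = windows[i];
    const candidate = current.chunk + " " + next.chunk;
    if (cosineSimilarity(current.vector, next.vector) >= 0.9 && approxTokens(candidate) <= 500) {
      current = { chunk: candidate, vector: current.vector };
    } else {
      merged.push(current.chunk);
      current = next;
    }
  }
  merged.push(current.chunk);
  return merged;
}
```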

The result is a set of semantic chunks where similar concepts are merged together in chunks of no more than 500 tokens. This process is applied to all inputs, including the output from both the user query function and the hallucination function. The resulting semantic chunks are sent to the pull function to retrieve relevant data from the data stores.

Pull Function

The purpose of the pull function, as the name suggests, is to pull all relevant information from various data stores. If you’ve been following our series, you know we spread the data across multiple data stores, including a vector store and a graph database.

The pull pipeline takes the extracted keywords and semantic chunks and queries each data store with them to retrieve all relevant information, as sketched below.
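Here is a sketch of that fan-out, assuming a Cloudflare Vectorize binding for the vector store. The graph database and any other stores are represented by a placeholder helper, since their bindings aren't shown in this post.

```typescript
// Sketch of the pull function. `VECTOR_INDEX` is an assumed Vectorize binding;
// `queryOtherStores` stands in for the graph-database and keyword lookups.
async function queryOtherStores(env: unknown, keywords: string[]): Promise<unknown[]> {
  return []; // placeholder for graph / keyword queries
}

async function pull(
  env: { AI: Ai; VECTOR_INDEX: VectorizeIndex },
  semanticChunks: string[],
  keywords: string[]
) {
  const results: unknown[] = [];

  for (const chunk of semanticChunks) {
    // Embed each chunk with the same model used at indexing time...
    const { data } = (await env.AI.run("@cf/baai/bge-large-en-v1.5", {
      text: [chunk],
    })) as { data: number[][] };

    // ...and fetch its nearest neighbors from the vector store.
    const response = await env.VECTOR_INDEX.query(data[0], { topK: 10 });
    results.push(...response.matches);
  }

  results.push(...(await queryOtherStores(env, keywords)));
  return results;
}
```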

The result is a large amount of relevant data, which could overwhelm the context window of even the largest LLM. That’s where the rank function comes in to prioritize the most relevant data returned by the pull pipeline.

Rank Function

Now, you might ask: “Why don’t you just return less data?” Excellent question. While all of these techniques are designed to return relevant data, they each do so in their own way. The vector store returns the most closely related vectors (cosine similarity), the graph DB returns data about related entities, and so on. While all of this is relevant, we still need to decide which data is the most relevant. For example, if we only returned 10 items, we would have no guarantee that they are the 10 most relevant items across all of the stores.

In other words, we want to return more data so the ranking algorithms can establish what’s truly relevant. Ranking algorithms can take into account context that individual data stores can’t. For instance, while the vector search works on a single chunk, the full query might contain multiple chunks. The ranking algorithm considers the entire search input, filtering through the returned results from all the stores to return the most relevant pieces.

There are several ways to rank the results from the retrieval pipeline, including LLMRank, TopicRank, EngagementBoost, PreferOld/PreferNew, AvoidSpam, and LuckyRank.

Depending on your use case and available data, you might choose one or more of these ranking methods. Since LostMinute Travel is a demo application, we don’t have engagement data yet, nor do we have multiple dataset versions, making EngagementBoost and PreferOld/PreferNew unavailable for now. Because we indexed Wikipedia, the likelihood of spammy results is low, so we can rule out AvoidSpam. As LuckyRank is primarily for introducing diversity rather than improving quality, we’ve decided not to use it either. That leaves us with LLMRank and TopicRank.

The result is a total of 10 highly relevant documents per input query returned from the retriever pipeline.
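The post doesn't spell out the internals of LLMRank or TopicRank, but an LLM-based re-ranking step could be sketched roughly as follows, with the prompt and the 0–10 scoring scale being assumptions.

```typescript
// Illustrative sketch of an LLM-based re-ranker: score each candidate against
// the full query with the chat model and keep the top N. Prompt is assumed.
async function llmRank(
  env: { AI: Ai },
  query: string,
  candidates: { chunk: string; source: string }[],
  topN = 10
) {
  const scored: { chunk: string; source: string; score: number }[] = [];
  for (const candidate of candidates) {
    const result = (await env.AI.run("@hf/thebloke/llama-2-13b-chat-awq", {
      messages: [
        {
          role: "system",
          content:
            "Rate how relevant the document is to the query on a scale from 0 to 10. " +
            "Reply with only the number.",
        },
        { role: "user", content: `Query: ${query}\n\nDocument: ${candidate.chunk}` },
      ],
    })) as { response: string };
    scored.push({ ...candidate, score: parseFloat(result.response) || 0 });
  }
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```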

Score Function

Having a human in the loop is highly beneficial when working with large language models. Even the best guardrails and ranking algorithms can’t prevent things from going off-track occasionally. This is exactly what the scoring function is designed to address. Every output from the model is tagged with a unique ID (more on that in the upcoming cognitive architecture blog). Users can provide feedback on each generated output by giving it a thumbs-up or thumbs-down.

Each thumbs-up or thumbs-down stores the output ID, the generated content, the input query, and the related data points (i.e., the output from the retrieval pipeline) in a D1 database. Over time, this builds a large dataset that can be used for the EngagementBoost ranking algorithm, improving future outputs by learning from previous user interactions. Additionally, it provides insights into potential blind spots in the data.
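Storing a feedback event in D1 could look roughly like this; the table name and column layout are assumptions, as the post only lists which fields are kept.

```typescript
// Sketch of the score function's write path: persist a thumbs-up/down event
// (output ID, generated content, input query, retrieved data) to D1.
async function recordFeedback(
  env: { DB: D1Database },
  feedback: {
    outputId: string;
    vote: "up" | "down";
    generatedContent: string;
    inputQuery: string;
    retrievedData: unknown; // output of the retrieval pipeline
  }
) {
  await env.DB.prepare(
    `INSERT INTO feedback (output_id, vote, generated_content, input_query, retrieved_data)
     VALUES (?1, ?2, ?3, ?4, ?5)`
  )
    .bind(
      feedback.outputId,
      feedback.vote,
      feedback.generatedContent,
      feedback.inputQuery,
      JSON.stringify(feedback.retrievedData)
    )
    .run();
}
```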

Stay tuned for the next edition of this blog series, where we will go into great detail about how to design a cognitive architecture, and be sure to follow us on LinkedIn to be notified when it’s released. Schedule a free consultation with our team if you want more in-depth information on building a SOTA RAG application yourself or with our help.
