SOTA RAG Series
This is the fourth blog in our series on building a state-of-the-art retrieval-augmented generation (SOTA RAG) pipeline. You can read up on the full series here:
In this fourth edition, we explore how information is retrieved for RAG.
The Demo Application
LostMinute Travel is our demo application. It offers users a convenient chat interface to plan their ideal vacation. To provide the best possible travel advice, LostMinute Travel combines input data mined from travel brochures, Wikipedia pages, travel blogs, and more.
The goal of the Retriever is to find the most relevant documents for a user’s input query. If you’ve been following along with our series, you’ll know that in a state-of-the-art retrieval-augmented generation (SOTA RAG) application, data is stored across various data stores to capture its semantic value and improve retrieval performance. Unlike a “simple demo on your laptop,” we’re dealing with large datasets, requiring a more sophisticated approach to retrieve the right information.
At a high level, the retriever infrastructure runs within a single Cloudflare worker and consists of the following key elements:
The query input component serves as the entry point for the cognitive architecture to query the retriever. Running on a Cloudflare Worker, it exposes a simple HTTP POST endpoint that accepts a JSON payload with the user query, sequentially calls all the necessary functions, and returns a structured JSON object of ranked results containing the semantic chunk (text), source (URL), tables (tabular data), and keywords.
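To give a feel for the overall flow before diving into each stage, here is a minimal sketch of what such an entry point might look like. The helper names (processUserQuery, hallucinate, semanticChunk, pull, rank), the binding layout, and the result shape are assumptions for illustration; the individual stages are described, and sketched, below.

```typescript
// Illustrative entry point for the retriever Worker. The stage helpers are
// declared as stubs here; most of them are sketched individually further down.
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
}

interface RetrievedDoc {
  text: string;       // semantic chunk
  source: string;     // URL
  tables?: unknown;   // tabular data, when available
  keywords?: string[];
}

declare function processUserQuery(
  query: string,
  env: Env
): Promise<{ questions: string[]; keywords: string[] }>;
declare function hallucinate(query: string, env: Env): Promise<string>;
declare function semanticChunk(text: string, env: Env): Promise<string[]>;
declare function pull(chunks: string[], keywords: string[], env: Env): Promise<RetrievedDoc[]>;
declare function rank(query: string, candidates: RetrievedDoc[], env: Env): Promise<RetrievedDoc[]>;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("Method Not Allowed", { status: 405 });
    }
    const { query } = (await request.json()) as { query: string };

    // Refine the raw user query and "hallucinate" additional context.
    const refined = await processUserQuery(query, env);
    const hallucinated = await hallucinate(query, env);

    // Break both outputs into semantic chunks suitable for searching.
    const chunks = [
      ...(await semanticChunk(refined.questions.join(" "), env)),
      ...(await semanticChunk(hallucinated, env)),
    ];

    // Pull candidate documents from the data stores and rank them.
    const candidates = await pull(chunks, refined.keywords, env);
    const ranked = await rank(query, candidates, env);

    return Response.json(ranked);
  },
};
```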
Think about how you typically interact with chatbots. User input is often not the most clear and concise, to put it mildly. However, user expectations are high. To address this, we assume by default that the input might be less than perfect, and we use the user query pipeline to process it into better-formulated queries for search.
The provided input query is run through an LLM (llama-2-13b-chat-awq, hosted on Cloudflare). We use it to extract the following information from the user query:
The output of the user query function is then sent to the semantic chunk function, where it’s broken into chunks suitable for running search queries on the data stores.
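As a rough illustration, the user query function might look something like the sketch below, assuming the Worker exposes a Workers AI binding as env.AI. The prompt and the JSON output shape (questions plus keywords, matching what the semantic chunk pipeline expects) are assumptions, not the exact implementation.

```typescript
// Sketch of the user query function: ask the LLM to restate the raw input as
// clear questions plus search keywords. The prompt and the JSON output shape
// are illustrative assumptions.
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
}

async function processUserQuery(
  query: string,
  env: Env
): Promise<{ questions: string[]; keywords: string[] }> {
  const result = await env.AI.run("@hf/thebloke/llama-2-13b-chat-awq", {
    messages: [
      {
        role: "system",
        content:
          "Rewrite the user's message as one or more clear, self-contained questions " +
          'and extract search keywords. Respond only with JSON: {"questions": [], "keywords": []}',
      },
      { role: "user", content: query },
    ],
  });

  // The model returns free-form text, so parse defensively and fall back to the raw query.
  try {
    return JSON.parse((result as { response: string }).response);
  } catch {
    return { questions: [query], keywords: [] };
  }
}
```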
The hallucination function serves a similar purpose as the user query function, but instead of refining the user query, it enhances the input with additional content to improve search results. We accomplish this by using an LLM (@hf/thebloke/llama-2-13b-chat-awq, run through the Cloudflare AI Gateway) to “hallucinate” an answer to the user query. We use the term hallucinate because the answer generated isn’t guaranteed to be correct but provides useful additional context for the query.
For example, if the user query is “What is the capital of France?”, the hallucination pipeline might return, “The capital of France is Paris. Paris is a beautiful city with many attractions such as the Eiffel Tower and the Louvre Museum.” This extra context can improve the relevance of search results.
The output of the hallucination function is sent to the semantic chunk function, where it is broken into chunks suitable for running search queries on the data stores, just like in the user query pipeline.
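A minimal sketch of the hallucination function, again assuming a Workers AI binding; the prompt and the AI Gateway ID are placeholders rather than the exact setup:

```typescript
// Sketch of the hallucination function: generate a speculative answer whose
// wording enriches the search query. The prompt and gateway ID are placeholders.
interface Env {
  AI: {
    run(
      model: string,
      input: Record<string, unknown>,
      options?: Record<string, unknown>
    ): Promise<unknown>;
  };
}

async function hallucinate(query: string, env: Env): Promise<string> {
  const result = await env.AI.run(
    "@hf/thebloke/llama-2-13b-chat-awq",
    {
      messages: [
        {
          role: "system",
          content:
            "Answer the user's question as best you can, even if you are unsure. " +
            "The answer is only used to enrich a search query and is never shown to the user.",
        },
        { role: "user", content: query },
      ],
    },
    // Route the request through AI Gateway for logging, caching, and rate limiting.
    { gateway: { id: "lostminute-gateway" } }
  );
  return (result as { response: string }).response;
}
```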
The semantic chunk pipeline is responsible for breaking search queries into semantic chunks that can be used to run searches on the data stores. The input to the semantic chunk pipeline is a list of questions and keywords extracted by both the user query pipeline and the hallucination pipeline.
Unlike regular chunks, semantic chunks break down the user query based on meaning rather than token count, sentence boundaries, or arbitrary rules. This allows the search engine to better interpret the user’s intent and retrieve more relevant results.
To create these semantic chunks, we first split the text into sentences and create chunks of three sentences with a two-sentence overlap (i.e., sentences 1-2-3, 2-3-4, 3-4-5). Each chunk is embedded using @cf/baai/bge-large-en-v1.5 on Cloudflare. We then loop through the chunks recursively, calculating their cosine similarity based on the vector embeddings.
Cosine similarity is the cosine of the angle between two vectors; that is, the dot product of the vectors divided by the product of their lengths. In simpler terms, it measures how similar two vectors are. Chunks with a cosine similarity of 0.9 or higher are merged, unless their combined length exceeds 500 tokens.
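In code, this boils down to a few lines over the two embedding vectors; here is a minimal TypeScript version of the formula above:

```typescript
// Cosine similarity exactly as described above: the dot product of the two
// embedding vectors divided by the product of their lengths (magnitudes).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
```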
The result is a set of semantic chunks where similar concepts are merged together in chunks of no more than 500 tokens. This process is applied to all inputs, including the output from both the user query function and the hallucination function. The resulting semantic chunks are sent to the pull function to retrieve relevant data from the data stores.
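Putting the pieces together, a simplified (non-recursive) sketch of the semantic chunk function might look like the following. It reuses the cosineSimilarity helper above, assumes the same Workers AI binding as in the earlier sketches, and approximates token counts by character length, which the real pipeline would handle with a proper tokenizer.

```typescript
// Simplified sketch of the semantic chunk function: build three-sentence
// windows with a two-sentence overlap, embed them with bge-large-en-v1.5, and
// merge neighbouring windows whose cosine similarity is 0.9 or higher, capping
// merged chunks at roughly 500 tokens.
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
}

// Rough token estimate (about four characters per token).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

async function semanticChunk(text: string, env: Env): Promise<string[]> {
  // Split into sentences and build overlapping three-sentence windows.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const windows: string[] = [];
  for (let i = 0; i + 3 <= sentences.length; i++) {
    windows.push(sentences.slice(i, i + 3).join(" "));
  }
  if (windows.length === 0) return [text];

  // Embed every window in a single Workers AI call.
  const embeddings = (await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: windows,
  })) as { data: number[][] };

  // Merge adjacent windows that are highly similar, respecting the token cap.
  // Note: merging overlapping windows duplicates the shared sentences; the real
  // pipeline may deduplicate them.
  const chunks: string[] = [];
  let current = windows[0];
  for (let i = 1; i < windows.length; i++) {
    const similar =
      cosineSimilarity(embeddings.data[i - 1], embeddings.data[i]) >= 0.9;
    const fits = approxTokens(current + " " + windows[i]) <= 500;
    if (similar && fits) {
      current += " " + windows[i];
    } else {
      chunks.push(current);
      current = windows[i];
    }
  }
  chunks.push(current);
  return chunks;
}
```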
The purpose of the pull function, as the name suggests, is to pull all relevant information from various data stores. If you’ve been following our series, you know we use the following data stores:
The pull pipeline takes the extracted keywords and semantic chunks to retrieve all relevant information. Specifically:
The semantic chunks are embedded with @cf/baai/bge-large-en-v1.5 and used to retrieve the 20 most relevant documents per semantic chunk from the vector store.

The result is a large amount of relevant data, which could overwhelm the context window of even the largest LLM. That’s where the rank function comes in: it prioritizes the most relevant data returned by the pull pipeline.
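Before we get to ranking, here is a minimal sketch of the vector-store leg of the pull pipeline, assuming a Vectorize binding named VECTOR_INDEX, the same embedding model that was used at indexing time, and a metadata layout (text, source) chosen purely for illustration:

```typescript
// Sketch of the vector-store leg of the pull pipeline. The VECTOR_INDEX binding
// name and the metadata layout (text, source) are assumptions for illustration.
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
  VECTOR_INDEX: VectorizeIndex;
}

async function pullFromVectorStore(chunk: string, env: Env) {
  // Embed the semantic chunk with the same model used at indexing time.
  const embedding = (await env.AI.run("@cf/baai/bge-large-en-v1.5", {
    text: [chunk],
  })) as { data: number[][] };

  // Retrieve the 20 most relevant documents for this chunk.
  const results = await env.VECTOR_INDEX.query(embedding.data[0], {
    topK: 20,
    returnMetadata: true,
  });

  return results.matches.map((match) => ({
    score: match.score,
    text: match.metadata?.text,
    source: match.metadata?.source,
  }));
}
```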
Now, you might ask: “Why don’t you just return less data?” Excellent question. While all of these techniques are designed to return relevant data, they each do so in their own way. The vector store returns the most closely related vectors (cosine similarity), the graph DB returns data about related entities, and so on. While all of this is relevant, we still need to decide which data is the most relevant. For example, if we only returned 10 items, we might end up with an order like this:
In other words, we want to return more data so the ranking algorithms can establish what’s truly relevant. Ranking algorithms can take into account context that individual data stores can’t. For instance, while the vector search works on a single chunk, the full query might contain multiple chunks. The ranking algorithm considers the entire search input, filtering through the returned results from all the stores to return the most relevant pieces.
There are several ways to rank the results from the retrieval pipeline:
Depending on your use case and available data, you might choose one or more of these ranking methods. Since LostMinute Travel is a demo application, we don’t have engagement data yet, nor do we have multiple dataset versions, making EngagementBoost and PreferOld/PreferNew unavailable for now. Because we indexed Wikipedia, the likelihood of spammy results is low, so we can rule out AvoidSpam. As LuckyRank is primarily for introducing diversity rather than improving quality, we’ve decided not to use it either. That leaves us with LLMRank and TopicRank.
The result is a total of 10 highly relevant documents per input query returned from the retriever pipeline.
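As an illustration of what LLMRank could look like in practice, the sketch below has the LLM score each candidate against the original query and keeps the ten best. The prompt, the 0-10 scale, and the per-candidate scoring are simplifications, not the exact implementation.

```typescript
// One possible shape for LLMRank: have the LLM score each candidate's relevance
// to the original query and keep the ten best.
interface Env {
  AI: { run(model: string, input: Record<string, unknown>): Promise<unknown> };
}

async function llmRank(
  query: string,
  candidates: { text: string; source: string }[],
  env: Env,
  keep = 10
): Promise<{ text: string; source: string; score: number }[]> {
  const scored = await Promise.all(
    candidates.map(async (candidate) => {
      const result = await env.AI.run("@hf/thebloke/llama-2-13b-chat-awq", {
        messages: [
          {
            role: "system",
            content:
              "Rate how relevant the document is to the question on a scale from 0 to 10. " +
              "Reply with the number only.",
          },
          {
            role: "user",
            content: `Question: ${query}\n\nDocument: ${candidate.text}`,
          },
        ],
      });
      const score = parseFloat((result as { response: string }).response) || 0;
      return { ...candidate, score };
    })
  );

  // Highest-scoring documents first, trimmed to the requested count.
  return scored.sort((a, b) => b.score - a.score).slice(0, keep);
}
```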
Having a human in the loop is highly beneficial when working with large language models. Even the best guardrails and ranking algorithms can’t prevent things from going off-track occasionally. This is exactly what the scoring function is designed to address. Every output from the model is tagged with a unique ID (more on that in the upcoming cognitive architecture blog). Users can provide feedback on each generated output by giving it a thumbs-up or thumbs-down.
Each thumbs-up or thumbs-down stores the output ID, the generated content, the input query, and the related data points (i.e., the output from the retrieval pipeline) in a D1 database. Over time, this builds a large dataset that can be used for the EngagementBoost ranking algorithm, improving future outputs by learning from previous user interactions. Additionally, it provides insights into potential blind spots in the data.
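A minimal sketch of how such feedback could be stored in D1, assuming a DB binding and a feedback table with illustrative column names:

```typescript
// Sketch of the scoring function: store each thumbs-up/down together with the
// output ID, generated content, input query, and retrieved data points in D1.
// The table and column names are illustrative.
interface Env {
  DB: D1Database;
}

interface Feedback {
  outputId: string;
  generated: string;     // the model output the user voted on
  query: string;         // the original input query
  retrieved: unknown[];  // output of the retrieval pipeline
  vote: "up" | "down";
}

async function recordFeedback(feedback: Feedback, env: Env): Promise<void> {
  await env.DB.prepare(
    `INSERT INTO feedback (output_id, generated, query, retrieved, vote, created_at)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6)`
  )
    .bind(
      feedback.outputId,
      feedback.generated,
      feedback.query,
      JSON.stringify(feedback.retrieved),
      feedback.vote,
      new Date().toISOString()
    )
    .run();
}
```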
Stay tuned for the next edition of this blog series, where we will go into great detail about how to design a cognitive architecture, and be sure to follow us on LinkedIn to be notified when it’s released. Schedule a free consultation with our team if you want more in-depth information on building a SOTA RAG application yourself or with our help.