In the midst of the buzz surrounding Large Language Models (LLMs), one term that has emerged is retrieval augmented generation (RAG). This article aims to provide a clear perspective on RAG, shedding light on its definition, practical applications, limitations, and potential for the future.
RAG is an approach to teaching pre-trained LLMs new information without retraining them. This becomes particularly important when considering the staggering costs associated with training large language models.
For instance, GPT-4, a popular LLM, is said to have incurred a mind-boggling cost of approximately $100 million. While this might be mere pocket change for well-funded companies such as OpenAI, backed by tech juggernaut Microsoft, it remains an insurmountable barrier for most organizations. Fortunately, RAG can enable us to teach pre-trained LLMs new information. Who said you couldn’t teach an old dog new tricks?!
The RAG approach involves presenting the LLM with a collection of new information, followed by a prompt about that information. This technique is particularly valuable for concepts previously unknown to the LLM. Let’s look at a simple example. We first introduce the LLM to a set of category examples and subsequently prompt it to categorize an unfamiliar sample using the newly acquired category system.
PROMPT: car = category 1, bike = category 2, skateboarding = category 2, truck = category 1, cycling = category 2. Which category belongs to rollerblading?
Output: Rollerblading is in the same category as skateboarding which is in category 2.
As the trained eye may have noticed, we separated the categories into motor-powered and human-powered transport. While this is a completely made-up example, the LLM accurately picked the right category without any retraining of the model.
You could make the case that the LLM already had some knowledge about rollerblading and skateboarding, causing it to group them together based on their shared characteristics instead of recognizing them as distinct forms of human-powered transportation. Nevertheless, it did manage to place them in the correct category.
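If you'd like to reproduce this yourself, here is a minimal sketch of sending that prompt to a hosted model through OpenAI's Python client (any chat-completion API would work just as well; the `openai` package and an API key in your environment are assumed):

```python
# Minimal sketch: send the category prompt to a hosted LLM.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

prompt = (
    "car = category 1, bike = category 2, skateboarding = category 2, "
    "truck = category 1, cycling = category 2. "
    "Which category belongs to rollerblading?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```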
While this is a simple example, the potential for extending RAG to a much larger scale should not be overlooked. For example, if your company uses an internal documentation system, such as Google Docs, Notion, Jira, or any other knowledge management platform, chances are you’ve experienced the phenomenon of these systems gradually evolving into information black holes, sucking the life out of your productivity.
While that might sound a bit discouraging, with the help of an LLM, we can transform those information repositories into a ChatGPT-style assistant that answers questions from your internal knowledge base.
By simply inputting all the information from the repository into the LLM as context (this is the “augmentation” part of retrieval augmented generation), you can ask it any question you like and get meaningful answers.
For example, if you include all the documents as context, the LLM can easily address questions like, “What’s our strategy for introducing new products?” or “When do we usually publish new blog posts?” It can answer any question you like as long as the answers live somewhere in that treasure trove of internal documentation.
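In its most naive form, that just means concatenating every document into a single prompt. A rough sketch, where `docs` and `ask_llm` are placeholders for your document loader and whichever LLM client you use:

```python
# Naive approach: stuff the entire knowledge repository into the prompt.
# `docs` is a list of document strings and `ask_llm` is a placeholder for
# whatever LLM client you use (e.g. the OpenAI call shown earlier).
def build_prompt(question: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    return f"Answer this question: {question}\n\nBased on this information:\n{context}"

# answer = ask_llm(build_prompt("What's our strategy for introducing new products?", docs))
```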
However, this approach has two main problems, and the remainder of this blog will offer solutions to these common issues.
First of all, providing the LLM with everything stored in your knowledge repository decreases the chances of it delivering accurate answers. If only about 1% (or less) of the documents you feed in are truly relevant to the question, the other 99% is noise, and the chances of getting a wrong answer are significantly higher.
Secondly, most, if not all, LLMs are limited by their context window. This essentially means there’s a cap on the amount of input they can handle. For instance, GPT-3.5 has a context window of 4,096 tokens, which translates to roughly 3,000 words. Although this might seem substantial, it’s highly unlikely to be enough to accommodate your entire company’s knowledge repository.
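You can check how far over that cap your repository would be by counting tokens, for example with OpenAI's tiktoken library (a rough sketch; the 4,096-token figure applies to the original GPT-3.5):

```python
# Count tokens for a set of documents with tiktoken (pip install tiktoken).
# cl100k_base is the encoding used by GPT-3.5 and GPT-4.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

docs = ["...contents of doc 1...", "...contents of doc 2..."]  # your documents
total_tokens = sum(len(encoding.encode(doc)) for doc in docs)

print(f"Total tokens: {total_tokens}")
print("Fits in a 4,096-token context window:", total_tokens <= 4096)
```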
Luckily, there is a solution: vector databases!
RAG does come with its share of challenges, particularly when it comes to the context window restriction of LLMs. However, there’s a potential solution in the form of vector DBs. But before we dive into this solution, let’s take a quick moment to understand what vector DBs are, how they function, and how we can leverage them for RAG purposes.
Vector DBs, as the name might suggest, store vectors; you know those pesky things you had to learn about for your linear algebra class?
Putting humor aside, the integration of vector DBs with vector embeddings (a method for converting text into vector form) offers a great way of transforming and storing vector representations of our knowledge repository. You might be familiar with popular tools such as Pinecone, Qdrant, ChromaDB, and Vectorize by Cloudflare.
By transforming our text into vectors and housing them within a vector DB, we can tap into a range of useful possibilities. For example, we can now employ search algorithms such as k-nearest neighbor (KNN) and other methods to run similarity searches between input queries and our documents.
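Here is a small sketch of that idea, using sentence-transformers for the embeddings and scikit-learn's NearestNeighbors for the KNN search; both stand in for whichever embedding model and vector DB you actually use:

```python
# Embed a few documents and run a KNN similarity search against a query.
# sentence-transformers and scikit-learn stand in for a hosted embedding
# model and a real vector DB such as Pinecone, Qdrant, ChromaDB, or Vectorize.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

docs = [
    "Our product launch strategy is documented in the Q3 planning doc.",
    "Blog posts are usually published every other Tuesday.",
    "The office coffee machine is cleaned on Fridays.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs)

# Build the "vector DB": an index that supports cosine-similarity KNN search.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(doc_vectors)

query_vector = model.encode(["When do we usually publish new blog posts?"])
distances, indices = index.kneighbors(query_vector)

for i in indices[0]:
    print(docs[i])  # the most relevant documents come back first
```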
This approach has the potential to tackle both of the issues outlined above. Instead of overwhelming the LLM with our entire knowledge repository, we can now selectively feed it only the most relevant documents: a KNN search with K=7, for example, returns just the seven most similar docs. This resolves the context window problem and ensures that the LLM is exposed only to material that is truly relevant.
A retrieval augmented generation pipeline utilizing a vector DB might look something like this.
1. The user inputs their query.
2. The query is embedded using the same embedding model that was used to turn all the documents into vectors.
3. A KNN search is run inside the vector DB to retrieve the K most similar documents.
4. The output of the search, together with the initial query, is sent to the LLM, where we ask it something like:

Answer this question: [query] based on this information: [most relevant docs].
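Put together, the whole pipeline is only a few lines. Here is a sketch that reuses the embedding model, index, and documents from the earlier example, with a hypothetical `ask_llm` helper standing in for whichever LLM you call:

```python
# End-to-end RAG sketch: embed the query, retrieve the K most similar docs
# from the vector index, and hand both to the LLM.
# `model`, `index`, and `docs` come from the previous example; `ask_llm`
# is a placeholder for your LLM client of choice.
K = 2

def answer(query: str) -> str:
    query_vector = model.encode([query])                        # steps 1-2: embed the query
    _, indices = index.kneighbors(query_vector, n_neighbors=K)  # step 3: KNN search
    relevant_docs = "\n\n".join(docs[i] for i in indices[0])    # the K most similar docs
    prompt = (                                                  # step 4: build the final prompt
        f"Answer this question: {query}\n\n"
        f"based on this information:\n{relevant_docs}"
    )
    return ask_llm(prompt)
```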
This approach has a couple of notable advantages. Most importantly, we can send only the relevant text to the LLM for processing, reducing the need for very large context windows.
Using a vector DB together with any LLM enables anyone to build their own ChatGPT-style assistant on private data without spending millions on training models.
RAG is a great tool to build LLM-powered chat applications on your data sources as long as you have access to a vector database, a platform to create embeddings, and an LLM (which you might have guessed are all included in the Cloudflare platform).
As we mentioned earlier, a key motivation for utilizing a vector DB in the context of RAG is to trim down the size of your input queries. However, if you’ve been following developments in the field of LLMs, you’ve likely noticed a continuous expansion of the context window of these models.
For instance, while GPT-3.5 started with a context window of 4,096 tokens upon its initial release, GPT-4 supports up to 32,768. Even more impressively, there are already models with far larger context windows, such as Anthropic’s Claude family, which supports an astonishing 100K tokens and beyond.
Given that these context windows seem to double with each model generation (a nod to Moore’s law), you might naturally wonder whether a vector DB will still be needed for retrieval augmented generation much longer.
While it’s challenging to predict the exact future capabilities of these models, we think that vector DBs will continue to play a significant role in the domain of retrieval augmented generation, and there are several reasons why.
For starters, commercial LLMs are likely to continue to charge based on the number of tokens used, meaning it’s in your wallet’s best interest to minimize the length of your queries. A simple similarity search can be a valuable tool here: reducing the input prompt can make the difference between a $1K monthly bill and a $10K monthly bill.
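As a purely hypothetical back-of-the-envelope calculation (the price and query volume below are made up for illustration only):

```python
# Hypothetical numbers for illustration only: a made-up price of $0.01 per
# 1K input tokens and 50,000 queries per month.
price_per_1k_tokens = 0.01
queries_per_month = 50_000

full_context_tokens = 20_000  # stuffing a large chunk of the repository into every prompt
top_k_tokens = 2_000          # sending only the K most relevant documents

full_context_bill = full_context_tokens / 1_000 * price_per_1k_tokens * queries_per_month
top_k_bill = top_k_tokens / 1_000 * price_per_1k_tokens * queries_per_month

print(f"Full-context bill: ${full_context_bill:,.0f}/month")  # $10,000/month
print(f"Top-K bill:        ${top_k_bill:,.0f}/month")         # $1,000/month
```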
Furthermore, even though providing a lot of context often leads to better LLM performance, there’s a limit to its effectiveness. Overloading your LLM with excessive information might eventually lead to a worse reply. Increasing the relevance of your input query using a vector DB will only strengthen the output. Remember the old saying, “garbage in, garbage out”? It still applies today!