State Of The Art Retrieval Augmented Generation - Indexer

SOTA RAG Series
This is the third blog in our series on building a state-of-the-art retrieval augmented generation (SOTA RAG) pipeline. You can read up on the full series here:

Introduction
Data Stores
Indexer (you are here)
Retriever (coming soon)
Cognitive Architecture (coming soon)
Systems Architecture and Design (coming soon)

In this third edition of our state-of-the-art retrieval augmented generation (SOTA RAG) blog series, we dive into the main content, starting with the indexer. The indexer breaks down input elements into core components and stores them in various data stores. It also manages dataset versions, updates metadata, and ensures no duplicate data exists in any data sources.

Check out the previous edition of this blog to learn more about the data stores used in the demo application.

The Demo Application
LostMinute Travel is our demo application. It offers users a convenient chat interface to plan their ideal vacation. LostMinute Travel uses a combination of input data to provide the best possible travel advice. This includes data mined from travel brochures, Wikipedia pages, travel blogs, and more.

Input Data

Throughout this blog, we refer to two types of input data: higher-level input data and derived input data.

For our demo application, higher-level input data includes PDF files, web pages, and Wikipedia archive entities. We can split these into lower-level, or derived, input data, such as the images, text, and tables extracted from a web page.

Architecture

In this section, we discuss the architecture of the indexer pipeline.

Indexers

The indexer application consists of multiple individual Workers configured for specific tasks. Here are the components, with more details on each in the sections below.

Web Indexer - The web indexer takes a web page as input, deconstructs it into its core components such as text, images, and tables, and sends each data component to the relevant downstream processor.
Wikipedia Indexer - The Wikipedia indexer takes a Wikipedia archive as input. It determines the type and, if needed, deconstructs the archive into its core components such as text, images, and tables, and sends them to the relevant downstream processor.
Document Indexer - The document indexer takes a PDF document URL as input, deconstructs it into its core components such as text, images, and tables, and sends them to the relevant downstream processor.
Tabular Indexer - The tabular indexer takes a CSV file or deconstructed component from a higher-level data source and stores it in a D1 relational database.
Image Indexer - The image indexer takes a picture directly from the input or a higher-level data source, describes the picture in text and topics, and sends it to the text processor.
Text Indexer - The text indexer takes the extracted text from any of the higher-level processors and stores it in the relevant data stores.

Each processor runs on Cloudflare Workers. Processors communicate by sending messages over Cloudflare queues and storing their generated data objects in our data lake running on Cloudflare R2 and their metadata in Cloudflare D1.

You can visualize the entire architecture as follows:

[Figure: entire indexer diagram]

Indexers consume a single queue (their input queue) but can produce to multiple queues. For example, the higher-level webpage indexer produces on the image, text, and tabular queues. All processors read input data from a single R2 bucket but can write to multiple output buckets. This means that the web indexer writes output objects to three buckets (text, tabular, and image) and produces messages on the related queues. A generic indexer looks something like this:

[Figure: generic indexer diagram]
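
As a rough illustration of this pattern, the sketch below models the routing step of one indexer: for every derived component it decides which bucket receives the object and which queue receives the follow-up message. The type names, the bucket and queue naming scheme, and the key layout are all assumptions made for this example; they are not the actual Cloudflare bindings or our production code.

```typescript
// Kinds of derived data an indexer can emit; each maps to one bucket and one queue.
type ComponentKind = "text" | "image" | "tabular";

interface DerivedComponent {
  kind: ComponentKind;
  body: string; // serialized payload; real payloads may be binary
}

interface OutputPlan {
  bucket: string; // hypothetical bucket name
  queue: string; // hypothetical queue name
  key: string; // object key inside the bucket
  message: { key: string; parentKey: string; datasetVersion: number };
}

// Pure routing step: decide where every derived component goes.
// A real Worker would then write each object to R2 and send each message.
function planOutputs(
  parentKey: string,
  datasetVersion: number,
  components: DerivedComponent[],
): OutputPlan[] {
  return components.map((component, index) => {
    const key = `${datasetVersion}/${parentKey}/${component.kind}-${index}`;
    return {
      bucket: `${component.kind}-bucket`,
      queue: `${component.kind}-queue`,
      key,
      message: { key, parentKey, datasetVersion },
    };
  });
}
```

Keeping the routing decision separate from the actual writes makes it easy to test without any Cloudflare bindings.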

Getting some of these indexers to work on Cloudflare was harder than expected due to the limitations of the Workers ecosystem. In particular, many native Node libraries are not supported.

Deduplication and Data Versioning

We store metadata about each input object and all derived lower-level data such as tables, images, and text. This metadata includes a dataset version set to the received time in epoch seconds. Storing a dataset version is important for various reasons.

It allows the pipeline to filter and serve data based on a version that matches the user’s input, and it helps us debug the system more easily once live. For example, the application operator can see whether issues with the system only arise with a certain version of the data. Furthermore, it allows the user to selectively delete data, for example to honor GDPR or other privacy regulation requests.

In other words, it gives us data management and lineage.

Additionally, each metadata profile includes a SHA signature of the input object. This allows us to easily check for duplicates. If the metadata table already includes the same input signature, we reject the data point based on duplication constraints.
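
A minimal sketch of this check is shown below. It assumes SHA-256 as the hash (the text above only says "SHA"), uses the Web Crypto API that Workers expose, and replaces the D1 metadata table with a hypothetical in-memory `MetadataIndex` class so the example stays self-contained.

```typescript
interface ObjectMetadata {
  key: string;
  sha256: string;
  datasetVersion: number; // received time in epoch seconds
}

// Hex-encode the SHA-256 digest of the raw object bytes using Web Crypto.
async function sha256Hex(data: ArrayBuffer): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}

// In-memory stand-in for the metadata table (hypothetical; the real metadata lives in D1).
class MetadataIndex {
  private bySignature = new Map<string, ObjectMetadata>();

  // Returns the new record, or null when an object with the same signature already exists.
  async register(
    key: string,
    data: ArrayBuffer,
    receivedAtMs: number,
  ): Promise<ObjectMetadata | null> {
    const sha256 = await sha256Hex(data);
    if (this.bySignature.has(sha256)) return null; // duplicate input, reject
    const record: ObjectMetadata = {
      key,
      sha256,
      datasetVersion: Math.floor(receivedAtMs / 1000),
    };
    this.bySignature.set(sha256, record);
    return record;
  }
}
```

With a relational table, the same effect can usually be achieved by putting a unique constraint on the signature column and treating a constraint violation as a duplicate.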

Web Indexer

Receives From	Produces For
HTTP input	Image Processor
Wikipedia Processor	Text Processor
	Tabular Processor

The web indexer takes HTML pages as input and extracts relevant derived input data such as text, images, and tabular data.

We use the html-to-text library to extract text from raw HTML. This library is fully compatible with Cloudflare Workers. The web indexer sends the extracted text directly to the text indexer without any additional processing.

To extract all images from a webpage, we use the Cloudflare built-in library called HTML-rewriter. This library allows us to search for all img tags on a page and retrieve the src URL. The web indexer fetches the images, stores them in our data lake (R2), and sends a message to the image indexer to process them.

Extracting image URLs from HTML

/**
 * Extracts image URLs from an HTML string.
 * 
 * @param htmlString - The HTML string to extract image URLs from.
 * @param baseUrl - The base URL to resolve relative URLs against.
 * @returns A promise that resolves to an array of extracted image URLs.
 */
async function extractImageUrlsFromHtml(htmlString: string, baseUrl: string): Promise<string[]> {

	// create list to store image urls
	const imageUrls: string[] = [];

	// Create a new HTMLRewriter instance and extract image URLs
	const rewriter = new HTMLRewriter()
		.on('img', {
			element(element) {
				const src = element.getAttribute('src');
				if (src) {
					// Resolve relative URLs against the base URL
					const resolvedUrl = new URL(src, baseUrl).toString();
					imageUrls.push(resolvedUrl);
				}
			}
		});

	const response = new Response(htmlString);
	await rewriter.transform(response).text();

	// return all image urls
	return imageUrls;
}

We use tabletojson to retrieve any tabular data from the HTML page. This library is easy to use and provides a fast interface for extracting tables from web pages. It takes raw HTML as input and produces one array of JSON objects per table, with one object per row. For example, imagine you have the following table on your HTML page.

Name	Age	City
Alice	30	New York
Bob	25	Boston
Carol	28	Chicago

Tabletojson returns it as an array of arrays of JSON objects. The extracted table data is sent to the tabular pipeline for further processing.

Example output tabletojson

[
  [
    {
      "Name": "Alice",
      "Age": "30",
      "City": "New York"
    },
    {
      "Name": "Bob",
      "Age": "25",
      "City": "Boston"
    },
    {
      "Name": "Carol",
      "Age": "28",
      "City": "Chicago"
    }
  ]
]

As you can see, the web indexer is what we call a higher-level indexer. It extracts lower-level data components from its input and passes them along the pipeline for indexing at a later stage. We use this concept in many places in the indexer pipeline.

By utilizing Cloudflare Workers and queues in this way, we can scale our pipeline indefinitely (or until Cloudflare runs out of compute resources). This greatly improves the indexing speed and allows us to control the scaling of each element individually, giving us more control over our spending.

Wikipedia Indexer

Receives From	Produces For
Object Notification	Image Processor
	Text Processor

The Wikipedia indexer is designed to take in Wikipedia archives known as Zim archives. Wikipedia provides various downloads of their entire dataset. We opted for the Zim archive as it includes not only the raw HTML but also any images on Wikipedia.

The Wikipedia processor is the only processor on the list that has a locally run component. This component is required due to the limited processing time, memory and queue size of Cloudflare Workers and queues. These limits do not work for a dataset the size of all of Wikipedia. The latest download of the English Wikipedia is ~100GB. Looping through the entire archive can take hours, and Cloudflare Workers are limited to a 30-second runtime, which is not nearly enough to process the entire archive.

Instead, we wrote a local component that loops through the entire archive, looks for HTML and image files, and uploads them individually to our data lake R2. We configured an event notification on R2 to start the processing of the uploaded file.

The Wikipedia processor is once again a simple worker that routes the input requests to the relevant downstream indexers. In this case, it sends web pages to the web indexer and images to the image indexer, as one might expect.

To keep things simple downstream, we decided to only consider objects of type text/html, image/png, and image/jpeg. This constitutes about 99% of all entities in the archive. When working with data at this scale, it is important to properly assess the need for highly specialized indexers. We could have adjusted the image indexer to accept every image type under the sun. However, this would take a considerable amount of time and only yield a very minor increase in indexed objects.
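
The routing decision itself can be very small. The sketch below assumes the uploaded objects carry standard MIME type strings (text/html, image/png, image/jpeg); the `Route` names and the `routeArchiveEntry` function are hypothetical and only illustrate the allow-list idea.

```typescript
type Route = "web-indexer" | "image-indexer" | "skip";

// Allow-list of content types; everything else is skipped on purpose.
const ROUTES = new Map<string, Route>([
  ["text/html", "web-indexer"],
  ["image/png", "image-indexer"],
  ["image/jpeg", "image-indexer"],
]);

function routeArchiveEntry(contentType: string): Route {
  // Drop parameters such as "; charset=utf-8" and normalize case before the lookup.
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return ROUTES.get(mime) ?? "skip";
}
```

Supporting another format later means adding one entry to the map, provided a downstream indexer can handle it.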

Document Indexer

Receives From	Produces For
HTTP Input	Image Processor
	Text Processor
	Tabular Processor

The document indexer might seem like one of the simplest indexers to create, as almost every basic RAG tutorial does exactly that.

However, these simple tutorials only look at the text content of the document. As you can see from the table above, we extract text, images, and tabular data from documents.

Building a document processor on Cloudflare is more challenging than running one on your laptop, primarily due to the limitations of the Workers ecosystem. Not all libraries are supported on Cloudflare Workers; specifically, Node's fs module and Canvas do not work. Almost every PDF library depends on fs and, as a result, does not work on Workers.

Luckily, there are specific libraries designed for serverless architectures like Workers. We opted for unpdf. It is essentially pdfjs minus the libraries that Cloudflare does not support. Exactly what we need!

Extracting Text from PDFs

Unpdf makes extracting text from PDFs trivial. The extracted text is sent directly to the text processor. One thing to note is that this includes all text, including the text in tables. This creates a bit of redundancy, but we expect that to be filtered out by the rank and score algorithms in our retriever (more on the retriever in our next blog).

Extract text from PDF

import { extractText } from 'unpdf';
const { totalPages, text } = await extractText(objectBuffer, { mergePages: true });

Extracting Images from PDFs

That leaves us with image and tabular extraction, both of which are unfortunately not as straightforward as one might expect.

Images in PDFs are generally contained in XObjects (External Objects). These external objects contain metadata about the images and the images themselves in binary form. Nearly all images in PDFs are either PNG or JPEG. To avoid implementing a long tail of file formats, we decided to focus solely on these two file types.

JPEG images can be directly extracted from the XObject and sent to the image pipeline. PNG images, however, have to be decompressed and reconstructed. This is due to the way these images are compressed.

Using the metadata contained in the XObject, we can reconstruct the deflated PNG images.

Converting image data to PNG

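import UPNG from 'upng-js'; // assumption: UPNG comes from the upng-js package

/**
 * Converts raw RGB pixel data taken from a PDF image XObject into a PNG.
 *
 * @param data - The RGB samples (3 bytes per pixel), keyed by index.
 * @param width - The image width in pixels.
 * @param height - The image height in pixels.
 * @returns The encoded PNG as an ArrayBuffer.
 */
function convertPNG(data: Record<number, number>, width: number, height: number): ArrayBuffer {

	// Convert the input data to a Uint8Array
	const imageDataArray = new Uint8Array(Object.values(data));

	// Create a raw RGBA image buffer, never going past width * height pixels
	const pixelCount = Math.min(width * height, Math.floor(imageDataArray.length / 3));
	const rgba = new Uint8Array(width * height * 4);
	for (let i = 0; i < pixelCount; i++) {
		rgba[i * 4] = imageDataArray[i * 3];         // Red
		rgba[i * 4 + 1] = imageDataArray[i * 3 + 1]; // Green
		rgba[i * 4 + 2] = imageDataArray[i * 3 + 2]; // Blue
		rgba[i * 4 + 3] = 255;                       // Alpha (fully opaque)
	}

	// Encode the raw image buffer to PNG format (0 = lossless, keep all colors)
	return UPNG.encode([rgba.buffer], width, height, 0);
}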

If you want to learn more about extracting image data from PDFs, schedule a call with us. We are happy to elaborate during a free consultation session.

Extracting Tables

Extracting tables presented quite a challenge. We couldn’t find a specific library for this task (if you know one, let us know!). We tried various approaches, such as converting the PDF into images and using a vision model to extract tables. This method proved difficult and yielded too many false positives, i.e., it often included the text surrounding the table. Furthermore, the lack of support for Canvas in Cloudflare Workers meant we had to look for a different approach.

The best approach involved feeding the extracted text into an LLM and asking it to extract tables in HTML format. We then processed the LLM output with the tabletojson library to extract the tables. While not 100% accurate, the results were acceptable, especially since the tabular data also appears in the extracted text and is therefore indexed twice.

We are hopeful that the development and improvement of multimodal models will make this task a lot easier in the future.

System prompt

You are a PDF table extractor, a backend processor.
- User input is messy raw text extracted from a PDF page by PDF.js.
- Do not output any body text, we are only interested in tables.
- The goal is to identify tabular data, and reproduce it cleanly as an HTML table.
- Reproduce each separate table found in page.

User Prompt

raw pdf text; extract and format tables: ${page_text}
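
The two prompts above can be assembled per page roughly as follows. This is a sketch under assumptions: the `ChatMessage` shape follows the common system/user chat format rather than any specific model API, and `extractHtmlTables` is a hypothetical helper that keeps only the table fragments of the reply before they are parsed.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

const TABLE_EXTRACTOR_SYSTEM_PROMPT = [
  "You are a PDF table extractor, a backend processor.",
  "- User input is messy raw text extracted from a PDF page by PDF.js.",
  "- Do not output any body text, we are only interested in tables.",
  "- The goal is to identify tabular data, and reproduce it cleanly as an HTML table.",
  "- Reproduce each separate table found in page.",
].join("\n");

// Build the message pair for one page of extracted PDF text.
function buildTableExtractionMessages(pageText: string): ChatMessage[] {
  return [
    { role: "system", content: TABLE_EXTRACTOR_SYSTEM_PROMPT },
    { role: "user", content: `raw pdf text; extract and format tables: ${pageText}` },
  ];
}

// Keep only <table>...</table> fragments so stray commentary from the model is ignored.
// Nested tables are not handled by this simple pattern.
function extractHtmlTables(llmOutput: string): string[] {
  return llmOutput.match(/<table[\s\S]*?<\/table>/gi) ?? [];
}
```

Filtering the reply this way guards against the model adding an introduction or closing remark around the tables.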

Tabular Indexer

Receives From	Produces For
Ingest	NA
Web Processor
Wikipedia Processor
Document Processor

The tabular pipeline takes a CSV file through the ingest location or an extracted table from a higher-level data source and stores the data in D1. It creates a new table in D1 for each table it receives.
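
To illustrate what creating a table per input can look like, the sketch below turns one extracted table (an array of JSON objects, as in the tabletojson example earlier) into SQLite-style statements. Storing every column as TEXT, the identifier quoting, and the `buildTableStatements` name are assumptions made for this example.

```typescript
type TableRow = Record<string, string>;

interface SqlStatement {
  sql: string;
  params: string[];
}

// Quote an identifier for SQLite, doubling any embedded double quotes.
function quoteIdentifier(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build one CREATE TABLE statement plus one parameterized INSERT per row.
function buildTableStatements(tableName: string, rows: TableRow[]): SqlStatement[] {
  if (rows.length === 0) return [];

  // Use the union of all keys so rows with missing cells still fit the schema.
  const columns = Array.from(new Set(rows.reduce<string[]>((keys, row) => keys.concat(Object.keys(row)), [])));
  const table = quoteIdentifier(tableName);
  const columnList = columns.map(quoteIdentifier).join(", ");
  const columnDefs = columns.map((column) => `${quoteIdentifier(column)} TEXT`).join(", ");
  const placeholders = columns.map(() => "?").join(", ");

  const create: SqlStatement = {
    sql: `CREATE TABLE IF NOT EXISTS ${table} (${columnDefs})`,
    params: [],
  };
  const inserts = rows.map((row) => ({
    sql: `INSERT INTO ${table} (${columnList}) VALUES (${placeholders})`,
    params: columns.map((column) => row[column] ?? ""),
  }));
  return [create, ...inserts];
}
```

Values are passed as bound parameters rather than concatenated into the SQL string, which matters because cell contents come from untrusted web pages and documents.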

Cloudflare sets the default storage limits for D1 to 250GB per account and 10GB per database. You can contact Cloudflare support to increase the former limit. We recommend doing so, as any serious SOTA RAG application will likely need more than 250GB of storage. However, the latter is a hard limit. Our application manages this by creating multiple databases as needed and sharding the data between them.
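
One simple way to think about this sharding is first-fit placement: put each new table in the first database that still has room, and add a database when none does. The sketch below shows only that bookkeeping. The `Shard` type, the `tabular-N` naming, and the idea of tracking used bytes ourselves are assumptions; how the databases are actually created and bound is out of scope here.

```typescript
const MAX_DATABASE_BYTES = 10 * 1024 * 1024 * 1024; // 10GB per database, as stated above

interface Shard {
  name: string;
  usedBytes: number;
}

// First-fit placement: reuse the first database with enough room, otherwise add a new one.
function placeTable(shards: Shard[], tableBytes: number, limit = MAX_DATABASE_BYTES): Shard {
  if (tableBytes > limit) {
    throw new Error("table is larger than a single database allows");
  }
  let shard = shards.find((candidate) => candidate.usedBytes + tableBytes <= limit);
  if (!shard) {
    shard = { name: `tabular-${shards.length}`, usedBytes: 0 };
    shards.push(shard);
  }
  shard.usedBytes += tableBytes;
  return shard;
}
```

The returned shard name would need to be stored alongside the table's metadata so that later reads know which database to query.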

Image Indexer

Receives From	Produces For
Ingest	Text Processor
Web Processor
Wikipedia Processor
Document Processor

The image indexer takes input from any higher-level data source and from images directly provided through the ingest endpoint. It uses llava-1.5-7b-hf to generate textual image descriptions and resnet-50 to generate image classifications.

The image indexer sends both the textual description and the image classifications to the text indexer. It uses the image classifications as entities for the graph database, while the text description is indexed as regular text. More on the graph database and entity relations in the section below.

Text Indexer

Receives From	Produces For
Ingest	NA
Web Processor
Wikipedia Processor
Document Processor
Image Processor

All roads lead to Rome, or in our case, all roads lead to the text indexer. Ultimately, almost all inputs (except for the tabular data) are converted into text and processed in the text indexer.

The text indexer takes the text input and cuts it into contextual chunks using semantic chunking. Semantic chunking groups information based on meaning and context, while regular chunking uses fixed criteria like word count. This method creates coherent, meaningful units, so each chunk is easier to interpret on its own. Because it follows the relationships between sentences rather than a fixed size, semantic chunking tends to be more effective for tasks like natural language processing and information retrieval.
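
One common way to implement semantic chunking, shown here as a sketch rather than as our exact implementation, is to embed each sentence and start a new chunk whenever the cosine similarity between neighbouring sentences drops below a threshold. The embeddings are passed in as plain arrays, and the function names and the 0.75 default threshold are assumptions for illustration.

```typescript
// Cosine similarity between two vectors; returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Group consecutive sentences while each one stays similar to the sentence before it.
function semanticChunks(sentences: string[], embeddings: number[][], threshold = 0.75): string[] {
  if (sentences.length !== embeddings.length) {
    throw new Error("one embedding per sentence is required");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i];
    const current = embeddings[i];
    if (sentence === undefined || current === undefined) continue;
    const previous = embeddings[i - 1]; // undefined for the first sentence
    const lastChunk = chunks[chunks.length - 1];
    if (lastChunk && previous && cosineSimilarity(previous, current) >= threshold) {
      lastChunk.push(sentence); // same topic, extend the current chunk
    } else {
      chunks.push([sentence]); // topic shift, start a new chunk
    }
  }
  return chunks.map((chunk) => chunk.join(" "));
}
```

The threshold controls granularity: a higher value produces more, smaller chunks, while a lower value merges more sentences together.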

We identify the language of each chunk using 51-languages-classifier. We use this to identify non-English text in the Wikipedia input (many articles contain non-English words) and mark it as such so we can filter it out of any vector search. In a future edition, we may translate such text into English.

We then extract the topics and entities using Babelscape wikineural-multilingual-ner, but we plan to update this to Llama-3 NER/Topic/Relationship for better performance. These topics and entities form the basis for our graph database records. Each text chunk relates to one or more topics and entities. The text pipeline stores this information in AWS Neptune Analytics. Neptune Analytics fits our use case but is quite pricey.

We initially picked Neo4j as our graph database provider. However, Neo4j has not offered REST API access for the hosted version since version 3.5. Buying a full license and hosting it ourselves is prohibitively expensive and complex. Furthermore, we couldn’t use Neo4j through WebSockets without making changes to the Neo4j library. This would have taken days. Using AWS Neptune offered a much easier solution, and while it required the usual struggle with IAM policies, it was still faster than patching the Neo4j library ourselves to make it work in the Cloudflare ecosystem.

The text pipeline converts the same text into vector embeddings using bge-base-en-v1.5 and stores them in Cloudflare Vectorize. Because Vectorize limits each index to 2 million vectors, the pipeline takes on some overhead to shard the data across multiple indexes.
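
As a sketch of what spreading vectors over several indexes can involve, the function below splits a batch of vector ids across indexes based on how many vectors have already been stored. It assumes a single running counter and append-only writes; the `text-chunks-N` index names and the `assignToIndexes` function are hypothetical.

```typescript
const MAX_VECTORS_PER_INDEX = 2000000; // per-index limit quoted above

interface IndexBatch {
  indexName: string;
  ids: string[];
}

// Split a batch of vector ids across indexes, given the number of vectors stored so far.
function assignToIndexes(
  ids: string[],
  alreadyStored: number,
  limit = MAX_VECTORS_PER_INDEX,
): IndexBatch[] {
  const batches: IndexBatch[] = [];
  ids.forEach((id, offset) => {
    // The global position of the vector decides which index it lands in.
    const shard = Math.floor((alreadyStored + offset) / limit);
    const indexName = `text-chunks-${shard}`;
    const last = batches[batches.length - 1];
    if (last && last.indexName === indexName) {
      last.ids.push(id);
    } else {
      batches.push({ indexName, ids: [id] });
    }
  });
  return batches;
}
```

On the read side, a query then has to be sent to every index and the results merged by score, which is part of the overhead mentioned above.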

Finally, it takes the same text chunks once again and pushes them, together with their topics, into Typesense to enable regular text search.

After indexing all of our input data, we are ready to build out the retriever!

Stay tuned for the next edition of this blog series and be sure to follow us on LinkedIn to be notified when it’s released. Schedule a free consultation with our team if you want more in-depth information on building a SOTA RAG application yourself or with our help.