A deeper look into RAG Data Extraction

Verify your extracted data


Here is the scenario: you deploy a chatbot for the marketing team, and its knowledge is based on documents they upload. They start using the chatbot and all goes well, but after a while they report that they uploaded a document containing certain information, yet the chatbot does not reply with the correct information. This could be caused by several things, but I will focus on one particular issue: how the data is extracted from the documents.

For some background, RAG, which stands for Retrieval-Augmented Generation, is a technique used in artificial intelligence to improve the way AI models generate responses or answers. It consists of two parts:

  1. Retrieval: The system first searches a large database or collection of documents to find relevant information based on the user's query. This step helps the AI gather facts and context that are directly related to what you're asking about.
  2. Augmented Generation: After retrieving the relevant information, the AI uses this information to generate a more accurate and informative response. Essentially, it "augments" its own knowledge with the external information it retrieved to give you a better answer.
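To make the two steps concrete, here is a minimal sketch in plain Python. It is a toy, not how a production system works: real RAG applications typically use embeddings and a vector store for retrieval, and send the final prompt to a language model. The keyword-overlap scoring, the function names `retrieve` and `build_prompt`, and the sample chunks are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Retrieval step: rank chunks by how many query tokens they share.

    Keyword overlap is a stand-in for the embedding similarity search
    a real RAG system would typically use.
    """
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation step: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical chunks, modelled on text extracted from a product brochure.
chunks = [
    "EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens) MRP: 49 995.00",
    "EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens) MRP: 68 995.00",
    "All images and effects are simulated. Actual images may vary.",
]

question = "What is the price of the EOS 1500D?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)  # this string is what would be sent to the language model
```

Notice that the answer can only be as good as the chunk that was retrieved. If extraction had dropped or misplaced the price, the model would have nothing correct to work with.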

But before you can retrieve the data, you need to extract it, usually from documents uploaded by the user. The data could come from several sources, e.g. PDF, Word and Excel documents, databases, and any other source of structured or unstructured data. There are many Python libraries that help you extract data, and I will focus on examining data pulled from a PDF document.

When you build a RAG application, you usually give end users the ability to upload the documents they would like to use as the knowledge base. In the background, you process the documents, starting with extracting the information. Let's say this RAG application is for a marketing team: you will find that they upload documents which are visually rich but difficult to extract data from.

[Image: sample page 16 from the marketing PDF]

The image above is a sample page from a marketing PDF. I have picked page 16, which is quite complex in terms of data extraction: the page is split in two horizontally and then vertically to show information on two different camera models, their features, pricing, etc. I want to show you the results of using different libraries to extract this information. There are many out there, but I will be testing the following:

  1. PyMuPDF
  2. Unstructured
  3. MarkItDown
  4. Document Intelligence

PyMuPDF

This is a popular library used for extracting content from documents. This is the code I wrote to extract the data:

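import pymupdf

# Open the PDF and extract the plain text of every page
doc = pymupdf.open("data/canon.pdf")

# "w" overwrites the file, so re-running the script does not duplicate text
with open("data/pymupdf_file.txt", "w", encoding="utf-8") as f:
    for page in doc:
        text = page.get_text()
        print(text)
        f.write(text)

doc.close()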

Below is the document output for page 16.

15 | Mirrorless Cameras
EOS R100 (RF-S18-45mm f/4.5-6.3 IS STM)
MRP: `64 995.00/U incl. of all taxes
• 24.1 Megapixel APS-C CMOS Sensor
• 4K 25p/FHD 60p Movie Recording
• HD 120p High Frame Rate Video
• Vertical Video Recording
• Hybrid Auto
• 3.5mm External Microphone Port
per second
A Mirrorless for All
All prices are subject to change. For latest prices, please visit in.canon and check out the relevant product section. All images and effects are simulated.
Actual image may vary.
All images and effects are simulated. Actual images may vary. All prices are subject to change. For latest prices, please visit in.canon and check out the relevant product section.
Make Every Trip an EOS Trip
• 24.1 Megapixel APS-C CMOS Sensor
• ISO 100-6400 (Expandable to ISO 12800)
• 9-Point AF (Centre Cross Type)
• 3.0 fps Continuous Shooting
• Full HD 30p Movie Recording
• Built-in Creative Filters
EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens)
MRP: `49 995.00/U incl. of all taxes
Do Great with Canon
• 24.1 Megapixel APS-C CMOS Sensor
• ISO 100-25600 (Expandable to ISO 51200)
• 9-Point AF (Centre Cross Type)
• 5.0 fps Continuous Shooting
• 4K 24p / Full HD 60p Movie Recording
• Creative Assist Function
EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens)
MRP: `68 995.00/U incl. of all taxes
Captured on EOS 200D II, EF-S 18-55 f/4-5.6 IS STM
DSLR CAMERAS
Contents
16 | DSLR Cameras

Pros:

  1. Got the key information for each camera separately
  2. The pricing is in line with the particular camera information

Cons:

  1. Information from the previous page was put onto this page, i.e. the content on the EOS R100 came from the previous page
  2. The page numbers sometimes don't sync with the information from that page
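One way to spot this kind of page mix-up automatically is to look for the printed page labels inside the extracted text. The sketch below is only an illustration: the regular expression is tailored to the label style in this brochure ("15 | Mirrorless Cameras"), and the function name `printed_page_labels` is my own. Other documents would need a different pattern.

```python
import re

# Printed page labels in this brochure look like "15 | Mirrorless Cameras".
PAGE_LABEL = re.compile(r"^(\d+) \| (.+)$", re.MULTILINE)


def printed_page_labels(page_text: str) -> list[tuple[int, str]]:
    """Return every (page number, section title) label found in the text."""
    return [(int(num), title.strip()) for num, title in PAGE_LABEL.findall(page_text)]


# A shortened copy of the PyMuPDF output shown above.
sample = """15 | Mirrorless Cameras
EOS R100 (RF-S18-45mm f/4.5-6.3 IS STM)
A Mirrorless for All
EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens)
DSLR CAMERAS
Contents
16 | DSLR Cameras
"""

labels = printed_page_labels(sample)
print(labels)
if len(labels) > 1:
    # Text from more than one printed page ended up in the same block
    print("Warning: more than one printed page label found")
```

If the check fires, you know that a chunk built from this text may mix content from two printed pages, and any page number you store as metadata could be misleading.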

Unstructured

This is another popular library used for extracting content from documents. This is the code I wrote to extract the data:

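from unstructured.partition.auto import partition

# Split the PDF into elements (titles, text blocks, list items and so on)
elements = partition("data/canon.pdf")
print(elements)

# Save the elements separated by blank lines ("w" avoids duplicates on re-runs)
with open("data/unstructured_file.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(str(el) for el in elements))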

Below is the document output for page 16.

15 | Mirrorless Cameras
 
DSLR CAMERAS
 
Captured on EOS 200D II, EF-S 18-55 f/4-5.6 IS STM
 
Do Great with Canon • 24.1 Megapixel APS-C CMOS Sensor • ISO 100-25600 (Expandable to ISO 51200) • 9-Point AF (Centre Cross Type) • 5.0 fps Continuous Shooting • 4K 24p / Full HD 60p Movie Recording • Creative Assist Function
 
Make Every Trip an EOS Trip • 24.1 Megapixel APS-C CMOS Sensor • ISO 100-6400 (Expandable to ISO 12800) • 9-Point AF (Centre Cross Type) • 3.0 fps Continuous Shooting • Full HD 30p Movie Recording • Built-in Creative Filters
 
EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens) MRP: `68 995.00/U incl. of all taxes
 
EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens) MRP: `49 995.00/U incl. of all taxes
 
All images and effects are simulated. Actual images may vary. All prices are subject to change. For latest prices, please visit in.canon and check out the relevant product section.
 
Contents
 
16 | DSLR Cameras

Pros:

  1. Got the key information for each camera in one line
  2. The footer information is at the bottom

Cons:

  1. The pricing information was separated from the detailed camera information
  2. The camera model information was separated from the detailed camera information
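To check this kind of separation without reading the whole output by eye, you can test whether the facts that belong together actually land in the same chunk. The sketch below assumes chunks are split on blank lines, which mirrors how the snippet above joins elements. The helper name `chunks_containing` is hypothetical, and a real pipeline may chunk differently.

```python
def chunks_containing(text: str, terms: list[str]) -> list[str]:
    """Split on blank lines and keep the chunks that mention every term."""
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    return [c for c in chunks if all(term in c for term in terms)]


# A shortened copy of the Unstructured output shown above.
unstructured_text = (
    "Make Every Trip an EOS Trip • 24.1 Megapixel APS-C CMOS Sensor "
    "• 3.0 fps Continuous Shooting • Full HD 30p Movie Recording\n\n"
    "EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens) MRP: `68 995.00/U incl. of all taxes\n\n"
    "EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens) MRP: `49 995.00/U incl. of all taxes"
)

# The model name and its price share a chunk ...
model_and_price = chunks_containing(unstructured_text, ["EOS 1500D", "49 995.00"])
print(len(model_and_price))

# ... but the model name and its features do not.
model_and_feature = chunks_containing(unstructured_text, ["EOS 1500D", "3.0 fps"])
print(len(model_and_feature))
```

An empty result for a pair of facts you expect users to ask about together is a sign that retrieval will struggle with that question.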

MarkItDown

I recently stumbled across this library by Microsoft and decided to give it a try. This is the code I wrote to extract the data:

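from markitdown import MarkItDown

# Convert the PDF and print the extracted text
md = MarkItDown()
result = md.convert("data/canon.pdf")
print(result.text_content)

# Save the output ("w" overwrites, so re-runs do not duplicate text)
with open("data/markitdown_file.txt", "w", encoding="utf-8") as f:
    f.write(result.text_content)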

Below is the document output for page 16.

15 | Mirrorless Cameras
 
DSLR CAMERAS
 
Captured on EOS 200D II, EF-S 18-55 f/4-5.6 IS STM
 
Do Great with Canon
•  24.1 Megapixel APS-C CMOS Sensor
•  ISO 100-25600 (Expandable to ISO 51200)
•  9-Point AF (Centre Cross Type)
•  5.0 fps Continuous Shooting
•  4K 24p / Full HD 60p Movie Recording
•  Creative Assist Function
 
Make Every Trip an EOS Trip
•  24.1 Megapixel APS-C CMOS Sensor
•  ISO 100-6400 (Expandable to ISO 12800)
•  9-Point AF (Centre Cross Type)
•  3.0 fps Continuous Shooting
•  Full HD 30p Movie Recording
•  Built-in Creative Filters
 
EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens)
MRP: ` 68 995.00/U incl. of all taxes
 
EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens)
MRP: ` 49 995.00/U incl. of all taxes
 
All images and effects are simulated. Actual images may vary. All prices are subject to change. For latest prices, please visit in.canon and check out the relevant product section.
 
Contents
 
16 | DSLR Cameras

Pros:

  1. Got the key information for each camera separately
  2. The footer information is at the bottom

Cons:

  1. The pricing information was separated from the detailed camera information
  2. The camera model information was separated from the detailed camera information

Document Intelligence

This is not really a library but a service by Microsoft which is called via an API. It is the best document extractor I have used and what I lean on for most projects. This is the code I wrote to extract the data:

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient


def analyze_layout():
    # Replace the endpoint and key with the values for your own resource
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint="https://docservicename.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<Enter Document Intelligence Key>"),
    )

    # Submit the PDF to the prebuilt layout model
    with open("data/canon.pdf", "rb") as document:
        poller = document_intelligence_client.begin_analyze_document("prebuilt-layout", document)

    # Wait for the analysis to finish, then save the extracted text.
    # UTF-8 is set explicitly so that symbols such as ₹ are written correctly.
    result = poller.result()
    print(result.content)
    with open("data/document_intelligence_file.txt", "w", encoding="utf-8") as f:
        f.write(result.content)


analyze_layout()

Below is the document output for page 16.

15 | Mirrorless Cameras » :unselected: :unselected: :unselected: :unselected: :unselected: :selected: :selected:
DSLR CAMERAS
Captured on EOS 200D II, EF-S 18-55 f/4-5.6 IS STM
Canon
EOS 200D II
Do Great with Canon
· 24.1 Megapixel APS-C CMOS Sensor
· ISO 100-25600 (Expandable to ISO 51200)
· 9-Point AF (Centre Cross Type)
· 5.0 fps Continuous Shooting
· 4K 24p / Full HD 60p Movie Recording
. Creative Assist Function
Dual Pixel CMOS
AF
DiG!C 8
Smooth Skin
Eye Detection AF
Wi-Fi Bluetooth®
Creative Assist
EOS 200D II (EF-S18-55mm f/4-5.6 IS STM Lens)
MRP: ₹ 68 995.00/U incl. of all taxes
Canon
EOS
EOS 1500D
Make Every Trip an EOS Trip
· 24.1 Megapixel APS-C CMOS Sensor
· ISO 100-6400 (Expandable to ISO 12800)
· 9-Point AF (Centre Cross Type)
. 3.0 fps Continuous Shooting
· Full HD 30p Movie Recording
· Built-in Creative Filters
24.1
MEGA PIXELS
DiG!C 4+
CMOS
EOS Movie FULL HD
Creative Filters
Wi-Fi / NFC
EOS 1500D (EF-S18-55mm f/3.5-5.6 IS II Lens)
MRP: ₹ 49 995.00/U incl. of all taxes
All images and effects are simulated. Actual images may vary. All prices are subject to change. For latest prices, please visit in.canon and check out the relevant product section.
« :selected:
Contents
16 | DSLR Cameras » :selected:

Pros:

  1. Got the key information for each camera separately
  2. The pricing information is with the camera detailed information
  3. It retrieved text from "image like" content
  4. It was the only one that retrieved the currency symbol with the prices

Cons:

  1. This is a paid service, as opposed to the other libraries, which are open source.
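One practical note on the output above: it contains `:selected:` and `:unselected:` tokens. These appear to be how the layout model represents detected selection marks such as checkboxes, and here they most likely come from navigation elements on the page. You may want to remove them before indexing the text. The following is a minimal sketch of one way to do that; the function name `strip_selection_marks` is my own and not part of any SDK.

```python
import re

# Tokens that appear in the extracted content for detected selection marks.
SELECTION_MARK = re.compile(r":(?:un)?selected:")


def strip_selection_marks(text: str) -> str:
    """Remove selection-mark tokens and tidy the whitespace they leave behind."""
    cleaned_lines = []
    for line in text.splitlines():
        line = SELECTION_MARK.sub("", line)
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:  # drop lines that held nothing but tokens
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# A shortened copy of the Document Intelligence output shown above.
sample = (
    "15 | Mirrorless Cameras » :unselected: :unselected: :selected:\n"
    "DSLR CAMERAS\n"
    "« :selected:\n"
    "Contents\n"
)

cleaned = strip_selection_marks(sample)
print(cleaned)
```

The sketch leaves the `»` and `«` characters in place, since whether those are noise depends on the document.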

Conclusion

As the saying goes, "you get what you pay for", and it is clear that Microsoft's Document Intelligence service performed better than the other libraries in this test. It is a paid service, so this is not really a like-for-like comparison, but in my experience it gives you the best data extraction.

So if users were, let's say, looking for the price of a camera, the price information sits with the camera information, which is what you need to return a useful result. The currency symbol is also returned, as the service was able to extract that as well.
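A check like this can be automated for any of the outputs. The sketch below pulls the prices out of extracted text and reports whether the currency symbol survived. The regular expression is an assumption based on the price format in this brochure ("MRP: ₹ 68 995.00/U"), and `find_prices` is a hypothetical helper name.

```python
import re

# Matches lines such as "MRP: ₹ 68 995.00/U" or "MRP: `68 995.00/U".
PRICE = re.compile(
    r"MRP:\s*(?P<symbol>[^\d\s])?\s*(?P<amount>\d{1,3}(?: \d{3})*\.\d{2})"
)


def find_prices(text: str) -> list[tuple[str, float]]:
    """Return (symbol, amount) pairs; symbol is '' when none was extracted."""
    prices = []
    for match in PRICE.finditer(text):
        amount = float(match.group("amount").replace(" ", ""))
        prices.append((match.group("symbol") or "", amount))
    return prices


# One price line in each of the two styles seen in the outputs in this article.
with_symbol = "MRP: ₹ 68 995.00/U incl. of all taxes"
with_backtick = "MRP: `68 995.00/U incl. of all taxes"

for line in (with_symbol, with_backtick):
    for symbol, amount in find_prices(line):
        status = "ok" if symbol == "₹" else "currency symbol missing or garbled"
        print(f"{amount:.2f} -> {status}")
```

Running the same check over each extractor's output file gives a quick, repeatable way to compare them on the details your users care about.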

Note that each of the libraries has several features which I have not gone into, and I have only used the most basic implementation to extract data, so you may be able to get more out of them. Depending on your scenario, hopefully this gives you an idea of when to use which library.