Preprocessing Unstructured Data for LLM Applications
Overview
These are the notes I took while following this lecture from DeepLearning.AI.
I put a lot of effort into organising everything from the lecture into a single, coherent document. I hope it helps you.
This notebook covers:
- Processing documents:
- content
- elements: title, narrative text, table, image
- metadata: page number, file type, file name
- RAG: retrieve context from a database and inject it into the prompt for an LLM
Normalising Content
We want to convert raw documents to a common format so that LLMs treat everything in the same way.
We then perform data serialisation to store the pre-processed content. (In this course we use JSON.)
import os
import json
from pprint import pprint
import warnings; warnings.filterwarnings('ignore') # noqa
from IPython.display import Image
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json
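As a quick illustration of the serialisation step mentioned above, here is a minimal sketch (not from the lecture): the output filename is hypothetical, and `elements` stands for any list of partitioned elements such as the ones produced below.
# Hedged sketch: serialising partitioned elements to JSON and loading them back.
# `elements` is assumed to be a list of unstructured Element objects;
# "preprocessed_elements.json" is a hypothetical output path.
elements_to_json(elements, filename="preprocessed_elements.json")

with open("preprocessed_elements.json") as f:
    element_dicts = json.load(f)             # plain dicts
restored = dict_to_elements(element_dicts)   # back to unstructured Elements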
Process HTML (local)
The HTML page we are going to process looks like the image below.
There are images, titles, and other web elements.
="images/HTML_demo.png", width=250) Image(filename
We use the partition_html function to process the HTML file.
We get a list of unstructured elements back.
= "example_files/medium_blog.html"
filename = partition_html(filename=filename)
elements
= set([type(e) for e in elements])
element_types
pprint({"number": len(elements),
"types": element_types,
})
{'number': 72,
'types': {<class 'unstructured.documents.elements.NarrativeText'>,
<class 'unstructured.documents.elements.ListItem'>,
<class 'unstructured.documents.elements.Title'>,
<class 'unstructured.documents.elements.Text'>}}
We can use the to_dict
method to get the information / resource for each element!
pprint(elements[0].to_dict(), width=60, depth=2)
{'element_id': '58ce54f95c4ba051d7ba46498beed83d',
'metadata': {'category_depth': 0,
'file_directory': 'example_files',
'filename': 'medium_blog.html',
'filetype': 'text/html',
'languages': [...],
'last_modified': '2024-08-28T09:58:39',
'link_texts': [...],
'link_urls': [...]},
'text': 'Open in app',
'type': 'Title'}
Parsing PowerPoint (local)
We can parse the PowerPoint slides with partition_pptx.
The slide to be ingested looks like the image below.
="images/pptx_slide.png", width=300) Image(filename
= "example_files/msft_openai.pptx"
filename = partition_pptx(filename=filename)
elements
= set([type(e) for e in elements])
element_types
print("Elements Information")
pprint({"number": len(elements),
"types": element_types,
})
print("\nElement #1 content")
pprint(elements[0].to_dict(), width=60, depth=2)
Elements Information
{'number': 7,
'types': {<class 'unstructured.documents.elements.ListItem'>,
<class 'unstructured.documents.elements.Title'>}}
Element #1 content
{'element_id': 'e53cb06805f45fa23fb6d77966c5ec63',
'metadata': {'category_depth': 1,
'file_directory': 'example_files',
'filename': 'msft_openai.pptx',
'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
'languages': [...],
'last_modified': '2024-08-28T09:58:40',
'page_number': 1},
'text': 'ChatGPT',
'type': 'Title'}
Parsing PDF (API)
We could parse PDF files locally with partition_pdf, like in the previous examples.
However, this time we do something different: we use the Python SDK for the unstructured API, sending the PDF to the cloud for processing.
This is how the document looks:
="images/cot_paper.png", width=250) Image(filename
To use the API, we need to connect to the unstructured service.
We can get a free unstructured API key at the time of writing (~2024).
In this example, I stored the API key in the environment variable TOKEN_UNSTRUCTURED.
We use the dotenv.load_dotenv function to load the .env file, and then read the key with os.getenv.
from dotenv import load_dotenv
load_dotenv()
= os.getenv("TOKEN_UNSTRUCTURED")
DLAI_API_KEY = "https://api.unstructured.io/general/v0/general"
DLAI_API_URL
= UnstructuredClient(
s =DLAI_API_KEY,
api_key_auth=DLAI_API_URL,
server_url )
= "example_files/CoT.pdf"
filename
with open(filename, "rb") as f:
=shared.Files(
files=f.read(),
content=filename,
file_name
)
= shared.PartitionParameters(
request =files,
files='hi_res',
strategy=True,
pdf_infer_table_structure=["eng"],
languages
)try:
= s.general.partition(request)
response except SDKError as e:
print(e)
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 1 (1 total)
INFO: Determined optimal split size of 2 pages.
INFO: Document has too few pages (1) to be split efficiently. Partitioning without split.
INFO: Successfully partitioned the document.
The response is a standard HTTP response so we get the status code and content type.
response.status_code, response.content_type
(200, 'application/json')
As in the local processing case, we get a collection of elements in the response.
But this time each element is a dict object, instead of an unstructured element.
elements = response.elements

element_types = set([type(e) for e in elements])
print("Elements Information")
pprint({"number": len(elements),
"types": element_types,
})
print("\nElement #1 content")
pprint(elements[0], width=60, depth=2)
Elements Information
{'number': 6, 'types': {<class 'dict'>}}
Element #1 content
{'element_id': '826446fa7830f0352c88808f40b0cc9b',
'metadata': {'filename': 'CoT.pdf',
'filetype': 'application/pdf',
'languages': [...],
'page_number': 1},
'text': 'B All Experimental Results',
'type': 'Title'}
Metadata Extraction
Introduction
Metadata includes
- Document details: additional information to enrich the content.
- Source identification: filename, url, filetype
- Structural information: hierarchy indicator (section, …)
- Search enhancements: tags that can be used for filtering …
Hybrid Search
Pure semantic search (with embeddings) may not work well because
- there are too many matches
- users prefer recent information
- the important information is actually in the metadata
We use “hybrid search” instead, applying techniques such as filtering and keyword search.
Metadata is helpful for hybrid search.
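As a hedged sketch of what this can look like with chromadb (the collection, the "chapter" metadata field, and the query text below are purely illustrative; the real collection is built later in this notebook):
# Hedged sketch: a hybrid-style query in chromadb, combining
# semantic similarity, a metadata filter, and a keyword constraint.
result = collection.query(
    query_texts=["rules of the game"],        # semantic search via embeddings
    n_results=3,
    where={"chapter": "ICE-HOCKEY"},          # metadata filter
    where_document={"$contains": "players"},  # keyword filter on document text
)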
Parsing EPUB (API)
Let’s take an example where we parse an EPUB file.
It’s a long book, and we want metadata such as page and section information. The cover and table of contents look like the images below.
Image(filename='images/winter-sports-cover.png', width=150)
Image(filename="images/winter-sports-toc.png", width=150)
Again, we use the web API to process the file.
= "example_files/winter-sports.epub"
filename
with open(filename, "rb") as f:
=shared.Files(
files=f.read(),
content=filename,
file_name
)
= shared.PartitionParameters(files=files, content_type='epub')
request
try:
= s.general.partition(request)
response except SDKError as e:
print(e)
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Let’s check out the response. It is just like before, but now we have a lot of elements (716).
elements = response.elements

element_types = set([type(e) for e in elements])
print("Elements Information")
pprint({"number": len(elements),
"types": element_types,
})
print("\nElement #1 content")
pprint(elements[0], width=60, depth=2)
Elements Information
{'number': 716, 'types': {<class 'dict'>}}
Element #1 content
{'element_id': 'c2994d1baf27b22f630c8f044664879d',
'metadata': {'category_depth': 1,
'filename': 'winter-sports.epub',
'filetype': 'application/epub',
'languages': [...]},
'text': 'The Project Gutenberg eBook of Winter Sports in '
'Switzerland, by E. F. Benson',
'type': 'Title'}
Hierarchy in Elements
Notice that each element has two attributes that indicate the hierarchical structure of these elements, namely
- element_id
- metadata/parent_id
We can recover a tree data structure from these two (a minimal sketch follows).
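A minimal sketch of grouping elements into a parent → children mapping (my own illustration, not from the lecture):
# Hedged sketch: building a parent -> children index from element_id / parent_id.
from collections import defaultdict

children_of = defaultdict(list)
for e in elements:
    pid = e["metadata"].get("parent_id")
    if pid is not None:
        children_of[pid].append(e["element_id"])

# children_of[title_id] now lists the ids of elements nested under that title.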
idx = 9
print(f"\nElement #{idx + 1} content\n")
pprint(elements[idx], width=60, depth=2)
Element #10 content
{'element_id': 'aa4b17534a6fb8c01a05440474d89f99',
'metadata': {'filename': 'winter-sports.epub',
'filetype': 'application/epub',
'languages': [...],
'parent_id': 'ec704441617fcfa0d3dbbcc2e9c5570a'},
'text': '*** START OF THE PROJECT GUTENBERG EBOOK WINTER '
'SPORTS IN SWITZERLAND ***',
'type': 'UncategorizedText'}
Recover Chapter detail
Here is an example: we can find the elements that “belong” to each chapter title.
If we know these “root nodes”, we can find their children by traversing the tree!
chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]
Step 1: get the element IDs of these root nodes.
chapter_ids = {}
for element in elements:
    for chapter in chapters:
        e_text = element["text"]
        e_type = element["type"]
        if (chapter in e_text) and e_type == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

chapter_ids
{'65af5da00154a2f526b43177bbad3189': 'THE SUN-SEEKER',
'4e3f02ce1525ca1a197b5d5e9cd1953d': 'RINKS AND SKATERS',
'c2f9b5a30cb07d9adaf5f390dc8e7564': 'TEES AND CRAMPITS',
'67d58ab3aae6e41e9ce429dc4cbe5501': 'ICE-HOCKEY',
'ea7faf4689009a3b72bf19dc10f29037': 'SKI-ING',
'ea3feb77c48592d21a05386e68fde88e': 'NOTES ON WINTER RESORTS',
'56c4ba79d33a505bb9b4480d7a5a4ce7': 'FOR PARENTS AND GUARDIANS'}
Step 2: find the children.
Now we find the children IDs for each chapter.
(To keep the output readable, I limited the number of elements per chapter to at most 5.)
MAX_LENGTH = 5
chapter_to_id = {v: k for k, v in chapter_ids.items()}

children = {}
for chapter in chapters:
    children[chapter] = []
    parent_id = chapter_to_id[chapter]
    for e in elements:
        if e['metadata'].get("parent_id") == parent_id:
            children[chapter].append(e)
            if len(children[chapter]) >= MAX_LENGTH:
                break

pprint(children, depth=2)
{'FOR PARENTS AND GUARDIANS': [{...}, {...}, {...}, {...}, {...}],
'ICE-HOCKEY': [{...}, {...}, {...}, {...}, {...}],
'NOTES ON WINTER RESORTS': [{...}, {...}, {...}, {...}, {...}],
'RINKS AND SKATERS': [{...}, {...}, {...}, {...}, {...}],
'SKI-ING': [{...}, {...}, {...}, {...}, {...}],
'TEES AND CRAMPITS': [{...}, {...}, {...}, {...}, {...}],
'THE SUN-SEEKER': [{...}, {...}, {...}, {...}, {...}]}
Ingesting Documents to VDB
Vector databases are a convenient way to store the embeddings and metadata of elements.
The embeddings can be used for semantic search.
The metadata can be used to aid hybrid search.
Vector Database
In this section we use chromadb as the vector database.
By default, chromadb uses all-MiniLM-L6-v2 as the embedding model.
import chromadb
client = chromadb.PersistentClient(
    path="chroma_tmp",
    settings=chromadb.Settings(allow_reset=True),
)
client.reset()
INFO: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
True
collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)
Ingesting Documents
The vector database chromadb automatically calculates the embeddings for us.
However, by default the calculation is performed on the CPU and is very slow. A GPU gives much better performance; that is beyond the scope of these notes, but a rough sketch follows.
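A rough sketch of what that could look like (my own assumption, not from the lecture): chromadb lets you pass an explicit embedding function, and its sentence-transformers wrapper accepts a device argument.
# Hedged sketch: an explicit sentence-transformers embedding function on GPU.
# Assumes a CUDA-capable GPU is available; the collection name is hypothetical.
from chromadb.utils import embedding_functions

gpu_embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",
    device="cuda",
)

gpu_collection = client.create_collection(
    name="winter_sports_gpu",
    metadata={"hnsw:space": "cosine"},
    embedding_function=gpu_embedder,
)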
for element in elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")

    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )
Query the database
We can use the peek
method to check out items in the database.
results = collection.peek(limit=3)
pprint(results["documents"])
['• You provide a full refund of any money paid by a user who notifies you in '
'writing (or by e-mail) within 30 days of receipt that s/he does not agree to '
'the terms of the full Project Gutenberg™ License. You must require such a '
'user to return or destroy all copies of the works possessed in a physical '
'medium and discontinue all use of and all access to other copies of Project '
'Gutenberg™ works.',
'[Image unavailable.]',
'Each call must be skated at least twice, beginning once with the right foot '
'and once with the left.']
We can use the query method to perform semantic search.
We can use the where parameter to filter out unwanted items.
result = collection.query(
    query_texts=["How many people are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
pprint(result)
{'data': None,
'distances': [[0.6024625301361084, 0.8077031970024109]],
'documents': [['It is a wonderful and delightful sight to watch the speed and '
'accuracy of a first-rate team, each member of which knows the '
'play of the other five players. The finer the team, as is '
'always the case, the greater is their interdependence on each '
'other, and the less there is of individual play. Brilliant '
'running and dribbling, indeed, you will see; but as '
'distinguished from a side composed of individuals, however '
'good, who are yet not a team, these brilliant episodes are '
'always part of a plan, and end not in some wild shot but in a '
'pass or a succession of passes, designed to lead to a good '
'opening for scoring. There is, indeed, no game at which team '
'play outwits individual brilliance so completely.',
'For the rest, everybody knows the “sort of thing” hockey is, '
'and quite rightly supposes that ice-hockey is the same “sort '
'of thing” played on a field of ice by performers shod in '
'skates. As is natural, the practice and ability which enable '
'a man to play ordinary hockey with moderate success are a '
'large factor in his success when he woos the more elusive '
'sister-sport; another factor, and one which is not '
'sufficiently appreciated, is the strength of his skating. It '
'is not enough to be able to run very swiftly on the skates: '
'no one is an ice-hockey player of the lowest grade who cannot '
'turn quickly to right or left, start quickly, and above all, '
'stop quickly. However swift a player may be, he is '
'practically useless to his side unless he can, with moderate '
'suddenness, check his headlong career, turn quickly, and when '
'the time comes again start quickly.']],
'embeddings': None,
'ids': [['25e6c2912acbc885f6d450e94497e432',
'1f14432f4482db83e2fa6b28bb90ea4e']],
'included': ['metadatas', 'documents', 'distances'],
'metadatas': [[{'chapter': 'ICE-HOCKEY'}, {'chapter': 'ICE-HOCKEY'}]],
'uris': None}
Chunking the Content
Introduction
- Chunking Necessity: Vector databases need documents split into chunks for retrieval and prompt generation.
- Query Result Variability: The same query will return different content depending on how the document is chunked.
- Even Size Chunks: The easiest way is to split the document into roughly even-sized chunks, but this can result in similar content getting split across chunks (a minimal sketch of this basic approach follows this list).
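A minimal sketch of this basic, size-based approach using unstructured's chunk_elements (the parameter value is illustrative; the real example below uses chunk_by_title instead):
# Hedged sketch: basic, roughly even-sized chunking with unstructured.
# `elements` is assumed to be a list of unstructured Element objects.
from unstructured.chunking.basic import chunk_elements

basic_chunks = chunk_elements(
    elements,
    max_characters=500,   # hard cap on the size of each chunk (illustrative value)
)
print(len(basic_chunks))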
Chunking by Atomic Elements
By identifying atomic elements, you can chunk by combining elements rather than splitting raw text.
- Results in more coherent chunks
- Example: combining content under the same section header into the same chunk.
We can use the chunk_by_title
function to achieve this.
How is this done?
- partition the document into atomic elements
- combine elements until a break condition is satisfied.
What are the break conditions?
- a new title/section is encountered
- the element similarity value exceeds a threshold
- the element metadata (e.g., page number) changes.
We need to experiment a bit to find a good strategy. Here is an example.
from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements
elements = dict_to_elements(elements)
print(f"There are {len(elements)} elements ({type(elements[0])})")
There are 716 elements (<class 'unstructured.documents.elements.Title'>)
Now we apply the chunk_by_title
function where the break condition is:
Splits off into a new CompositeElement when a title is detected or if metadata changes, which happens when page numbers or sections change. Cuts off sections once they have exceeded a character length of max_characters.
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,
    max_characters=3000,  # maximum number of characters per chunk
)
print(f"There are {len(chunks)} chunks after `chunk_by_title`")
There are 248 chunks after `chunk_by_title`
print(chunks[1])
Title: Winter Sports in Switzerland
Author: E. F. Benson
Illustrator: C. Fleming Williams
Photographer: Mrs. Aubrey Le Blond
Release date: August 23, 2019 [EBook #60153] Most recently updated: January 30, 2020
Preprocessing PDFs and Images
Introduction: DLD and ViT
Sometimes we need visual information to process documents. Here we use two techniques:
- Document layout detection (DLD)
- Vision transformers (ViT)
For the DLD approach, we use an object detection model (typically, YOLO) to draw bounding boxes:
- vision detection: identify and classify bounding boxes
- (optional) text extraction: perform OCR within the bounding boxes
For the ViT approach, we often use the document understanding transformer (DONUT) architecture (arxiv paper; see the sketch after this list), where
- the image is passed to the transformer encoder, and the decoder generates the output
- OCR is not needed
- we can train the model to output a valid JSON string.
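As a hedged sketch of the ViT/Donut approach (not part of the course code; the checkpoint, task prompt, and image path follow the standard Hugging Face example and are illustrative):
# Hedged sketch: OCR-free document parsing with a Donut checkpoint from Hugging Face.
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

model_name = "naver-clova-ix/donut-base-finetuned-cord-v2"   # illustrative checkpoint
processor = DonutProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name)

image = Image.open("example_files/receipt.png").convert("RGB")  # hypothetical image
pixel_values = processor(image, return_tensors="pt").pixel_values

task_prompt = "<s_cord-v2>"   # task-specific start token for this checkpoint
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
)
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))   # decode the generated tokens into JSON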
Example
Now we will parse a document that looks like the image below.
The title should be
A potent pair of atmospheric rivers …
="images/el_nino.png", width=300) Image(filename
Parsing HTML (Pure Text, Local)
If we parse the HTML page directly, we get some results.
However, there are issues. Most notably, header/metadata content is recognised as the title.
= "example_files/el_nino.html"
filename = partition_html(filename=filename)
html_elements
for element in html_elements[:10]:
if len(element.text) > 40:
print(f"{element.category.upper(): <16}: {element.text[:40]} ...")
else:
print(f"{element.category.upper(): <16}: {element.text}")
UNCATEGORIZEDTEXT: CNN 1/30/2024
TITLE : A potent pair of atmospheric rivers will ...
TITLE : By Mary Gilbert, CNN Meteorologist
TITLE : Updated: 3:49 PM EST, Tue January 30, 20 ...
TITLE : Source: CNN
NARRATIVETEXT : A potent pair of atmospheric river-fuele ...
NARRATIVETEXT : The soaking storms will raise the flood ...
NARRATIVETEXT : El Niño – a natural phenomenon in the tr ...
NARRATIVETEXT : El Niño hasn’t materialized many atmosph ...
NARRATIVETEXT : But all that is set to change this week.
Parsing PDF (Direct, Local)
The fast strategy from unstructured can parse the PDF file by working on the text directly.
However, there are issues. Most notably, the title is wrong!
from unstructured.partition.pdf import partition_pdf
= "example_files/el_nino.pdf"
filename = partition_pdf(filename=filename, strategy="fast")
pdf_elements
for element in pdf_elements[:10]:
if len(element.text) > 40:
print(f"{element.category.upper(): <16}: {element.text[:40]} ...")
else:
print(f"{element.category.upper(): <16}: {element.text}")
INFO: pikepdf C++ to Python logger bridge initialized
HEADER : 1/30/24, 5:11 PM
HEADER : Pineapple express: California to get dre ...
HEADER : CNN 1/30/2024
NARRATIVETEXT : A potent pair of atmospheric rivers will ...
TITLE : By Mary Gilbert, CNN Meteorologist
TITLE : Updated: 3:49 PM EST, Tue January 30, 20 ...
TITLE : Source: CNN
NARRATIVETEXT : A potent pair of atmospheric river-fuele ...
NARRATIVETEXT : The soaking storms will raise the flood t ...
NARRATIVETEXT : El Niño – a natural phenomenon in the tr ...
Parsing PDF (DLD, API)
If we use DLD, we see a great improvement over the direct method.
For instance, the headers are separated from the title.
with open(filename, "rb") as f:
=shared.Files(
files=f.read(),
content=filename,
file_name
)
= shared.PartitionParameters(
req =files,
files="hi_res",
strategy="yolox",
hi_res_model_name
)
try:
= s.general.partition(req)
resp = dict_to_elements(resp.elements)
dld_elements except SDKError as e:
print(e)
INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 2 (2 total)
INFO: Determined optimal split size of 2 pages.
INFO: Document has too few pages (2) to be split efficiently. Partitioning without split.
INFO: Successfully partitioned the document.
for element in dld_elements[:10]:
    if len(element.text) > 40:
        print(f"{element.category.upper(): <16}: {element.text[:40]} ...")
    else:
        print(f"{element.category.upper(): <16}: {element.text}")
HEADER : 1/30/24, 5:11 PM
HEADER : CNN 1/30/2024
HEADER : Pineapple express: California to get dre ...
TITLE : A potent pair of atmospheric rivers will ...
NARRATIVETEXT : By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT : Updated: 3:49 PM EST, Tue January 30, 20 ...
NARRATIVETEXT : Source: CNN
NARRATIVETEXT : A potent pair of atmospheric river-fuele ...
NARRATIVETEXT : The soaking storms will raise the flood t ...
NARRATIVETEXT : El Niño – a natural phenomenon in the tr ...
If we inspect the different categories, the DLD-based PDF parsing does a better job than the HTML parsing!
import collections
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()
[('NarrativeText', 23), ('Title', 7), ('UncategorizedText', 1)]
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()
[('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]
Extracting Tables
Introduction
We can use a table transformer to process tabular data and get a JSON response:
- identify the table location with a document layout model (bounding boxes)
- run the table through the table transformer
We can also just use ViT to extract table content; typically we get HTML source code back.
Finally, we can also use rule-based OCR to process the table.
Example
="images/embedded-images-tables.png", width=250) Image(filename
= "example_files/embedded-images-tables.pdf"
filename
with open(filename, "rb") as f:
=shared.Files(
files=f.read(),
content=filename,
file_name
)
= shared.PartitionParameters(
req =files,
files="hi_res",
strategy="yolox",
hi_res_model_name=[],
skip_infer_table_types=True, # This is important for tables
pdf_infer_table_structure
)
try:
= s.general.partition(req)
resp = dict_to_elements(resp.elements)
elements except SDKError as e:
print(e)
= [el for el in elements if el.category == "Table"] tables
0].text tables[
'Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409 —0.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440 1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540 —0.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05 305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919'
from IPython.core.display import HTML
table_html = tables[0].metadata.text_as_html
HTML(table_html)
Inhibitor concentration (g) | be (V/dec) | ba (V/dec) | Ecorr (V) | icorr (AJcm?) | Polarization resistance (Q) | Corrosion rate (mmj/year) |
---|---|---|---|---|---|---|
0.0335 | 0.0409 | —0.9393 | 0.0003 | 24.0910 | 2.8163 | |
NO | 1.9460 | 0.0596 | —0.8276 | 0.0002 | 121.440 | 1.5054 |
0.0163 | 0.2369 | —0.8825 | 0.0001 | 42121 | 0.9476 | |
s | 03233 | 0.0540 | —0.8027 | 5.39E-05 | 373.180 | 0.4318 |
0.1240 | 0.0556 | —0.5896 | 5.46E-05 | 305.650 | 0.3772 | |
= 5 | 0.0382 | 0.0086 | —0.5356 | 1.24E-05 | 246.080 | 0.0919 |
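Since the table is also available as HTML, we could load it into a DataFrame for further processing. A minimal sketch (assuming pandas and an HTML parser such as lxml are installed):
# Hedged sketch: turning the extracted HTML table into a pandas DataFrame.
import io
import pandas as pd

df = pd.read_html(io.StringIO(table_html))[0]   # read_html returns a list of DataFrames
print(df.shape)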
Table Summarisation
Finally, we can use langchain to summarise the table. Here I use a local LLM served by Ollama instead of OpenAI.
(I use langchain because it was used in the tutorial.)
from langchain_community.llms import Ollama
from langchain_core.documents import Document
from langchain.chains.summarize import load_summarize_chain
= "gemma2"
LLM_MODEL = Ollama(model=LLM_MODEL) llm
= load_summarize_chain(llm, chain_type="stuff")
chain = chain.invoke([Document(page_content=table_html)]) result
print(result['output_text'].strip())
The table presents electrochemical data for a material's corrosion behavior under various inhibitor concentrations.
Key parameters measured include:
* **Inhibitor concentration:** Ranges from 0 to 5 g.
* **be and ba (V/dec):** Polarization resistance coefficients related to anodic and cathodic reactions.
* **Ecorr (V):** Corrosion potential.
* **icorr (A/cm²):** Corrosion current density.
* **Polarization resistance (Q):** Resistance to electrochemical polarization.
* **Corrosion rate (mm/year):** Material degradation rate.
The data reveals that increasing inhibitor concentration generally decreases corrosion current and rate, suggesting an inhibitory effect on the material's corrosion process.
Building RAG Bot
Overview
Example
Our documents include a paper (PDF), a slide deck (PPTX), and a markdown file (README.md).
Image(filename='images/donut_paper.png', width=200)
Image(filename='images/donut_slide.png', width=200)
Image(filename='images/donut_readme.png', width=200)
Process PDF
= "example_files/donut_paper.pdf"
filename
with open(filename, "rb") as f:
=shared.Files(
files=f.read(),
content=filename,
file_name
)
= shared.PartitionParameters(
req =files,
files="fast",
strategy=True,
pdf_infer_table_structure=[],
skip_infer_table_types
)
try:
= s.general.partition(req)
resp = dict_to_elements(resp.elements)
pdf_elements except:
...
Filter Unwanted Items
- the reference section in the paper
- the header on each page
print(len(pdf_elements), "-->", end=" ")
pdf_elements = [el for el in pdf_elements if el.category != "Footer"]
print(len(pdf_elements))
39 --> 34
Process PPTX and Markdown
from unstructured.partition.md import partition_md
= "example_files/donut_slide.pptx"
filename = partition_pptx(filename=filename)
pptx_elements
= "example_files/donut_readme.md"
filename = partition_md(filename=filename) md_elements
Prepare Vector Database
elements = chunk_by_title(pdf_elements + pptx_elements + md_elements)
len(elements)
97
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
= "all-minilm"
EMBEDDING_MODEL = OllamaEmbeddings(model=EMBEDDING_MODEL) embeddings
documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    del metadata["languages"]
    metadata["source"] = metadata["filename"]
    for key, item in metadata.items():
        if isinstance(item, list):
            metadata[key] = "/".join(item)
    documents.append(
        Document(page_content=element.text, metadata=metadata)
    )

vectorstore = Chroma.from_documents(documents, embeddings)
INFO: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Interact with the Vector Database
from langchain_community.llms import Ollama
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
= "gemma2"
LLM_MODEL = Ollama(model=LLM_MODEL)
llm
= vectorstore.as_retriever(
retriever ="similarity",
search_type={"k": 6}
search_kwargs )
= """You are an AI assistant for answering questions \
template about the Donut document understanding model.
You are given the following extracted parts of a long document \
and a question. Provide a conversational answer. \
If you don't know the answer, just say "Hmm, I'm not sure." \
Don't try to make up an answer.
If the question is not about Donut, politely inform them that \
you are tuned to only answer questions about Donut.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(
    template=template, input_variables=["question", "context"]
)

doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")

question_generator_chain = LLMChain(llm=llm, prompt=prompt)

qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)
qa_chain.invoke({"question": "How does Donut compare to other document understanding models?",
"chat_history": []
"answer"] })[
Token indices sequence length is longer than the specified maximum sequence length for this model (1837 > 1024). Running this sequence through the model will result in indexing errors
'Donut achieves state-of-the-art scores with reasonable speed and efficiency. It also shows superior accuracy compared to other models with significantly faster inference speed. \nSOURCES: donut_slide.pptx, donut_readme.md \n\n\n'
Query by Specifying Source
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "filter": {"source": "donut_readme.md"}}
)

filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

filter_chain.invoke({"question": "How do I classify documents with DONUT?",
                     "chat_history": [],
                     "filter": filter,
                     })["answer"]
'You classify documents with DONUT by using the `{"class" : {class_name}}` format, where `{class_name}` is the name of the document\'s class. For example, `{"class" : "scientific_report"}`. \n\nSOURCES: donut_readme.md \n'