ZeroxPDFLoader
This notebook provides a quick overview for getting started with the ZeroxPDF document loader. For detailed documentation of all DocumentLoader features and configurations, head to the API reference.
Overview
ZeroxPDFLoader is a document loader that leverages the Zerox library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ZeroxPDFLoader | langchain_community | ❌ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
---|---|---|---|---|
ZeroxPDFLoader | ✅ | ❌ | ✅ | ✅ |
Setup
Credentials
Appropriate credentials need to be set up in environment variables. The loader supports a number of different models and model providers. See the Usage section below for a few examples, or the Zerox documentation for the full list of supported models.
Installation
To use ZeroxPDFLoader, you need to install the zerox package. Also make sure to have langchain-community installed.
pip install zerox langchain-community
Initialization
ZeroxPDFLoader enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.
If you're working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using nest_asyncio. You can set this up as follows:
import nest_asyncio
nest_asyncio.apply()
import os
from getpass import getpass
# use nest_asyncio (only necessary inside of jupyter notebook)
import nest_asyncio
from dotenv import load_dotenv
from langchain_community.document_loaders.pdf import ZeroxPDFLoader
nest_asyncio.apply()
load_dotenv()
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")
file_path = "./example_data/layout-parser-paper.pdf"
loader = ZeroxPDFLoader(file_path)
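By default, the loader processes pages with the "gpt-4o-mini" model (see the API reference below). If you want to target another provider, you can pass the model argument in the <provider>/<model> format. Below is a minimal sketch, not used in the rest of this notebook; the model name is only an example and assumes the matching provider credentials are set in your environment:
# Sketch: a loader configured for a different vision-capable model.
# "azure/gpt-4o-mini" is only an example; adjust it and the related
# environment variables (e.g. Azure endpoint and API key) for your provider.
azure_loader = ZeroxPDFLoader(
    file_path,
    model="azure/gpt-4o-mini",
)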
Load
docs = loader.load()
docs[0]
Document(metadata={'author': '', 'creationdate': '2021-06-22T01:27:10+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'producer': 'pdfTeX-1.40.21', 'subject': '', 'title': '', 'trapped': 'False', 'source': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'num_pages': 16, 'page': 0}, page_content='# LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n\nZejian Shen¹ (✉), Ruosen Zhang², Melissa Dell³, Benjamin Charles Germain Lee⁴, Jacob Carlson³, and Weining Li⁵\n\n¹ Allen Institute for AI \nshannons@allenai.org \n² Brown University \nruosen_zhang@brown.edu \n³ Harvard University \n{melissadell, jacob.carlson}@fas.harvard.edu \n⁴ University of Washington \nbgcl@cs.washington.edu \n⁵ University of Waterloo \nw4221i@uwaterloo.ca \n\n**Abstract.** Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases. The library is publicly available at [https://layout-parser.github.io](https://layout-parser.github.io).\n\n**Keywords:** Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.\n\n## 1 Introduction\n\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11]')
import pprint
pprint.pp(docs[0].metadata)
{'author': '',
'creationdate': '2021-06-22T01:27:10+00:00',
'creator': 'LaTeX with hyperref',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
'2020) kpathsea version 6.3.2',
'producer': 'pdfTeX-1.40.21',
'subject': '',
'title': '',
'trapped': 'False',
'source': './example_data/layout-parser-paper.pdf',
'total_pages': 16,
'num_pages': 16,
'page': 0}
Lazy Load
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)
        pages = []
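The index.upsert(page) comment above is only a placeholder. As one possible concrete version of that batch step, here is a minimal sketch that indexes each batch of pages into an in-memory vector store; it assumes the langchain-openai package is installed and OPENAI_API_KEY is set, and the batch size of 10 is arbitrary:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Index pages in batches of 10, then flush whatever is left at the end.
vector_store = InMemoryVectorStore(embedding=OpenAIEmbeddings())
batch = []
for doc in loader.lazy_load():
    batch.append(doc)
    if len(batch) >= 10:
        vector_store.add_documents(batch)
        batch = []
if batch:
    vector_store.add_documents(batch)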
len(pages)
print(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)
The metadata attribute contains at least the following keys:
- source
- page (if in "page" mode)
- total_pages
- creationdate
- creator
- producer
Additional metadata is specific to each parser. This information can be helpful (to categorize your PDFs, for example).
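As a small illustration, the sketch below uses the source metadata key to group loaded pages by the PDF they came from (plain Python, no extra dependencies):
from collections import defaultdict

# Group pages by the PDF they come from, using the "source" metadata key.
pages_by_source = defaultdict(list)
for doc in loader.lazy_load():
    pages_by_source[doc.metadata["source"]].append(doc)

for source, source_pages in pages_by_source.items():
    print(source, "->", len(source_pages), "pages")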
Splitting mode & custom pages delimiter
When loading the PDF file, you can split it in two different ways:
- By page
- As a single text flow
By default ZeroxPDFLoader will split the PDF by page.
Extract the PDF by page. Each page is extracted as a langchain Document object:
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)
In this mode, the PDF is split by pages and the resulting Documents' metadata contains the page number. But in some cases you might want to process the PDF as a single text flow (so that paragraphs are not cut in half). In that case you can use the single mode:
Extract the whole PDF as a single langchain Document object:
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)
Logically, in this mode, the 'page' metadata disappears. Here's how to clearly identify where pages end in the text flow:
Add a custom pages_delimitor to identify the ends of pages in single mode:
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimitor="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])
This could simply be \n, or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection in a Markdown viewer without a visual effect.
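If you later need per-page chunks again, you can split the single document on that delimiter yourself. A minimal sketch, reusing the docs loaded with the custom delimiter above:
# Split the single text flow back into per-page strings using the custom delimiter.
delimiter = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"
page_texts = docs[0].page_content.split(delimiter)
print(len(page_texts), "pages recovered from the single document")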
Extract images from the PDF
ZeroxPDFLoader is able to extract images from your PDFs.
from langchain_community.document_loaders.parsers.pdf import (
    convert_images_to_description,
)

loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_to_text=convert_images_to_description(model=None, format="html"),
)
docs = loader.load()
print(docs[5].page_content)
Working with Files
Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.
As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded. You can use this strategy to analyze different files, with the same parsing parameters.
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import ZeroxPDFParser
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=ZeroxPDFParser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)
# LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
Zejing Shen¹ (✉), Ruochen Zhang², Melissa Dell³, Benjamin Charles Germain Lee⁴, Jacob Carlson³, and Weining Li⁵
¹ Allen Institute for AI
shannons@allenai.org
² Brown University
ruochen_zhang@brown.edu
³ Harvard University
{melissadell, jacob.carlson}@fas.harvard.edu
⁴ University of Washington
bgcl@cs.washington.edu
⁵ University of Waterloo
w422ii@uwaterloo.ca
## Abstract
Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases. The library is publicly available at [https://layout-parser.github.io](https://layout-parser.github.io).
**Keywords:** Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.
## 1 Introduction
Deep Learning (DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11]
{'author': '',
'creationdate': '2021-06-22T01:27:10+00:00',
'creator': 'LaTeX with hyperref',
'keywords': '',
'moddate': '2021-06-22T01:27:10+00:00',
'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
'2020) kpathsea version 6.3.2',
'producer': 'pdfTeX-1.40.21',
'subject': '',
'title': '',
'trapped': 'False',
'source': 'example_data/layout-parser-paper.pdf',
'total_pages': 16,
'num_pages': 16,
'page': 0}
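GenericLoader also provides a from_filesystem convenience constructor. The sketch below is equivalent to the loader defined above (same directory and parser); it is shown here only as an alternative way to wire things up:
# Equivalent setup using the convenience constructor.
fs_loader = GenericLoader.from_filesystem(
    "./example_data/",
    glob="*.pdf",
    parser=ZeroxPDFParser(),
)
docs = fs_loader.load()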
It is possible to work with files from cloud storage.
from langchain_community.document_loaders import CloudBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
loader = GenericLoader(
    blob_loader=CloudBlobLoader(
        url="s3://mybucket",  # Supports s3://, az://, gs://, file:// schemes.
        glob="*.pdf",
    ),
    blob_parser=ZeroxPDFParser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)
API reference
ZeroxPDFLoader
This loader class initializes with a file path and model type, and supports custom configurations via zerox_kwargs for handling Zerox-specific parameters.
Arguments:
- file_path (Union[str, Path]): Path to the PDF file.
- model (str): Vision-capable model to use for processing, in the format <provider>/<model>. Some examples of valid values are:
  - model = "gpt-4o-mini" ## openai model
  - model = "azure/gpt-4o-mini"
  - model = "gemini/gpt-4o-mini"
  - model = "claude-3-opus-20240229"
  - model = "vertex_ai/gemini-1.5-flash-001"
  - See more details in the Zerox documentation
  - Defaults to "gpt-4o-mini".
- **zerox_kwargs (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.
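Since **zerox_kwargs is forwarded to Zerox, provider- or run-specific options can be passed as extra keyword arguments. The sketch below is an assumption based on the Zerox documentation; the parameter names (cleanup, concurrency) should be checked against the Zerox version you have installed:
# Hypothetical example of Zerox-specific keyword arguments; verify the
# parameter names against the Zerox documentation for your installed version.
loader = ZeroxPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    model="gpt-4o-mini",
    cleanup=True,     # assumed Zerox option: remove temporary images after processing
    concurrency=5,    # assumed Zerox option: number of pages processed in parallel
)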
Methods:
- lazy_load: Generates an iterator of Document instances, each representing a page of the PDF, along with metadata including page number and source.
See the full API documentation here.
Notes
- Model Compatibility: Zerox supports a range of vision-capable models. Refer to Zerox's GitHub documentation for a list of supported models and configuration details.
- Environment Variables: Make sure to set required environment variables, such as API_KEY or endpoint details, as specified in the Zerox documentation.
- Asynchronous Processing: If you encounter errors related to event loops in Jupyter Notebooks, you may need to apply nest_asyncio as shown in the setup section.
Troubleshooting
- RuntimeError: This event loop is already running: Use nest_asyncio.apply() to prevent asynchronous loop conflicts in environments like Jupyter.
- Configuration Errors: Verify that the zerox_kwargs match the expected arguments for your chosen model and that all necessary environment variables are set.
Additional Resources
- Zerox Documentation: Zerox GitHub Repository
- LangChain Document Loaders: LangChain Documentation
Related
- Document loader conceptual guide
- Document loader how-to guides