Converts a whole subdirectory with any volume (small or huge) of PDF documents to a dataset (pandas DataFrame). No need to set up any external services (no database, brokers, etc.). Just install it and run!
- Conversion of a whole subdirectory with PDF documents into a pandas DataFrame
- Support for parallel and distributed processing through ray
- Extraction is performed per page, making task distribution more uniform for documents with big differences in number of pages
- Incremental writing of the resulting DataFrame, making it possible to process data bigger than memory
- Error tracking of faulty documents
- Resume interrupted processing
- Extract text through pdftotext
- Use OCR for extracting text through pytesseract
- Extract images through pdf2image
- Support for implementing custom feature extraction
- Highly customizable behavior through parameters
# "-por" for portuguese, use the documents language
$ sudo dnf install -y gcc-c++ poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por
$ sudo apt update
# "-por" for portuguese, use the documents language
$ sudo apt install -y build-essential poppler-utils libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por
$ pip3 install pdf2dataset --user # Please, isolate the environment
# First, install poetry, then clone the repository and cd into it
$ poetry install
# Note: path, page and error will always be present in the resulting DataFrame
# Reads all PDFs from my_pdfs_dir and saves the resultant dataframe to my_df.parquet.gzip
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip # Most basic, extract all possible features
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=text # Extract just text
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=image # Extract just image
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 # Minimum parallelism (1 CPU)
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true # For scanned PDFs
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng # For scanned documents with english text
In case of any interruption, just run again with the same output path and the processing will be resumed automatically. The `--saving-interval` flag (or the `saving_interval` param) controls how often the output path is updated, and therefore how often the processing is "checkpointed".
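For example, a possible invocation (the interval value below is just illustrative):

# Persist partial results every 10 processed pages; if interrupted, rerunning the
# exact same command with the same output path resumes from the last checkpoint
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --saving-interval 10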
There are some helper functions to facilitate pdf2dataset usage:
- `extract`: can be used analogously to the CLI
- `extract_text`: `extract` wrapper with `features=text`
- `extract_image`: `extract` wrapper with `features=image`
- `image_from_bytes`: (pdf2image.utils) get a Pillow `Image` object given the image bytes
- `image_to_bytes`: (pdf2image.utils) get the image bytes given a Pillow `Image` object
from pdf2dataset import extract
extract('my_pdfs_dir', 'all_features.parquet.gzip')
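For illustration, a small sketch combining `extract_image` and `image_from_bytes` (assuming both can be imported from the top-level package and that the extracted image bytes end up in an `image` column):

from pdf2dataset import extract_image, image_from_bytes  # assumed import location

# Extract only images in "small" mode (returns the DataFrame directly)
df = extract_image('my_pdfs_dir', small=True)
df = df.dropna(subset=['image'])  # Discard pages whose extraction failed

# Turn the first extracted page into a Pillow Image and save it
first_page = image_from_bytes(df.iloc[0].image)
first_page.save('first_page.png')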
One feature, not available in the CLI, is the custom behavior for handling small volumes of data ("small" meaning the extraction won't run for hours or days and won't be distributed).
The complete list of differences is:
- Faster initialization (uses multiprocessing instead of ray)
- Doesn't save processing progress
- Distributed processing is not supported
- Doesn't write the DataFrame to disk
- Returns the DataFrame
from pdf2dataset import extract_text
df = extract_text('my_pdfs_dir', small=True)
# ...
Instead of specifying a directory, one can specify a list of files to be processed.
from pdf2dataset import extract
my_files = [
'./tests/samples/single_page1.pdf',
'./tests/samples/invalid1.pdf',
]
df = extract(my_files, small=True)
# ...
If you don't want to specify a directory for the documents, you can specify the tasks that will be processed. A task has the form `(document_name, document_bytes, page_number)` or just `(document_name, document_bytes)`. `document_name` must end with `.pdf` but doesn't need to be a real file, `document_bytes` are the bytes of the PDF document, and `page_number` is the number of the page to process (all pages, if not specified).
from pdf2dataset import extract_text
tasks = [
('a.pdf', a_bytes), # Processing all pages of this document
('b.pdf', b_bytes, 1),
('b.pdf', b_bytes, 2),
]
# 'df' will contain results from all pages from 'a.pdf' and page 1 and 2 from 'b.pdf'
df = extract_text(tasks, 'my_df.parquet.gzip', small=True)
# ...
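For completeness, a hedged sketch of building such tasks from PDFs already loaded into memory (the paths below are the test samples used elsewhere in this README):

from pathlib import Path
from pdf2dataset import extract_text

# Any source of bytes works: files, network, a database, ...
a_bytes = Path('./tests/samples/single_page1.pdf').read_bytes()
b_bytes = Path('./tests/samples/multi_page1.pdf').read_bytes()

tasks = [
    ('a.pdf', a_bytes),     # All pages of this document
    ('b.pdf', b_bytes, 1),  # Only page 1
]

df = extract_text(tasks, small=True)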
If you don't want to handle a DataFrame, it's possible to return a nested list with the feature values. The structure of the resulting list is:
result = List[documents]
documents = List[pages]
pages = List[features]
features = List[feature]
feature = any
- `any` is any type supported by pyarrow
- Features are ordered by the feature name (`text`, `image`, etc.)
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[[None]],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']],
[['First page'], ['Second page'], ['Third page']],
[['My beautiful sample!']]]
- Features with error will have `None` as their result value
- Here, `extract_text` was used, so the only feature is `text`
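As a small illustration of walking this structure (a sketch; it assumes the single-feature `extract_text` output shown above):

from pdf2dataset import extract_text

# result[document][page][feature]; with extract_text the only feature is 'text'
result = extract_text('tests/samples', return_list=True)

page_texts = [feature for document in result for page in document for feature in page]
print(page_texts)  # e.g. [None, 'First page', 'Second page', ...]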
With version >= 0.4.0, it's also possible to easily implement the extraction of custom features. This is the structure:
from pdf2dataset import extract, feature, PdfExtractTask


class MyCustomTask(PdfExtractTask):

    @feature('bool_')
    def get_is_page_even(self):
        return self.page % 2 == 0

    @feature('binary')
    def get_doc_first_bytes(self):
        return self.file_bin[:10]

    @feature('string', exceptions=[ValueError])
    def get_wrong(self):
        raise ValueError("There was a problem!")


if __name__ == '__main__':
    df = extract('tests/samples', small=True, task_class=MyCustomTask)
    print(df)

    df.dropna(subset=['text'], inplace=True)  # Discard invalid documents
    print(df.iloc[0].error)
- First print:
path page doc_first_bytes ... text wrong error
0 invalid1.pdf -1 b"I'm invali" ... None None image_original:\nTraceback (most recent call l...
1 multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
2 multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
3 sub1/copy_multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
4 sub1/copy_multi_page1.pdf 3 b'%PDF-1.5\n%' ... Third page None wrong:\nTraceback (most recent call last):\n ...
5 multi_page1.pdf 1 b'%PDF-1.5\n%' ... First page None wrong:\nTraceback (most recent call last):\n ...
6 sub2/copy_single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
7 sub1/copy_multi_page1.pdf 2 b'%PDF-1.5\n%' ... Second page None wrong:\nTraceback (most recent call last):\n ...
8 single_page1.pdf 1 b'%PDF-1.5\n%' ... My beautiful sample! None wrong:\nTraceback (most recent call last):\n ...
[9 rows x 8 columns]
- Second print:
wrong:
Traceback (most recent call last):
File "/home/icaro/Desktop/pdf2dataset/pdf2dataset/extract_task.py", line 32, in inner
result = feature_method(*args, **kwargs)
File "example.py", line 16, in get_wrong
raise ValueError("There was a problem!")
ValueError: There was a problem!
Notes:
- `@feature` is the decorator used to define new features
- The extraction method name must start with the prefix `get_` (avoids collisions with attribute names and increases readability)
- The first argument to `@feature` must be a valid PyArrow type (complete list here)
- The `exceptions` param specifies a list of exceptions to be recorded in the DataFrame instead of being raised
- For this example, all available features plus the custom ones are extracted
The resulting "file" is a directory with the structure used by dask with the pyarrow engine; it can be easily read with pandas or dask:
>>> import pandas as pd
>>> df = pd.read_parquet('my_df.parquet.gzip', engine='pyarrow')
>>> df
path page text error
index
0 single_page1.pdf 1 My beautiful sample!
1 sub1/copy_multi_page1.pdf 2 Second page
2 sub2/copy_single_page1.pdf 1 My beautiful sample!
3 sub1/copy_multi_page1.pdf 3 Third page
4 multi_page1.pdf 1 First page
5 multi_page1.pdf 3 Third page
6 sub1/copy_multi_page1.pdf 1 First page
7 multi_page1.pdf 2 Second page
0 invalid1.pdf -1 Traceback (most recent call last):\n File "/h...
There is no guarantee about the uniqueness or order of `index`; you might need to create a new index with the whole data in memory. A page number of `-1` means it was not possible to even parse the document.
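Since the "file" is a regular parquet dataset, it can also be read lazily with dask; a minimal sketch (assuming dask is installed):

import dask.dataframe as dd

# Read the parquet directory lazily, materialize it and build a fresh, unique index
ddf = dd.read_parquet('my_df.parquet.gzip', engine='pyarrow')
df = ddf.compute().reset_index(drop=True)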
Follow the ray documentation for manual or automatic cluster setup. To go distributed, you can run it just like local, but use the `--address` and `--redis-password` flags to point to your cluster (more information). With version >= 0.2.0, only the head node needs to have access to the documents on disk.
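As an illustration, a possible invocation against an existing cluster (the address and password below are placeholders):

# Run the same extraction, but dispatching the work to an existing ray cluster
$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --address 192.168.0.10:6379 --redis-password my_password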
usage: pdf2dataset [-h] [--features FEATURES]
[--saving-interval SAVING_INTERVAL] [--ocr-lang OCR_LANG]
[--ocr OCR] [--chunksize CHUNKSIZE]
[--image-size IMAGE_SIZE] [--ocr-image-size OCR_IMAGE_SIZE]
[--image-format IMAGE_FORMAT] [--num-cpus NUM_CPUS]
[--address ADDRESS] [--dashboard-host DASHBOARD_HOST]
[--redis-password REDIS_PASSWORD]
input_dir out_file
Extract text from all PDF files in a directory
positional arguments:
input_dir The folder to lookup for PDF files recursively
out_file File to save the resultant dataframe
optional arguments:
-h, --help show this help message and exit
--features FEATURES Specify a comma separated list with the features you
want to extract. 'path' and 'page' will always be
added. Available features to add: image, page, path,
text Examples: '--features=text,image' or '--
features=all'
--saving-interval SAVING_INTERVAL
Results will be persisted to results folder every
saving interval of pages
--ocr-lang OCR_LANG Tesseract language
--ocr OCR 'pytesseract' if true, else 'pdftotext'. default:
false
--chunksize CHUNKSIZE
Chunksize to use while processing pages, otherwise is
calculated
--image-size IMAGE_SIZE
If adding image feature, image will be resized to this
size. Provide two integers separated by 'x'. Example:
--image-size 1000x1414
--ocr-image-size OCR_IMAGE_SIZE
The height of the image OCR will be applied. Width
will be adjusted to keep the ratio.
--image-format IMAGE_FORMAT
Format of the image generated from the PDF pages
--num-cpus NUM_CPUS Number of cpus to use
--address ADDRESS Ray address to connect
--dashboard-host DASHBOARD_HOST
Which IP ray webui will try to listen on
--redis-password REDIS_PASSWORD
Redis password to use to connect with ray
- Troubles with high memory usage?
  - Decrease the number of CPUs in use, reducing the level of parallelism; test it with the `--num-cpus 1` flag and then increase according to your hardware.
  - Use a smaller chunksize, so fewer documents will be kept in memory at once. Use `--chunksize 1` to have `1 * num_cpus` documents in memory at once.
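For example, a conservative starting point on a memory-constrained machine (values are illustrative; increase them as your hardware allows):

$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1 --chunksize 1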
Just open issues and/or pull requests; all are welcome 😃!