mutool convert from https://mupdf.com/ seems to always write extracted text to a file.
So I adapted its code a bit to read PDF text into an unsigned char buffer instead.
I also added a thin Rust wrapper so that the same can be done from the Rust side.
The C and Rust code together add up to fewer than 300 lines.
I'm aware of a number of awesome Rust PDF text extractors like pdf-extract.
There's also a Rust binding of mupdf.
However, these libraries appear too large for my use case.
That's the main reason why I started this project.
Please refer to the Makefile.
Not a perfect procedure, but basically:
- Install mupdf (e.g., on macOS, brew install mupdf).
- Adjust the Makefile to your platform (mine is macOS). You may also need to adjust build.rs.
- Run make dylib. This should produce libmuconvert.dylib on macOS.
- Run cargo build.
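For readers wondering what "adjust build.rs" might involve, here is a minimal sketch of the kind of directives a build script typically prints to link against a dynamic library. It is not the project's actual build.rs: the helper names link_directives and emit_link_directives and the library directory are assumptions for illustration, and only the library name (taken from libmuconvert.dylib) comes from the steps above.

```rust
/// Builds the Cargo directives needed to link against libmuconvert.
/// `lib_dir` is a hypothetical directory that contains libmuconvert.dylib.
fn link_directives(lib_dir: &str) -> Vec<String> {
    vec![
        // Tell the linker where to look for the library.
        format!("cargo:rustc-link-search=native={}", lib_dir),
        // Link dynamically against libmuconvert.
        "cargo:rustc-link-lib=dylib=muconvert".to_string(),
        // Re-run the build script when the Makefile changes.
        "cargo:rerun-if-changed=Makefile".to_string(),
    ]
}

/// Prints each directive on its own line, which is how Cargo
/// reads instructions from a build script.
fn emit_link_directives(lib_dir: &str) {
    for line in link_directives(lib_dir) {
        println!("{}", line);
    }
}
```

In a real build.rs, fn main would call emit_link_directives with the directory where make dylib placed the library. On other platforms the library file name and search path will likely differ.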
See muconvert-cli.c as an example.
let filename = "hello.pdf";
// Assign a large enough buffer for the pdf (NOTE below),
// and you'll be fine.
// A possible heuristic for the buffer size is the file
// size in bytes.
let buf: Vec<u8> = vec![0; 103977368];
let text = muconvert_rust::pdftotext(filename, false, true, buf)?;
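The file-size heuristic mentioned in the comments above could be computed with the standard library alone. This is only a sketch: buffer_for_file is a hypothetical helper, not part of muconvert_rust, and the file size is a guess at the needed capacity rather than a guaranteed bound.

```rust
use std::convert::TryFrom;
use std::fs;
use std::io;
use std::path::Path;

/// Allocates a zeroed buffer whose length equals the size in bytes
/// of the file at `path`.
fn buffer_for_file<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
    let size = fs::metadata(path)?.len();
    // Convert u64 to usize explicitly, so a file too large for the
    // target's address space becomes an error instead of a truncation.
    let size = usize::try_from(size)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    Ok(vec![0; size])
}
```

The resulting vector could then be passed as the last argument of pdftotext in place of the hard-coded size.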
NOTE: I know that passing in a pre-sized buffer is a bit awkward. Currently, if the buffer turns out to be too small, one possible way to recover is:
use muconvert_rust::{Error, pdftotext};

let filename = "hello.pdf";
// A very small buffer.
let buf: Vec<u8> = vec![0; 100];
match pdftotext(filename, false, true, buf) {
    // Do something with the extracted text.
    Ok(text) => (),
    // In case of the buffer too small error,
    Err(Error::BufferTooSmall(len, mut buf)) => {
        // Extend the buffer to the reported length.
        buf.resize(len, 0);
        // Retry, and do something with the text.
        let text = pdftotext(filename, false, true, buf).unwrap();
    }
    // Handle other errors.
    _ => (),
}
But I haven't yet found a way to keep the wrapper simple while avoiding this issue. Suggestions are welcome!
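One possible direction, offered only as a sketch, is to hide the grow-and-retry step behind a small helper. To keep the example self-contained, the Error enum below is a stand-in that mirrors only the BufferTooSmall(len, buf) shape used in the snippet above. The Other variant and the extract_with_retry function are hypothetical and not part of muconvert_rust.

```rust
/// Stand-in for muconvert_rust::Error, reduced to what this sketch needs.
#[derive(Debug)]
enum Error {
    /// Required length in bytes, plus the buffer handed back to the caller.
    BufferTooSmall(usize, Vec<u8>),
    /// Any other failure (hypothetical variant).
    Other(String),
}

/// Calls `extract` with a zeroed buffer of `initial_len` bytes. If it
/// reports BufferTooSmall, grows the returned buffer to the reported
/// length and retries exactly once. Other results pass through unchanged.
fn extract_with_retry<F>(mut extract: F, initial_len: usize) -> Result<String, Error>
where
    F: FnMut(Vec<u8>) -> Result<String, Error>,
{
    match extract(vec![0; initial_len]) {
        Err(Error::BufferTooSmall(len, mut buf)) => {
            // Reuse the returned buffer instead of allocating a new one.
            buf.resize(len, 0);
            extract(buf)
        }
        other => other,
    }
}
```

With the real crate, the closure would presumably be something like |buf| pdftotext(filename, false, true, buf), assuming the error type matches the shape shown above. The caller still has to pick an initial length, but no longer has to write the match by hand.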