Extract text from a PDF
The Endeavour 2024-04-20
Arshad Khan left a comment on my post on the less and more utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”
Apparently this is an undocumented feature of GNU less. It works, but I don’t see anything about it in the man page documentation.
Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.
A more portable way to extract text from a PDF would be to use something like the pypdf Python module:
    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())
The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?
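As a rough sketch of one of those options, recent pypdf releases accept an extraction_mode argument to extract_text; the "layout" mode tries to preserve the horizontal positioning of text, which can help with tables. The helper names reader_to_text and pdf_to_text below are my own, not part of pypdf, and the code assumes a pypdf version new enough to support extraction_mode.

```python
def reader_to_text(reader, layout=False):
    """Join the text of every page of a pypdf-style reader.

    Hypothetical helper: `reader` is anything with a `.pages` sequence
    whose items have an `extract_text` method, such as pypdf.PdfReader.
    """
    # "plain" is pypdf's default mode; "layout" tries to keep the
    # horizontal placement of text (assumes a recent pypdf release).
    mode = "layout" if layout else "plain"
    texts = []
    for page in reader.pages:
        text = (page.extract_text(extraction_mode=mode) or "").rstrip()
        if text:  # skip pages that yield no text, e.g. scanned images
            texts.append(text)
    # Separate pages with a blank line.
    return "\n\n".join(texts)


def pdf_to_text(path, layout=False):
    """Open the PDF at `path` and return its extracted text."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    return reader_to_text(PdfReader(path), layout=layout)
```

Calling pdf_to_text("myfile.pdf") should give roughly the same text as the loop above, and pdf_to_text("myfile.pdf", layout=True) switches to the layout mode. Neither mode settles the questions about captions, page numbers, or reading order; they just make different trade-offs.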
PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.
Related post: Your PDF may reveal more than you intend
The post Extract text from a PDF first appeared on John D. Cook.