Your PDF may reveal more than you intend
The Endeavour 2024-02-08
When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.
The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.
Inspecting metadata
Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.
    from pypdf import PdfReader

    def print_metadata(filename):
        print("File: ", filename, "\n")
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])
Let’s run this on the “Hello world” example from the previous post.
    File: humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'
OK, so this shows that the file was created with LibreOffice Writer, version 7.5.
Time and location
It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -06'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.
Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)
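Here’s a minimal sketch of how the time zone can be pulled out of such a time stamp using only the Python standard library. The function name parse_pdf_date is made up for illustration, and the pattern only handles the full form seen in this post (D: followed by fourteen digits and an optional offset), not the shorter forms that PDF dates may take.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches D:YYYYMMDDHHmmSS followed by an optional offset such as
# -06'00' or Z. Shorter date forms are not handled in this sketch.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(s):
    """Convert a PDF date string to a datetime (hypothetical helper)."""
    m = _PDF_DATE.match(s)
    if m is None:
        raise ValueError(f"unrecognized PDF date: {s!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    zulu, sign, off_h, off_m = m.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    else:
        tz = None  # the time stamp carries no offset
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

stamp = parse_pdf_date("D:20240208064322-06'00'")
print(stamp.isoformat())   # 2024-02-08T06:43:22-06:00
print(stamp.utcoffset())   # the offset that hints at location
```

The offset comes back as a timedelta, so comparing it against the offsets of known time zones is straightforward.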
Microsoft Word files
I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.
    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016
So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.
When I print to PDF rather than export to PDF I get slightly different output.
    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1
LaTeX files
Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.
    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5
Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.
I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.
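If you did want to change the metadata, one possible approach is to copy the pages into a new file and write back only the fields you choose. The sketch below uses pypdf, the same library as the inspection script. The function names and the keep allow-list are hypothetical, and this only addresses the document information fields shown in this post; it may not remove every trace of the original software, and the output will likely carry a /Producer field naming pypdf itself.

```python
def choose_metadata(meta, keep=("/Title",)):
    """Return only the info fields named in `keep` (a hypothetical allow-list)."""
    return {k: str(v) for k, v in (meta or {}).items() if k in keep}

def write_scrubbed_copy(src, dst, keep=("/Title",)):
    """Write a copy of `src` to `dst` carrying only the chosen info fields."""
    # Imported here so choose_metadata can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    # Pages are copied one by one into a fresh writer; the source's
    # info fields are not carried over unless added explicitly below.
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(choose_metadata(reader.metadata, keep))
    with open(dst, "wb") as f:
        writer.write(f)

# Example: the Word export above would keep no fields, since it has no /Title.
print(choose_metadata({"/Author": "John Cook", "/Creator": "Microsoft Word"}))
```

An allow-list seems safer than a deny-list here, because a field you did not know about is dropped by default. It would be worth running the inspection script on the output to confirm what is left.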
Operating system
The mention of MacPorts in the PTEX.Fullbanner field shows that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.
When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.
I created a document with Microsoft 365 online and it reported the following.
    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'
The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.
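The observations in this post can be collected into a rough checklist. Here’s a sketch that takes a metadata dictionary like the listings above and reports what it gives away: author name, software, UTC offset, and platform. The function name privacy_hints and the PLATFORM_MARKERS table are made up for illustration, and the two markers are just the ones that came up in this post, not a complete list.

```python
import re

# Hypothetical marker table: substrings in the TeX banner that hint at
# a platform. Only the two cases mentioned in this post are listed.
PLATFORM_MARKERS = {"MacPorts": "macOS", "Debian": "Debian Linux"}

def privacy_hints(meta):
    """Return notes on what a PDF metadata dict reveals (illustrative rules)."""
    hints = []
    if meta.get("/Author"):
        hints.append(f"author name: {meta['/Author']}")
    # Creator and Producer often repeat each other; keep distinct values.
    software = list(dict.fromkeys(
        meta[k] for k in ("/Creator", "/Producer") if meta.get(k)
    ))
    if software:
        hints.append("software: " + ", ".join(software))
    # Collect the distinct UTC offsets found in the date fields.
    offsets = set()
    for key in ("/CreationDate", "/ModDate"):
        m = re.search(r"([+-])(\d{2})'?(\d{2})", meta.get(key, ""))
        if m:
            offsets.add(f"{m.group(1)}{m.group(2)}:{m.group(3)}")
    for off in sorted(offsets):
        hints.append(f"UTC offset: {off}")
    banner = meta.get("/PTEX.Fullbanner", "")
    for marker, platform in PLATFORM_MARKERS.items():
        if marker in banner:
            hints.append(f"platform: {platform}")
    return hints

online = {
    "/Author": "John Cook",
    "/Creator": "Microsoft Word",
    "/CreationDate": "D:20240208084209-08'00'",
    "/ModDate": "D:20240208084209-08'00'",
}
for hint in privacy_hints(online):
    print(hint)
```

To run this on a real file, the dictionary could come from reader.metadata in the inspection script, converting it with dict() if necessary.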
Related posts
- Save as PDF twice and you get different files
- Put PDF properties in a LaTeX file
- Conspicuously missing data
[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.
The post Your PDF may reveal more than you intend first appeared on John D. Cook.