Complete title: Exploitation and Sanitization of Hidden Data in PDF Files — Do Security Agencies Sanitize Their PDF files? PDF: 1aafa6779c77d28afbafcda4e17a69ee1dc3a26164ae60830d9c26a1499b9f7f
A research paper by Supriya Adhatarao and Cédric Lauradoux of Inria:
We have crawled the websites of 75 security agencies of 47 countries and collected 39664 PDF files. For the majority of the files (76%), we were able to recover the authoring process: we identify the PDF producer tool and the Operating System (OS) used by the file's authors. Collecting and analyzing PDF files from the same source over several years can reveal the habits of a given employee. It is possible to learn if he/she update/change (or not) their software regularly. For instance, we found one employee of a security agency who has never changed or updated his/her software during a period of 5 years
In our dataset, 13166(33%) PDF files reveal the identity of the individual who have created the file
We found that, in our dataset 30155 (76%) PDF files include the meta-data information on the PDF producer tool used.
In our dataset, OS details are revealed in 16805 (42%) PDF files
We found complete location of where a file is located for 1814 PDF files (4.5%)
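To get a feel for how little effort this kind of recovery takes, here is a minimal sketch (not the paper's tooling) that scans raw PDF bytes for a few entries of the document information dictionary. It assumes the entries are stored uncompressed as literal strings, such as `/Producer (…)`, so it will miss metadata held in compressed object streams, hex strings, or XMP packets. It also does not decode escape sequences or UTF-16 strings. The function name `extract_info_fields` and the sample bytes are made up for illustration.

```python
import re

# Info-dictionary keys that commonly leak authoring details.
INFO_KEYS = ("Author", "Creator", "Producer", "Title")


def extract_info_fields(pdf_bytes: bytes) -> dict:
    """Return literal-string Info entries found in raw PDF bytes.

    Assumption: values look like "/Key (value)" and are not compressed.
    """
    found = {}
    for key in INFO_KEYS:
        # "/Key", optional whitespace, then a parenthesized string that
        # may contain backslash escapes but no nested parentheses.
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            # latin-1 never fails to decode; UTF-16 values come out garbled.
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical fragment of a PDF file, for illustration only.
sample = (
    b"1 0 obj\n"
    b"<< /Author (jdoe) /Producer (Microsoft Word 2016) "
    b"/CreationDate (D:20200101) >>\n"
    b"endobj\n"
)

print(extract_info_fields(sample))
# {'Author': 'jdoe', 'Producer': 'Microsoft Word 2016'}
```

On a real file you would pass `open(path, "rb").read()`. A proper PDF parser is needed to cover compressed or encrypted documents.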
Unfortunately, the paper doesn't really delve into solutions to remove metadata:
Adobe Acrobat tool […] cleans the metadata and all the hidden content of the PDF file. This is the most complete sanitization tool we have used in our work.
This doesn't take into account metadata from embedded files, such as pictures, scripts, fonts, or other embedded objects, which is often juicier than the metadata of the PDF itself.
But truth be told, because the PDF format is super complex, there is no good™ method to sanitize it for publication. mat2 does it by rendering every page to an image and assembling the images into a new PDF file. Unfortunately, this ruins accessibility. A better solution would be to simply publish documents in plain text.