Title: Paper notes - Exploitation and Sanitization of Hidden Data in PDF files
Date: 2021-03-13 23:00

Complete title: Exploitation and Sanitization of Hidden Data in PDF Files — Do Security Agencies Sanitize Their PDF files?
PDF: [1aafa6779c77d28afbafcda4e17a69ee1dc3a26164ae60830d9c26a1499b9f7f](https://arxiv.org/pdf/2103.02707.pdf)

An investigative paper by Supriya Adhatarao and Cédric Lauradoux from
[Inria](https://www.inria.fr/en):

> We have crawled the websites of 75 security agencies of 47 countries and
collected 39664 PDF files. For the majority of the files (76%), we were able to
recover the authoring process: we identify the PDF producer tool and the
Operating System (OS) used by the file's authors. Collecting and analyzing PDF
files from the same source over several years can reveal the habits of a given
employee. It is possible to learn if he/she update/change (or not) their
software regularly. For instance, we found one employee of a security agency
who has never changed or updated his/her software during a period of 5 years

<span/>

> In our dataset, 13166(33%) PDF files reveal the identity of the individual
who have created the file

<span/>

> We found that, in our dataset 30155 (76%) PDF files include the meta-data
information on the PDF producer tool used. 

<span/>

> In our dataset, OS details are revealed in 16805 (42%) PDF files

<span/>

> We found complete location of where a file is located for 1814 PDF files
(4.5%)
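
To get a feel for how easily this kind of information can be read, here is a minimal sketch that pulls a few entries out of a PDF's Info dictionary using only the standard library. It is not the tooling used in the paper: the function name `peek_pdf_info` and the sample bytes are made up for illustration, and it assumes the entries are stored uncompressed as plain literal strings, which is far from guaranteed in real files.

```python
import re

# Info-dictionary keys that commonly leak authoring details.
INFO_KEYS = ("Author", "Creator", "Producer", "CreationDate", "ModDate")


def peek_pdf_info(raw: bytes) -> dict:
    """Naively pull literal-string Info entries out of raw PDF bytes.

    Assumption: entries are stored uncompressed as plain (...) strings,
    without nested parentheses, hex strings, or encryption.
    """
    found = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key.encode() + rb"\s*\(([^()]*)\)", raw)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical, hand-written fragment of an Info dictionary.
sample = b"<< /Author (jdoe) /Producer (Microsoft Word 2010) /Creator (Word) >>"
print(peek_pdf_info(sample))
```

On a real file, something like `peek_pdf_info(open("report.pdf", "rb").read())` may well return nothing, since metadata can also live in compressed object streams or in an XMP packet. A proper PDF parser is the way to go for anything serious.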

Unfortunately, the paper doesn't really delve into solutions to remove metadata:

> Adobe Acrobat tool […] cleans the metadata and all the hidden content of the
PDF file. This is the most complete sanitization tool we have used in our work.

This doesn't take into account the metadata of embedded files, like pictures,
scripts, fonts, or embedded *objects*, which is often *juicier* than that of
the PDF itself.
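
As a rough illustration of how one might check whether a PDF carries such embedded content at all, the sketch below counts a few telltale name tokens in the raw bytes. This is only a heuristic of mine, not something from the paper or from any sanitization tool: the `PATTERNS` table and the function name are hypothetical, and tokens hidden inside compressed object streams will be missed.

```python
import re

# Name tokens hinting at embedded content that may carry its own metadata.
PATTERNS = {
    "embedded file": rb"/EmbeddedFile\b",
    "javascript": rb"/JavaScript\b",
    "image": rb"/Subtype\s*/Image\b",
    "font program": rb"/FontFile[23]?\b",
}


def count_embedded_markers(raw: bytes) -> dict:
    """Count occurrences of each marker in uncompressed PDF bytes."""
    return {label: len(re.findall(pattern, raw)) for label, pattern in PATTERNS.items()}


# Hypothetical, hand-written fragment mimicking a few PDF dictionaries.
fragment = (
    b"<< /Type /XObject /Subtype /Image >> "
    b"<< /Type /EmbeddedFile >> "
    b"<< /FontFile2 7 0 R >>"
)
print(count_embedded_markers(fragment))
```

A non-zero count only means there is something worth a closer look. A count of zero proves nothing, for the reason given above.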

But truth be told, because the PDF format is [super
complex]({filename}/metadata/some-funny-stuffs-about-pdf.md), there is no good™
method to sanitize it for publication.
[mat2](https://0xacab.org/jvoisin/mat2) does it by rendering every page into a
picture, and assembling those pictures into a new PDF file. Unfortunately, this
ruins accessibility. A better solution would be to simply publish documents in
plain text.

