Cleaning PDF metadata in depth

I already mentioned that the PDF format is a real mess, which makes it non-trivial to process, and thus non-trivial to remove every piece of metadata it could carry.

Some people recommend exiftool for this, despite the warning in its documentation:

All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file.

Indeed, metadata removed this way can be restored with exiftool -pdf-update:all= file.pdf
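To make the warning concrete, here is a toy sketch of why an appended edit deletes nothing. It assumes the edit is stored as an incremental update, meaning a new revision tacked onto the end of the file. The byte strings are simplified stand-ins rather than valid PDFs, and count_eof_markers is a made-up helper, not part of exiftool.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each appended revision normally adds one more."""
    return pdf_bytes.count(b"%%EOF")


# Toy stand-in for a PDF carrying an Author entry (not a valid PDF file).
original = b"%PDF-1.4\n1 0 obj\n<< /Author (Alice) >>\nendobj\n%%EOF\n"

# An incremental update appends a new revision instead of rewriting the file,
# so the bytes of the previous revision stay exactly where they were.
updated = original + b"1 0 obj\n<< >>\nendobj\n%%EOF\n"

print(count_eof_markers(original))  # 1
print(count_eof_markers(updated))   # 2
print(b"(Alice)" in updated)        # True: the old value is still in the file
```

More than one %%EOF marker is only a hint that a file has been updated in place: linearized PDFs, for instance, can contain two markers without any edit history.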

Others are using exiftool and qpdf to:

Append a new version of the metadata with exiftool
Remove unreferenced PDF objects (like old metadata) with qpdf
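As a rough illustration, the two steps above could be scripted as follows. This is a sketch of the commonly quoted recipe, not a recommendation: it assumes exiftool and qpdf are both on the PATH, and build_commands and clean are hypothetical helper names.

```python
import shutil
import subprocess


def build_commands(src: str, dst: str) -> list[list[str]]:
    """Return the two command lines of the exiftool-then-qpdf recipe."""
    return [
        # Step 1: append a revision with the metadata blanked (edits src in place).
        ["exiftool", "-all:all=", "-overwrite_original", src],
        # Step 2: rewrite the file with qpdf, which drops unreferenced objects.
        ["qpdf", "--linearize", src, dst],
    ]


def clean(src: str, dst: str) -> None:
    """Run both steps, failing early if a tool is missing or returns an error."""
    for command in build_commands(src, dst):
        if shutil.which(command[0]) is None:
            raise RuntimeError(f"{command[0]} is not installed")
        subprocess.run(command, check=True)
```

Calling clean("file.pdf", "file.cleaned.pdf") would modify file.pdf in place before writing the rewritten copy, so it is best tried on a throwaway file.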

This method has several drawbacks in my opinion:

Nothing guarantees that your old metadata will actually be removed if it is still referenced somewhere else in your file.
This approach won't clean metadata of files embedded within the PDF.

That's why MAT uses a different approach: it completely re-renders the PDF file onto a Cairo PDF surface and exports the result as a real PDF file, much like a normal, physical print-out.

This ensures that:

Metadata from images is removed, since the images are re-rendered
Videos are transformed into screenshots (this is actually a feature, because it makes video-based fingerprinting much harder)
Weird embedded objects are discarded
JavaScript is disabled (goodbye, exploit kits)
...
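The re-rendering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MAT's actual code: it assumes pycairo and the Poppler GObject bindings are installed, and it picks Poppler as the renderer that draws each page, which is a choice made here for the example. rerender_pdf is a hypothetical name.

```python
from pathlib import Path


def rerender_pdf(src: str, dst: str) -> None:
    """Draw every page of src onto a fresh Cairo PDF surface saved as dst."""
    # Imported lazily so this module still loads without the bindings.
    # Assumption: pycairo and the Poppler GObject bindings are available.
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    # Poppler expects a file:// URI rather than a plain path.
    uri = Path(src).resolve().as_uri()
    document = Poppler.Document.new_from_file(uri, None)

    # The initial size is a placeholder; the surface is resized for each page.
    surface = cairo.PDFSurface(dst, 10, 10)
    context = cairo.Context(surface)

    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        width, height = page.get_size()
        surface.set_size(width, height)
        page.render_for_printing(context)  # only what would be printed is drawn
        context.show_page()                # finish this page, start the next

    surface.finish()
```

A call such as rerender_pdf("input.pdf", "clean.pdf") only carries over what ends up drawn on the page; anything that is not part of the printed output has no way into the new file. Note that Cairo stamps its own producer string into the output, so a real tool would likely want to scrub that too.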

To my knowledge, this is for now the ~~best~~ least bad way to clean a PDF file, but I'll be delighted to be proven wrong ;)

(Oh, and by the way, since several people asked me about this, I set up a GitHub mirror for MAT. Send me pull requests to prove to me that it's worth keeping alive.)

Artificial truth

Tue 25 August 2015