I already mentioned that the PDF format is a real mess, which makes it non-trivial to process, and thus non-trivial to strip of all the metadata it can carry.
Some people recommend exiftool for this, despite the warning in its documentation:
All metadata edits are reversible. While this would normally be considered an advantage, it is a potential security problem because old information is never actually deleted from the file.
Indeed, metadata removed this way can be restored with exiftool -pdf-update:all= file.pdf
Others combine exiftool and qpdf to:
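To make the round trip concrete, here is a minimal Python sketch that only wraps the two exiftool invocations. The helper names (strip_command, restore_command, run) are made up for this example, and it assumes exiftool is installed and on your PATH.

```python
import subprocess


def strip_command(pdf_path):
    # "-all=" asks exiftool to delete every tag it is able to write.
    return ["exiftool", "-all=", pdf_path]


def restore_command(pdf_path):
    # "-pdf-update:all=" discards the update appended by exiftool,
    # which brings the "deleted" metadata back.
    return ["exiftool", "-pdf-update:all=", pdf_path]


def run(command):
    # Hypothetical helper: needs exiftool on the PATH and raises
    # subprocess.CalledProcessError if the command fails.
    subprocess.run(command, check=True)
```

Calling run(strip_command("file.pdf")) and then run(restore_command("file.pdf")) should leave you with the metadata you thought you had removed.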
- Append a new version of the metadata with exiftool
- Remove unreferenced PDF objects (like old metadata) with qpdf
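For reference, the two steps above could look roughly like this in Python. This is a sketch under assumptions: the function names are invented for the example, both tools must be on your PATH, and I'm assuming the commonly suggested qpdf --linearize invocation for the second step.

```python
import subprocess


def append_empty_metadata(pdf_path):
    # Step 1: exiftool appends a new, empty version of the metadata.
    # The file is edited in place; by default exiftool keeps a
    # "<name>_original" backup next to it.
    return ["exiftool", "-all=", pdf_path]


def rewrite_without_unreferenced(src_path, dst_path):
    # Step 2: qpdf rewrites the whole file; objects that nothing
    # references any more are not copied to the output.
    return ["qpdf", "--linearize", src_path, dst_path]


def clean(src_path, dst_path):
    # Hypothetical wrapper running both steps in order.
    steps = (
        append_empty_metadata(src_path),
        rewrite_without_unreferenced(src_path, dst_path),
    )
    for command in steps:
        subprocess.run(command, check=True)
```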
This method has several drawbacks in my opinion:
- Nothing guarantees that your old metadata will actually be removed if it is still referenced somewhere else in your file.
- This approach won't clean metadata of files embedded within the PDF.
That's why MAT uses a different approach: it completely re-renders the PDF file on a Cairo PDF surface and exports the result as a real PDF file, much like a normal, physical print job.
This ensures that:
- Metadata from images is removed, since the images are re-rendered
- Videos are turned into screenshots (this is actually a feature, because it makes video-powered fingerprinting much harder)
- Weird embedded objects are discarded
- JavaScript is disabled (goodbye exploit kits)
- ...
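The re-rendering idea can be sketched in a few lines of Python. This is not MAT's actual code, just an illustration of the technique, assuming pycairo and the Poppler GObject bindings (PyGObject) are installed. The function names to_file_uri and rerender_pdf are made up for this example, and the bindings are imported lazily so the module still loads on machines that lack them.

```python
from pathlib import Path


def to_file_uri(path):
    # Poppler wants a URI, not a plain filesystem path.
    return Path(path).resolve().as_uri()


def rerender_pdf(src_path, dst_path):
    # Assumption: pycairo and Poppler's GObject bindings are available.
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    document = Poppler.Document.new_from_file(to_file_uri(src_path), None)

    # The surface needs an initial size; each page resizes it below.
    width, height = document.get_page(0).get_size()
    surface = cairo.PDFSurface(dst_path, width, height)
    context = cairo.Context(surface)

    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        width, height = page.get_size()
        surface.set_size(width, height)    # must happen before drawing
        page.render_for_printing(context)  # draw as if sent to a printer
        context.show_page()                # finish this output page

    surface.finish()
```

Only what Poppler draws ends up in the output file, so whatever cannot be drawn is simply not carried over.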
To my knowledge, this is for now the least bad way to clean a PDF file,
but I'd be delighted to be proven wrong ;)
(Oh, and by the way, since several people asked me about this, I set up a GitHub mirror for MAT. Send me pull requests to prove to me that it's worth keeping alive.)