I was accepted into the Google Summer of Code under the umbrella of the Tor Project to write a metadata cleaning tool.

Design

Requirements/Deliverables:

A command-line tool and a GUI tool, both with the following capabilities (in order of importance):
1. Listing the metadata embedded in a given file
2. A batch mode to handle a whole directory (or set of directories)
3. The ability to scan files packed in the most common archive formats
4. A nice binding for srm (Secure ReMoval) or shred to properly remove the original file containing the evil metadata.
5. Letting the user delete or modify a specific metadata field
Should run on the most common OS/architectures, especially on Debian Squeeze, since Tails is based on it.
The whole thing should be easily extensible: it should be easy to add support for new file formats.
The proper functioning of the software should be easily testable.

I'd like to do this project in Python, because I have already done some personal projects with it (for which I also used Subversion): an IRC bot tailored for logging, a battery monitor, a simple search engine indexing FTP servers, ...

Why is Python a good choice for implementing this project?

I am "experienced" with the language
There are plenty of libraries to read/write metadata; among them, Hachoir looks very promising since it supports quite a few file formats
It is easy to wrap other libraries for our needs, even if they are not written in Python.
It runs on almost every OS/architecture, which is a great benefit for portability
It is easy to make unit tests, thanks to the built-in unittest module.

Proposed design:

The proposed design has three main components: a library, a command-line tool and a GUI.

The aim of the library (described in more detail in the next part) is to make developing the tools easy. Special attention will be paid to the API it exposes. The ultimate goal is to be able to add support for a new file format to the library without having to rework the code of the tools.

Meta reading/writing library :

A library to read and write metadata for various file formats. The main goal is to provide an abstraction interface (for the file format and for the underlying libraries used). At first it would only wrap Hachoir.
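One way such an abstraction layer could look is sketched below. It is only an illustration: the names MetadataHandler, register, handler_for and DummyTextHandler are hypothetical, and a real backend would wrap Hachoir instead of returning an empty dictionary.

```python
import os
from abc import ABC, abstractmethod


class MetadataHandler(ABC):
    """Hypothetical base class: one subclass per supported file format."""

    extensions = ()  # file extensions handled by the subclass

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def get_meta(self):
        """Return a dict mapping metadata field names to their values."""

    @abstractmethod
    def remove_all(self):
        """Remove every metadata field; return True on success."""


_REGISTRY = {}


def register(handler_cls):
    """Class decorator that maps each declared extension to its handler."""
    for ext in handler_cls.extensions:
        _REGISTRY[ext.lower()] = handler_cls
    return handler_cls


def handler_for(path):
    """Return a handler instance for path, or None if the format is unsupported."""
    ext = os.path.splitext(path)[1].lower()
    handler_cls = _REGISTRY.get(ext)
    return handler_cls(path) if handler_cls else None


@register
class DummyTextHandler(MetadataHandler):
    """Placeholder backend used only to show how a format plugs in."""

    extensions = (".txt",)

    def get_meta(self):
        return {}  # plain text carries no embedded metadata in this sketch

    def remove_all(self):
        return True
```

With this layout, the tools only ever call handler_for(), so adding a format means adding one registered subclass and nothing else.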

Why Hachoir:

Autofix: Hachoir is able to open invalid / truncated files
Lazy: Opening a file is very fast since nothing is read from the file up front; data are read and/or computed when the user asks for them
Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports strings with a charset (ISO-8859-1, UTF-8, UTF-16, ...)
Addresses and sizes are stored in bits, so flags are stored as classic fields
Editor: Using Hachoir's representation of the data, you can edit, insert and remove data and then save the result to a new file.
Meta: Supports a very wide range of file formats

But we could also wrap other libraries to support a particular file format, or write support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of a format and maintaining the code over time is extremely time-consuming). Ideally, the wrapped libraries would be optional dependencies.

One typical use case of the lib is to ask for the metadata of a file; if the format is supported, a list (or maybe a tree) of metadata fields is returned.
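Optional dependencies can be handled by trying to import each backend and simply skipping the ones that are missing. The sketch below assumes a hypothetical helper named load_optional_backends; the module names are only there for the demonstration (json stands in for an installed backend).

```python
import importlib


def load_optional_backends(module_names):
    """Import each named module if it is installed; silently skip the others."""
    available = {}
    for name in module_names:
        try:
            available[name] = importlib.import_module(name)
        except ImportError:
            # The backend is not installed: formats it covers stay unsupported.
            pass
    return available


# "json" stands in for an installed backend, the second name for a missing one.
backends = load_optional_backends(["json", "surely_not_installed_backend_xyz"])
```

The library can then report a format as unsupported when its backend is absent, instead of failing at import time.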

Both the GUI and the command line tool will use this lib.

Command-line and GUI features:

Listing all the meta
Removing all the meta
Anonymising all the meta
Letting the user choose which meta they want to modify
Supporting archive anonymisation
Secure removal
Cleaning whole folders recursively
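As a rough idea of how the command-line side of this feature list could be exposed, here is a sketch using the standard argparse module. The program name and every flag name are assumptions made for the example, not a decided interface.

```python
import argparse


def build_parser():
    """Build a parser whose (hypothetical) flags mirror the feature list."""
    parser = argparse.ArgumentParser(
        prog="mat", description="List or remove metadata from files."
    )
    parser.add_argument("paths", nargs="+", help="files or folders to process")
    action = parser.add_mutually_exclusive_group()
    action.add_argument("--list", action="store_true", help="list all the meta")
    action.add_argument("--clean", action="store_true", help="remove all the meta")
    parser.add_argument("--recursive", action="store_true",
                        help="clean whole folders recursively")
    parser.add_argument("--shred", action="store_true",
                        help="securely remove the original file afterwards")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["--list", "photo.jpg"])
```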

GUI

Essentially, the GUI tool would offer the same features as the command-line tool. I do not have significant GUI development experience, but I'm planning to fix that during the community bonding period.

Timeline

Community Bonding Period (in order of importance)

Playing around with pygobject
Playing with Hachoir
Learning git

First two weeks:

Create the structure in the repository (directories, README, ..)
Create a skeleton
Objective: to have a deployable, working system as soon as possible (even if the list of features is ridiculously short), so that I can then show you my work incrementally and get feedback early.
The lib will handle reading/writing EXIF fields (using Hachoir)
A set of test files (and automated unit tests) to demonstrate that the lib does the job
The beginning of the command-line tool, which at this point must be able to list and delete EXIF meta
An automated end-to-end test to show that the command line tool does properly remove the EXIF
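The kind of automated test meant here could follow the pattern below, written with the built-in unittest module. Since the real library does not exist yet, list_meta and remove_meta are stand-ins that treat lines starting with "#meta:" in a text file as metadata; a real test would run against sample images containing EXIF fields.

```python
import os
import tempfile
import unittest


def list_meta(path):
    """Stand-in for the library: return the '#meta:' lines of a text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.startswith("#meta:")]


def remove_meta(path):
    """Stand-in for the library: rewrite the file without its '#meta:' lines."""
    with open(path, encoding="utf-8") as handle:
        kept = [line for line in handle if not line.startswith("#meta:")]
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(kept)


class RemoveMetaTest(unittest.TestCase):
    def setUp(self):
        # Create a small input file that contains one metadata line.
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write("#meta: author=alice\npayload\n")

    def tearDown(self):
        os.remove(self.path)

    def test_meta_is_listed_then_removed(self):
        self.assertEqual(list_meta(self.path), ["#meta: author=alice"])
        remove_meta(self.path)
        self.assertEqual(list_meta(self.path), [])
        # The actual content must survive the cleaning untouched.
        with open(self.path, encoding="utf-8") as handle:
            self.assertEqual(handle.read(), "payload\n")
```

The important part is the last assertion: a cleaner that removes metadata but damages the payload should fail the test.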

After this first step (making the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so that I can fix problems quickly.

3 weeks

Adding support for (in order of importance) pdf, zip/tar/bzip (just the meta, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe...
For every type of meta, that involves:
Creating some input test files with metadata
Implementing the feature in the library
Asserting that the lib does the job with unit tests
Modifying the cmd line tool to support the feature (if necessary)
Checking that the cmd line tool can properly delete this type of meta with automated end-to-end test

about 1 day

Enable the command line tool to set a specific meta to a chosen value

about 1 day

Implementation of the “batch mode” in the command line, to clean a whole folder
Implementation of secure removal
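A batch mode mostly comes down to walking the folder and handing each file to the cleaner. The sketch below uses os.walk; iter_files and clean_tree are hypothetical names, and the cleaner passed in the demonstration only pretends to support .jpg files.

```python
import os
import tempfile


def iter_files(root):
    """Yield every file path below root, in a deterministic order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # os.walk visits sub-folders in this (now sorted) order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def clean_tree(root, clean_file):
    """Call clean_file(path) on each file; return the paths it reported as cleaned."""
    return [path for path in iter_files(root) if clean_file(path)]


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for relative in ("a.jpg", os.path.join("sub", "b.jpg"), "notes.txt"):
        with open(os.path.join(root, relative), "w") as handle:
            handle.write("x")
    # Stand-in cleaner: pretend that only .jpg files are supported.
    cleaned = clean_tree(root, lambda path: path.endswith(".jpg"))
    cleaned_names = [os.path.relpath(path, root) for path in cleaned]
```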

about 2 days

Add support for deep archive cleanup
Clean the content of the archives
Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of meta, not the content. At first that will include rar, 7zip, ...
The supported formats will be those supported natively by Python: bzip2, gzip, and tar
Create some test archives for each supported format containing various files with metas
Implement the deep cleanup for the format
Assert that the command line passes the end-to-end tests (that is, it can correctly clean the content of the test archives)
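For tar archives, a deep cleanup could rebuild the archive member by member, passing each file through the cleaner and writing it back with neutral header fields. This is a sketch using the standard tarfile module; clean_tar and its clean_bytes callback are hypothetical, and directories, links and devices are ignored to keep it short.

```python
import io
import tarfile


def clean_tar(src_fileobj, dst_fileobj, clean_bytes=lambda name, data: data):
    """Copy regular files from one tar to another, dropping owner and date fields.

    clean_bytes(name, data) is a hypothetical hook where the library would
    clean the content of each member; by default it returns the data unchanged.
    """
    with tarfile.open(fileobj=src_fileobj, mode="r:*") as src, \
            tarfile.open(fileobj=dst_fileobj, mode="w") as dst:
        for member in src.getmembers():
            if not member.isfile():
                continue  # this sketch skips directories, links and devices
            data = clean_bytes(member.name, src.extractfile(member).read())
            # A fresh TarInfo starts with uid/gid 0, empty owner names, mtime 0.
            info = tarfile.TarInfo(name=member.name)
            info.size = len(data)
            info.mode = member.mode
            dst.addfile(info, io.BytesIO(data))
```

Both arguments are file objects opened in binary mode (or io.BytesIO buffers), so the same function can be exercised in tests without touching the disk.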

about 2 days

Add support for complete deletion of the original files
Make a nice binding for shred (should not be too hard in Python)
Implement the feature in the command line tool
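The binding could be as thin as a call through the subprocess module. The sketch below assumes GNU shred is installed and on the PATH; the function names and the default of three passes are choices made for the example.

```python
import shutil
import subprocess


def build_shred_command(path, passes=3):
    """Build the argument list for GNU shred.

    -n sets the number of overwrite passes, -u removes the file afterwards,
    and "--" stops option parsing so that odd file names are handled safely.
    """
    return ["shred", "-n", str(passes), "-u", "--", path]


def secure_remove(path, passes=3):
    """Overwrite and delete path with shred; raise if shred is unavailable or fails."""
    if shutil.which("shred") is None:
        raise RuntimeError("shred was not found on the PATH")
    subprocess.run(build_shred_command(path, passes), check=True)
```

Keeping the command construction in its own function makes it testable without actually destroying a file.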

3 weeks

Implementation of the GUI tool
At this stage, I can use the experience from implementing the command line to implement the GUI, having the same features.

1 week

Add support for more formats (might be based on requests from the community)

Remaining weeks

I want to keep the remaining weeks in reserve in case of problems, and for:

Remaining/polishing cleanup
Bugfixing
Integration work
Missing features
Packaging
Final documentation

Every Week-end

Documentation time: both end-user and design documentation. I do not like to document my code while I'm writing it, because it slows development down a lot, but delaying it too long is not good either; weekends seem like a good compromise.
A blog post, and a mail to the mailing list, about what I have done during the week.

Link to the original blogpost.

Artificial truth


Design of MAT and GSoC timeline
Tue 26 April 2011
