Artificial truth


The more you see, the less you believe.

Design of MAT and GSoC timeline
Tue 26 April 2011

I was accepted into the Google Summer of Code under the umbrella of the Tor Project to write a metadata cleaning tool.

Design

Requirements/Deliverables:

  • A command-line tool and a GUI tool, both with the following capabilities (in order of importance):
    1. Listing the metadata embedded in a given file
    2. A batch mode to handle a whole directory (or set of directories)
    3. The ability to scan files packed in the most common archive formats
    4. A nice binding for srm (Secure ReMoval) or shred to properly remove the original file containing the evil metadata.
    5. Letting the user delete/modify a specific piece of metadata
  • Should run on the most common OS/architectures, especially on Debian Squeeze, since Tails is based on it.
  • The whole thing should be easily extensible: it should be easy to add support for new file formats
  • The proper functioning of the software should be easily testable.

I'd like to do this project in Python, because I have already done some personal projects with it (for which I also used Subversion): an IRC bot tailored for logging, a battery monitor, a simple search engine indexing FTP servers, ...

Why is Python a good choice for implementing this project?

  • I am "experienced" with the language
  • There are plenty of libraries to read/write metadata, among them Hachoir, which looks very promising since it supports quite a few file formats
  • It is easy to wrap other libraries for our needs, even if they are not written in Python.
  • It runs on almost every OS/architecture, which is a great benefit for portability
  • It is easy to write unit tests, thanks to the built-in unittest module.

Proposed design:

The proposed design has three main components: a library, a command-line tool and a GUI.

The aim of the library (described in more detail in the next part) is to make the development of tools easy. Special attention will be paid to the API it exposes, the ultimate goal being to add support for new file formats in the library without changing the code of the tools.

Meta reading/writing library:

A library to read and write metadata for various file formats. The main goal is to provide an abstraction interface (for the file format and for the underlying libraries used). At first it would only wrap Hachoir.

Why Hachoir:

  1. Autofix: Hachoir is able to open invalid / truncated files
  2. Lazy: Opening a file is very fast since no information is read from the file; data is read and/or computed when the user asks for it
  3. Types: Hachoir has many predefined field types (integer, bit, string, etc.) and supports strings with a charset (ISO-8859-1, UTF-8, UTF-16, ...)
  4. Addresses and sizes are stored in bits, so flags are stored as classic fields
  5. Editor: Using the Hachoir representation of data, you can edit, insert, remove data and then save to a new file.
  6. Meta: Supports a very wide range of file formats

But we could also wrap other libraries to support a particular file format, or write the support for a format ourselves, although this should be avoided if possible (it looks simple at first, but supporting different versions of the format and maintaining the thing over time is extremely time-consuming). Ideally, the wrapped libraries would be optional dependencies.

One typical use case of the lib is to ask for the metadata of a file: if the format is supported, a list (or maybe a tree) of metas is returned.

Both the GUI and the command line tool will use this lib.
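
A minimal sketch of what such an abstraction interface could look like, assuming one parser class per file format and a lookup by file extension. Every name here (MetadataParser, register, open_file, PlainTextParser) is hypothetical and only illustrates the idea; a real backend would delegate to Hachoir or another library instead of the toy class shown.

```python
import os
from abc import ABC, abstractmethod


class MetadataParser(ABC):
    """Hypothetical base class: one subclass per supported file format."""

    extensions = ()  # file extensions handled by the subclass

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def get_meta(self):
        """Return a dict mapping metadata names to their values."""

    @abstractmethod
    def remove_all(self, output_path):
        """Write a cleaned copy of the file to output_path."""


_REGISTRY = {}


def register(parser_cls):
    """Class decorator making a parser discoverable by file extension."""
    for ext in parser_cls.extensions:
        _REGISTRY[ext.lower()] = parser_cls
    return parser_cls


def open_file(path):
    """Return a parser for path, or None if the format is not supported."""
    ext = os.path.splitext(path)[1].lower()
    parser_cls = _REGISTRY.get(ext)
    return parser_cls(path) if parser_cls else None


@register
class PlainTextParser(MetadataParser):
    """Toy backend used only to exercise the interface."""

    extensions = (".txt",)

    def get_meta(self):
        return {}  # plain text carries no embedded metadata

    def remove_all(self, output_path):
        with open(self.path, "rb") as src, open(output_path, "wb") as dst:
            dst.write(src.read())
```

With this shape, the tools only ever call open_file() and the two methods, so adding a format means adding one registered class and nothing else.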

The command line and GUI features:

  • List all the meta
  • Removing all the meta
  • Anonymising all the meta
  • Let the user choose which meta they want to modify
  • Support archive anonymisation
  • Secure removal
  • Cleaning whole folders recursively
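
As an illustration only, the feature list could map onto a command-line interface along these lines. The program name "mat" comes from the project name; every flag name is an assumption made for this sketch, not a decided interface. Removing all the meta is assumed to be the default action when neither --list nor --set is given.

```python
import argparse


def build_parser():
    """Build a hypothetical command-line interface; all flag names are illustrative."""
    parser = argparse.ArgumentParser(
        prog="mat", description="List or remove metadata from files.")
    # Listing and setting are alternatives to the default "remove all" action.
    action = parser.add_mutually_exclusive_group()
    action.add_argument("-l", "--list", action="store_true",
                        help="list the metadata and change nothing")
    action.add_argument("-s", "--set", metavar="NAME=VALUE", action="append",
                        default=[], help="set one meta to a chosen value")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="clean whole folders recursively")
    parser.add_argument("--shred", action="store_true",
                        help="securely remove the original file afterwards")
    parser.add_argument("paths", nargs="+", help="files or folders to process")
    return parser
```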

GUI

Essentially, the GUI tool would offer the same features as the command-line tool. I do not have significant GUI development experience, but I'm planning to fix that during the community bonding period.

Timeline

Community Bonding Period (in order of importance)

  • Playing around with pygobject
  • Playing with Hachoir
  • Learning git

First two weeks:

  • Create the structure in the repository (directories, README, ...)
  • Create a skeleton
  • Objective: to have a deployable working system as soon as possible (even if the list of features is ridiculously small), so that I can show you my work incrementally thereafter and get feedback early.
  • The lib will handle reading/writing EXIF fields (using Hachoir)
  • A set of tests files (and automated unit tests) to demonstrate that the lib does the job
  • The beginning of the command-line tool, which at this point must list and delete EXIF meta
  • An automated end-to-end test to show that the command line tool does properly remove the EXIF
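
To give an idea of what such an automated test could look like, here is a sketch using the built-in unittest module. The strip_app1 function is a hand-rolled, simplified stand-in written only to make the test runnable (it drops the JPEG APP1 segments, which is where EXIF lives, and ignores corner cases such as padding bytes between segments). It is not the planned implementation, which would go through Hachoir; segment() and the test data are likewise made up for the example.

```python
import struct
import unittest

SOI = b"\xff\xd8"  # JPEG start-of-image marker
APP1 = 0xE1        # segment type carrying EXIF data
SOS = 0xDA         # start of scan: compressed image data follows


def strip_app1(data):
    """Return the JPEG bytes with every APP1 segment before the scan removed."""
    if data[:2] != SOI:
        raise ValueError("not a JPEG stream")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a segment marker")
        marker = data[pos + 1]
        if marker == SOS:
            break  # everything from here on is copied untouched
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker != APP1:
            out += data[pos:end]
        pos = end
    out += data[pos:]
    return bytes(out)


def segment(marker, payload):
    """Build one JPEG segment (test helper)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


class StripExifTest(unittest.TestCase):
    def setUp(self):
        self.app0 = segment(0xE0, b"JFIF\x00")
        self.exif = segment(APP1, b"Exif\x00\x00secret-gps")
        self.scan = bytes([0xFF, SOS]) + b"\x00\x02pixels\xff\xd9"

    def test_exif_is_removed(self):
        cleaned = strip_app1(SOI + self.app0 + self.exif + self.scan)
        self.assertNotIn(b"secret-gps", cleaned)

    def test_everything_else_is_kept(self):
        cleaned = strip_app1(SOI + self.app0 + self.exif + self.scan)
        self.assertEqual(cleaned, SOI + self.app0 + self.scan)
```

Saved as a module, the tests can be run with python -m unittest.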

After this first step (making the skeleton), I should be able to deliver a working system right after adding each of the following features. I hope to get feedback so that I can fix problems quickly.

3 weeks

  • Adding support for (in order of importance) pdf, zip/tar/bzip (just the meta, not the content yet), jpeg/png/bmp, ogg/mpeg1-2-3, exe...
  • For every type of meta, that involves:
    • Creating some input test files with metadata
    • Implementing the feature in the library
    • Asserting that the lib does the job with unit tests
    • Modifying the cmd line tool to support the feature (if necessary)
    • Checking that the cmd line tool can properly delete this type of meta with an automated end-to-end test

about 1 day

  • Enable the command line tool to set a specific meta to a chosen value

about 1 day

  • Implementation of the “batch mode” in the command line, to clean a whole folder
  • Implementation of secure removal

about 2 days

  • Add support for deep archive cleanup
  • Clean the content of the archives
  • Make a list of unsupported formats, for which we warn the user that only the container can be cleaned of meta, not the content. At first that will include rar, 7zip, ...
  • The supported formats will be those supported natively by Python: bzip2, gzip, and tar
  • Create some test archives for each supported format containing various files with metas
  • Implement the deep cleanup for the format
  • Assert that the command line passes the end-to-end tests (that is, it can correctly clean the content of the test archives)
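
For tar, a deep cleanup could be sketched with the standard tarfile module as below: the archive is rewritten member by member, the container metadata (owner, group, modification time) is reset, and an optional hook cleans each file's content. The names clean_tar and clean_member are assumptions made for this example, and so is the choice of which fields to reset.

```python
import io
import tarfile


def clean_tar(src, dst, clean_member=None):
    """Copy a tar archive from src to dst (binary file objects).

    Container metadata is reset on every member. clean_member(name, data),
    if given, must return the cleaned content of a regular file.
    """
    with tarfile.open(fileobj=src, mode="r:*") as tin, \
            tarfile.open(fileobj=dst, mode="w") as tout:
        for member in tin:
            # Drop the information identifying who packed the file and when.
            member.uid = member.gid = 0
            member.uname = member.gname = ""
            member.mtime = 0
            member.pax_headers = {}
            if member.isfile():
                data = tin.extractfile(member).read()
                if clean_member is not None:
                    data = clean_member(member.name, data)
                member.size = len(data)  # content may have shrunk
                tout.addfile(member, io.BytesIO(data))
            else:
                tout.addfile(member)  # directories, links, ...
```

The hook is where the per-format cleaners of the library would be plugged in, so that files inside the archive get the same treatment as files on disk.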

about 2 days

  • Add support for complete deletion of the original files
  • Make a nice binding for shred (should not be too hard using Python)
  • Implement the feature in the command line tool
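
A binding for shred can be a thin wrapper around the subprocess module. The sketch below assumes GNU shred is installed and uses its -u (remove after overwriting), -z (final pass of zeros) and -n (number of passes) options; the function names and the default of three passes are arbitrary choices for the example.

```python
import shutil
import subprocess


def shred_command(path, passes=3):
    """Build the argument list for GNU shred (kept separate so it can be tested)."""
    return ["shred", "-u", "-z", "-n", str(passes), "--", path]


def secure_remove(path, passes=3):
    """Overwrite and delete path with shred; raise if shred is missing or fails."""
    if shutil.which("shred") is None:
        raise RuntimeError("shred not found in PATH")
    subprocess.run(shred_command(path, passes), check=True)
```

Passing the arguments as a list avoids any shell quoting problem with unusual file names, and the "--" keeps a file name starting with a dash from being read as an option.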

3 weeks

  • Implementation of the GUI tool
  • At this stage, I can use the experience from implementing the command-line tool to implement the GUI with the same features.

1 week

Add support for more formats (might be based on requests from the community)

Remaining weeks

I want to keep those remaining weeks in reserve in case of problems, and for:

  • Remaining/polishing cleanup
  • Bugfixing
  • Integration work
  • Missing features
  • Packaging
  • Final documentation

Every Week-end

  • Documentation time: both end-user and design. I do not like to document my code while I'm coding it, since it slows down development a lot, but it's not a good idea to delay it too much: weekends seem fine for this.
  • A blog post, and a mail to the mailing list about what I have done during the week.
