Artificial truth

The more you see, the less you believe.


MAT2 0.9.0
Fri 10 May 2019 — download

There is a new minor version of MAT2, the 0.9.0, with support for tar/tar.gz/tar.bz2/tar.xz files as its major new feature. It might also be the very last one before the almighty 1.0, maybe, who knows…


  • Add tar/tar.gz/tar.bz2/tar.xz archives support
  • Add support for xhtml files
  • Improve handling of read-only files
  • Slightly improve the command line's documentation
  • Fix a confusing error message
  • Add even more tests
  • Fix a possible mp3-related crash
  • Usual internal cleanups/refactorings

What's happening Debian-side?

The last release, the 0.8.0, came out shortly after the 0.7.0 because of the Debian buster freeze: after the freeze, large/disruptive changes are no longer accepted, only bugfixes. This is why I rushed the release a bit, to get the .epub support in.

On the 28th of February, mat2-0.8.0-1 was uploaded into Debian. The trailing -1 in the package version is Debian-specific, and means that it's the package's first revision.

On the 1st of March, mat2-0.8.0-2 was uploaded to gracefully handle the transition between mat and mat2, by declaring that mat2 breaks (and replaces) mat, and by shipping a transitional package (also named mat, since mat isn't shipped anymore in Debian) that effectively pulls in mat2. The effects of this change can be seen on the following graph:

Popcon graph showing the transition from mat to mat2

But not everything is super green in Debian-land: the nautilus-python package in Buster is still using python2, while mat2 requires python3. Fortunately, a solution was found: the Debian package patches the extension to use the mat2 binary instead of calling libmat2. While this is a super-gross hack, it makes it possible for everyone to clean their metadata via a simple right-click, which is what matters.

All of this is happening because mat2 is actively packaged in Debian by amazing people: the original maintainer for mat used to be intrigeri, but nowadays it's georg and jonas who are taking good care of mat2 in Debian.

What's happening Fedora-side?

Fedora 30 was released, and while I'm quite sure it comes with a lot of amazing stuff, the only thing I care about is the complete migration to Python3 for Nautilus Files and its ecosystem! This means that the mat2 extension now works properly, thanks to the Fedora maintainer, atenart! To be fair, mat2 is only in COPR for now, but getting into Fedora is on the todo list.

Speeding up the CI

mat2 uses Gitlab's CI to run the testsuite on each commit, and at least once every week, on Debian (with and without bubblewrap), Archlinux, Fedora and Gentoo, to ensure that everything works correctly on those platforms. The CI is also used to find potential bugs via linters like pyflakes and pylint, or typing-related issues via mypy (because Python's typing system is awful, reliability-wise). Moreover, this is also how code coverage is enforced, to make sure that all the paths in the codebase are triggered by the testsuite.

A downside of such extensive testing is the time it takes to run: around 5 minutes. While this doesn't sound like a lot, people submitting a merge request want to know quickly whether their code is acceptable or not.

Thanks to georg (again), who provided a privileged gitlab runner, the testsuite is now running on tailored containers, with all the required dependencies already installed, shrinking a whole testsuite run from ~300s to ~90s.

A minor internal naming-related change

I'm not a native English speaker (hence why this blog is mostly made of butchered sentences riddled with horrible grammatical mistakes), and while I can read, write and speak it fluently, I don't know much about its history, where it comes from, what shaped its evolution, the history of its speakers, …

Luckily, one of my flatmates was born in the US and is patient enough to highlight and correct the mistakes I make when I speak English at home. She was also kind enough to hand me a copy of A Person Paper on Purity in Language by Douglas Hofstadter.

Here is a small excerpt:

Most of the clamor, as you certainly know by now, revolves around the age-old usage of the noun "white" and words built from it, such as chairwhite, mailwhite, repairwhite, clergywhite, middlewhite, Frenchwhite, forewhite, whitepower, whiteslaughter, oneupuwhiteship, straw white, whitehandle, and so on. The negrists claim that using the word "white," either on its own or as a component, to talk about all the members of the human species is somehow degrading to blacks and reinforces racism. Therefore the libbers propose that we substitute "person" everywhere where "white" now occurs.

Sensitive speakers of our secretary tongue of course find this preposterous. There is great beauty to a phrase such as "All whites are created equal." Our forebosses who framed the Declaration of Independence well understood the poetry of our language. Think how ugly it would be to say "All persons are created equal," or "All whites and blacks are created equal." Besides, as any schoolwhitey can tell you, such phrases are redundant. In most contexts, it is self-evident when "white" is being used in an inclusive sense, in which case it subsumes members of the darker race just as much as fairskins.

It made me realise that using the terms whitelist/blacklist wasn't as innocuous as I thought it was, so we replaced them with allowlist/blocklist.

Tar files support

Python has a zipfile module for handling zip files, and a tarfile module for handling tar files: they are sufficiently similar to be wrapped in a single parser class in mat2, but also different enough that I spent a whole afternoon and a good chunk of a night trying to make this happen.

To open a zip file, one can use zipfile.ZipFile(). To open a tar file, one can use tarfile.TarFile, except that this will burst into flames with 3 nested exceptions about invalid headers and the fact that the ascii codec can't decode some shit as soon as it's used on compressed files because, as said in the documentation, tarfile.open() should be used instead.
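The difference is easy to reproduce in memory. What follows is only a minimal sketch, not mat2's code: make_tar_gz, list_tar_names and direct_constructor_fails are hypothetical helper names made up for the illustration, while tarfile.open() and tarfile.TarFile are the real standard library entry points.

```python
import io
import tarfile


def make_tar_gz() -> bytes:
    """Build a tiny gzip-compressed tarball in memory (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_tar_names(raw: bytes) -> list:
    """Open the archive via tarfile.open(), which detects the compression."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
        return tar.getnames()


def direct_constructor_fails(raw: bytes) -> bool:
    """Return True if calling TarFile() directly on compressed data blows up."""
    try:
        # The bare constructor expects an uncompressed tar header here.
        tarfile.TarFile(fileobj=io.BytesIO(raw))
    except tarfile.TarError:
        return True
    return False


if __name__ == "__main__":
    archive = make_tar_gz()
    print(list_tar_names(archive))
    print(direct_constructor_fails(archive))
```

The "r:*" mode asks tarfile.open() to try every compression it knows about, which is what makes a single code path work for tar, tar.gz, tar.bz2 and tar.xz.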

To get all the members of a zip file, it's ZipFile.infolist(); for tar, it's TarFile.getmembers(). ZipFile.extract isn't vulnerable to path traversals, but TarFile.extract is. To add stuff to a zip, one uses write, but for tar, it's add. Zip archive members have a filename and a date_time as a 6-member tuple, while tar ones have a name and a mtime as a timestamp.
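To give an idea of what wrapping both modules looks like, here is a small sketch that lists members of either archive type as (name, datetime) pairs. It is an assumption-laden illustration, not mat2's parser class: build_zip, build_tar and list_members are hypothetical names, and treating the tar mtime as UTC is a choice made for the example, since zip timestamps carry no timezone at all.

```python
import datetime
import io
import tarfile
import zipfile

UTC = datetime.timezone.utc


def build_zip(name, date_time):
    """Create an in-memory zip with one member (illustrative helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo(name, date_time=date_time), b"data")
    return buf.getvalue()


def build_tar(name, mtime):
    """Create an in-memory tar with one member (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        info = tarfile.TarInfo(name)
        info.size = 4
        info.mtime = mtime
        tf.addfile(info, io.BytesIO(b"data"))
    return buf.getvalue()


def list_members(raw):
    """Return [(name, naive datetime)] for a zip or a tar archive."""
    stream = io.BytesIO(raw)
    if zipfile.is_zipfile(stream):
        stream.seek(0)
        with zipfile.ZipFile(stream) as zf:
            # Zip: `filename` plus a 6-member `date_time` tuple.
            return [(i.filename, datetime.datetime(*i.date_time))
                    for i in zf.infolist()]
    stream.seek(0)
    with tarfile.open(fileobj=stream) as tf:
        # Tar: `name` plus an `mtime` timestamp (interpreted as UTC here).
        return [(m.name,
                 datetime.datetime.fromtimestamp(m.mtime, tz=UTC).replace(tzinfo=None))
                for m in tf.getmembers()]


if __name__ == "__main__":
    stamp = int(datetime.datetime(2019, 5, 10, tzinfo=UTC).timestamp())
    print(list_members(build_zip("a.txt", (2019, 5, 10, 0, 0, 0))))
    print(list_members(build_tar("a.txt", stamp)))
```

Even in this toy version, every single line that touches a member needs a zip flavour and a tar flavour, which gives a taste of where the afternoon went.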

Amusingly, since tar files support permissions, care had to be taken to correctly handle unreadable/unwritable files, and to restore their permissions after processing.
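One way to do this kind of dance is a context manager that loosens the permissions of an extracted file and puts them back afterwards. This is a sketch under assumptions (a POSIX system, and a hypothetical helper named temporarily_accessible); it is not necessarily how mat2 does it.

```python
import contextlib
import os
import stat


@contextlib.contextmanager
def temporarily_accessible(path):
    """Make `path` readable and writable by its owner, then restore its mode."""
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original_mode | stat.S_IRUSR | stat.S_IWUSR)
    try:
        yield original_mode
    finally:
        # Restore the exact permission bits, even if processing failed.
        os.chmod(path, original_mode)
```

A caller would wrap the processing of each extracted member in `with temporarily_accessible(path):`, so that a read-only file goes back into the cleaned archive with the same permission bits it came with.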

But the best part, the very best one, is about security: for zip files, there is a nice warning, and a safe method. However, for tar files, there is a nice warning, and a … other one, but no safe method to extract stuff. There is a 4-year-old bug open on Python's bugtracker about this, with attached patches to provide a secure-by-default way, but it's still being bikeshedded.

So I implemented checks myself, for:

  • Absolute symlinks
  • Relative symlinks
  • Setuid files
  • Setgid files
  • External symlinks
  • Hardlinks
  • Block devices
  • Character devices

Of course, I'm quite sure that I forgot some interesting cases, and that I'll get a CVE about this sooner or later, but there isn't really a better solution for now.
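To make the list above a bit more concrete, here is a sketch of what such checks can look like when written against tarfile.TarInfo. It is an illustration under assumptions, not mat2's actual code: find_unsafe_reason is a hypothetical name, the path-traversal check on the member name is an extra inspired by the TarFile.extract remark earlier, paths are assumed to be POSIX ones, and the same caveat about forgotten cases applies here too.

```python
import os
import stat
import tarfile


def _escapes(dest: str, target: str) -> bool:
    """True if the normalised `target` lies outside of `dest`."""
    return os.path.commonpath([dest, target]) != dest


def find_unsafe_reason(member: tarfile.TarInfo, dest: str):
    """Return why `member` looks dangerous to extract into `dest`, or None."""
    dest = os.path.abspath(dest)
    if os.path.isabs(member.name):
        return "absolute path"
    target = os.path.normpath(os.path.join(dest, member.name))
    if _escapes(dest, target):
        return "path traversal"
    if member.mode & stat.S_ISUID:
        return "setuid file"
    if member.mode & stat.S_ISGID:
        return "setgid file"
    if member.issym():
        if os.path.isabs(member.linkname):
            return "absolute symlink"
        # Resolve the link relative to the directory holding the symlink.
        link_target = os.path.normpath(
            os.path.join(os.path.dirname(target), member.linkname))
        if _escapes(dest, link_target):
            return "symlink pointing outside of the archive"
    if member.islnk():
        return "hardlink"
    if member.isblk():
        return "block device"
    if member.ischr():
        return "character device"
    return None
```

The idea would be to run this over every entry returned by TarFile.getmembers() and to refuse the whole archive as soon as one of them returns a reason, before extracting anything.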

External services

Because people are lazy, I added a github mirror of mat2, automatically kept in sync because gitlab is magic. Besides making it easier for people to contribute, this allows me to throw mat2's codebase at various static analysers, which didn't find any issue that the open-source trio pylint/pyflakes/mypy hadn't caught before. So maybe all those "fuck you pylint" and "mypy is stupid" commits weren't in vain after all.


Bolting on archive support was an interesting software design problem with no elegant solution (I would be happy to be proved wrong); the rest was mostly bug fixes.

As usual, if you know some Python, help is more than welcome.