Artificial truth

The more you see, the less you believe.


MAT2 0.8.0
Thu 28 February 2019 — download

There is a new minor version of MAT2, 0.8.0, with a single new feature: support for the epub format! This release comes very soon after the previous one, because the Debian Buster freeze is near and some people were really eager to have epub support in mat2 on it, so I wrote the code as fast as I could.

Changelog

  • Add support for epub files
  • Fix the setup.py file crashing on non-utf8 platforms
  • Improve css support
  • Improve html support

Debugging an annoying issue on Debian

While adding support for epub, I stumbled upon an interesting issue: everything was working great, except on the Debian instances of the CI. I tried to reproduce the issue in a debootstrap, but didn't manage to: the testsuite was passing. I tried inside a virtual machine: same behaviour, everything was green.

So I added a lot of calls to print everywhere, to see what was going on in the CI, and this finally boiled down to the infamous UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1210: ordinal not in range(128). How come this exception was silently ignored? Well, it's because it's a subclass of UnicodeError, which is itself a subclass of ValueError, which is the exception raised internally by mat2 when something goes wrong with the parsing of a file.
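To make the swallowing concrete, here is a minimal sketch of the pattern. The functions parse and is_supported are hypothetical stand-ins written for this illustration, not mat2's actual code; only the exception hierarchy is taken from Python itself.

```python
# UnicodeDecodeError sits below ValueError in the exception hierarchy.
assert issubclass(UnicodeDecodeError, UnicodeError)
assert issubclass(UnicodeError, ValueError)


def parse(raw: bytes) -> str:
    """Hypothetical stand-in for a parser; not mat2's actual code."""
    return raw.decode('ascii')  # fails on any byte >= 0x80


def is_supported(raw: bytes) -> bool:
    """Hypothetical caller that treats ValueError as 'cannot parse this file'."""
    try:
        parse(raw)
    except ValueError:
        # A UnicodeDecodeError lands here too, and its message is lost.
        return False
    return True
```

With this shape, a file containing a single non-ASCII byte is reported as unparseable, and nothing in the output hints that the real cause was an encoding problem.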

The fix was simply to specify that mat2 should always use utf-8, by adding encoding='utf-8' to every call to methods/functions related to file-content manipulation, because the Debian instance in the CI apparently doesn't express any preference in its environment, so programs fall back to US-ASCII by default for everything.
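A small sketch of why the explicit argument matters, assuming nothing about mat2's internals: without encoding=, open() falls back to the locale's preferred encoding, which is ASCII in a bare environment. The file name and sample text below are made up for the example, and the ASCII case is simulated by passing encoding='ascii' directly.

```python
import os
import tempfile

# U+00A0 (no-break space) encodes to the bytes 0xc2 0xa0 in utf-8.
text = 'caf\u00e9 \u00a0 epub'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')  # hypothetical file name

    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Explicit encoding: same result whatever LANG/LC_ALL say.
    with open(path, encoding='utf-8') as f:
        roundtrip = f.read()

    # What a locale defaulting to US-ASCII effectively does.
    try:
        with open(path, encoding='ascii') as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
```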

Moreover, to avoid losing time again, mat2 now displays the content of the exception instead of silently swallowing it.

Implementing epub support

The previous version added html support. This was done to support epub, since this format is basically a bunch of html/css files stitched together in a zip archive. I thought this would be pretty easy to implement. I was wrong.

Python has a html parser in its stdlib, but:

  • It's implemented via regular expressions, which is a notoriously bad idea.
  • It's non-validating, meaning that you have to implement validation on top of it. It was a great amount of pain, er, fun to write one.
  • There is a get_starttag_text method to get the start of a tag, but there is no get_endtag_text, so you have to somehow cache the opening tag in a LIFO to be able to transform it as a closing one when needed.
  • Non-validating and not really made for export is a nice combo, because writing a state-machine to modify and validate a pseudo-xml document you're iterating on convinced me that having a drawing board in my room is a valuable investment of space, time, and not going crazy.
  • The bulk of its code was written by Guido himself, in 2001, with only small bugfixes and no major cleanup/overhaul in 18 years.
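The LIFO trick and the validation-on-top idea from the list above can be sketched as follows. This is an illustration built on the stdlib's html.parser, not mat2's parser: the TagStack class and the roundtrip helper are hypothetical names, and the sketch does not handle void elements written without a slash (such as a bare br tag), which a real implementation would need to.

```python
from html.parser import HTMLParser


class TagStack(HTMLParser):
    """Hypothetical re-serialiser; a sketch, not mat2's parser."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # LIFO of currently open tag names
        self.out = []        # pieces of the re-serialised document

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        # get_starttag_text() returns the raw text of the opening tag.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never enter the stack.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # No get_endtag_text(): rebuild the closing tag from the stack,
        # and check nesting on the way since the parser will not.
        if not self.open_tags:
            raise ValueError('closing tag without opening tag: ' + tag)
        expected = self.open_tags.pop()
        if expected != tag:
            raise ValueError('expected </%s>, got </%s>' % (expected, tag))
        self.out.append('</' + tag + '>')

    def handle_data(self, data):
        self.out.append(data)


def roundtrip(markup: str) -> str:
    """Parse markup, validate tag nesting, and return it re-serialised."""
    parser = TagStack()
    parser.feed(markup)
    parser.close()
    if parser.open_tags:
        raise ValueError('unclosed tags: ' + ', '.join(parser.open_tags))
    return ''.join(parser.out)
```

Well-nested input comes back unchanged, while mismatched or unclosed tags raise ValueError, which is the kind of check the stdlib parser leaves to the caller.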

Moreover, the epub specification has different versions, each of them more or less correctly implemented by e-readers. I might have written some ghetto python-scripts to upload a bunch of random epub files to various online validators, scrape their answers, and diff them against the validators' output for the same files after mat2 had cleaned them.

So, yeah, this was fun.

Conclusion

As usual, help is more than welcome.