There is a new minor version of mat2, 0.8.0, with a single new feature: support for the epub format! This release comes very soon after the previous one, because the Debian Buster freeze is near, and some people were really eager to have epub support in mat2 on it, so I wrote the code as fast as I could.
- Add support for epub files
- Fix the setup.py file crashing on non-utf8 platforms
- Improve css support
- Improve html support
While adding support for epub, I stumbled upon an interesting issue: everything was working great, except on the Debian instances of the CI. I tried to reproduce the issue in a debootstrap, but didn't manage to: the testsuite was working. I tried inside a virtual machine: same behaviour, everything was green.
So I added a lot of debugging calls, and the culprit finally showed up:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1210: ordinal not in range(128)
How come this exception was silently ignored? Well, it's because it's a `UnicodeError`, which is itself a subclass of `ValueError`, which is the exception raised internally by mat2 when something goes wrong with the parsing of a file.
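The class hierarchy is easy to check in an interpreter. The following is only a minimal sketch, not mat2's actual code: `parse_or_none` is a hypothetical helper that mimics a broad `except ValueError` handler, to show how a decoding error can vanish without a trace.

```python
# UnicodeDecodeError -> UnicodeError -> ValueError
assert issubclass(UnicodeDecodeError, UnicodeError)
assert issubclass(UnicodeError, ValueError)


def parse_or_none(raw: bytes):
    """Hypothetical helper: decode as ASCII, treating any ValueError as a parse failure."""
    try:
        return raw.decode("ascii")
    except ValueError:
        # A UnicodeDecodeError lands here too, and its message is lost.
        return None


# The 0xc2 byte is not valid ASCII, so decoding fails silently.
result = parse_or_none(b"caf\xc2\xa9")
```

From the caller's point of view, a non-ASCII byte is indistinguishable from any other parsing failure, which is why the issue was so hard to spot.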
The fix was simply to specify that mat2 should always use UTF-8, by adding `encoding='utf-8'` to every call to methods/functions related to file-content manipulation, because the Debian instance in the CI apparently doesn't express a preference in its environment saying that programs shouldn't use US-ASCII by default for everything.
Moreover, to avoid losing time again, mat2 now displays the content of the exception instead of silently swallowing it.
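Both changes can be illustrated with a short sketch. The helper names `read_text` and `read_text_or_none` are hypothetical and do not come from mat2; the only assumption is that files are opened with the built-in `open()`.

```python
import logging
import os
import tempfile


def read_text(path: str) -> str:
    """Hypothetical helper: read a file as UTF-8 regardless of the locale."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which can be US-ASCII in a minimal environment.
    with open(path, encoding="utf-8") as f:
        return f.read()


def read_text_or_none(path: str):
    """Hypothetical helper: report the error instead of swallowing it."""
    try:
        return read_text(path)
    except ValueError as e:
        # Logging the exception makes an encoding problem visible at once.
        logging.error("Unable to parse %s: %s", path, e)
        return None


# Write a small UTF-8 file containing a non-ASCII character, then read it back.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".css", delete=False
) as tmp:
    tmp.write("/* © example */")
content = read_text_or_none(tmp.name)
os.unlink(tmp.name)
```

Passing the encoding explicitly makes the result independent of whatever the environment does or doesn't declare.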
The previous version added html support. This was done to support epub, since this format is basically a bunch of html/css files stitched together in a zip archive. I thought this would be pretty easy to implement. I was wrong.
Python has a html parser in its stdlib, but:
- It's implemented via regular expressions, which is a notoriously bad idea.
- It's non-validating, meaning that you have to implement validation on top of it. It was a great amount of ~~pain~~fun to write one.
- There is a `get_starttag_text` method to get the start of a tag, but there is no `get_endtag_text`, so you have to somehow cache the opening tag in a LIFO to be able to turn it into a closing one when needed.
- Non-validating and not really made for export is a nice combo, because writing a state machine to modify and validate a pseudo-xml document you're iterating on convinced me that having a drawing board in my room is a valuable investment of space, time, and not becoming crazy.
- The bulk of its code was written by Guido himself, in 2001, with only small bugfixes and no major cleanup/overhaul in 18 years.
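To give an idea of what the LIFO trick looks like, here is a minimal sketch built on the stdlib `html.parser.HTMLParser`. It is not mat2's implementation: the `TagRewriter` class and its `finalize` method are hypothetical names, and the sketch deliberately ignores void elements such as `<br>`, comments, and entities.

```python
from html.parser import HTMLParser


class TagRewriter(HTMLParser):
    """Hypothetical sketch: re-emit a document, closing tags from a LIFO."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open tags
        self.out = []    # pieces of the re-serialised document

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        # get_starttag_text() returns the raw text of the opening tag.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # There is no get_endtag_text(), so the closing tag is rebuilt by
        # hand and validated against the most recently opened tag.
        if not self.stack or self.stack[-1] != tag:
            raise ValueError("unexpected closing tag: </%s>" % tag)
        self.stack.pop()
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def finalize(self):
        # Flush buffered data, then make sure every tag was closed.
        self.close()
        if self.stack:
            raise ValueError("unclosed tags: %s" % ", ".join(self.stack))
        return "".join(self.out)


parser = TagRewriter()
parser.feed('<p class="x">hello <b>world</b></p>')
cleaned = parser.finalize()
```

The stack is what provides validation here: a closing tag that doesn't match the top of the stack, or a stack that isn't empty at the end, means the document isn't well-formed.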
Moreover, the epub specification has different versions, each of them more or less correctly implemented by e-readers. I might have written some hacky Python scripts to upload a bunch of random epub files to various online validators, scrape their answers, and diff them against their answers for the same files once cleaned up by mat2.
So, yeah, this was fun.
As usual, help is more than welcome.