My previous blogpost about Python sparked a lot of fuzzing-related questions and discussions, so here is a blogpost featuring bits of my journey fuzzing Python libraries.
I started fuzzing Python's native code parts at work, and wondered how hard it would be to fuzz at a higher level, since the coverage information I got was too low-level to be effective for pure Python code.
Since I'm a grown-up, instead of rushing into writing my own fuzzer, I spent some time browsing the web to see if anyone had already open-sourced something I could contribute to. And of course, I wasn't the first one who wanted to fuzz Python: Yevgeny Pats from fuzzit released pythonfuzz in October 2019:
PythonFuzz is a port of fuzzitdev/jsfuzz, which is in turn heavily based on go-fuzz, originally developed by Dmitry Vyukov, which is in turn heavily based on Michal Zalewski's AFL.
The code looked kinda nice and compact. After reading it, the first thing I did was to profile it using cProfile along with snakeviz, on 100,000 iterations of a simple json fuzzer, and the results weren't pretty.
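For reference, the fuzz target was something along these lines, modeled on the json example shipped with pythonfuzz; the file name and the `--runs` value are illustrative:

```python
import json

from pythonfuzz.main import PythonFuzz


@PythonFuzz
def fuzz(buf):
    # Expected failures (bad utf-8, malformed json) are swallowed;
    # anything else is a crash for the fuzzer to report.
    try:
        json.loads(buf.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        pass


if __name__ == '__main__':
    fuzz()
```

Profiling then boils down to something like `python -m cProfile -o fuzz_json.prof fuzz_json.py --runs 100000`, followed by `snakeviz fuzz_json.prof` to visualize the result.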
Pythonfuzz uses coverage.py, the de facto tool to gather coverage information on Python programs. Unfortunately, it's super-slow. One of the reasons is that it repeatedly calls an abs_file function to get the absolute path of files, without any caching, resulting in a lot of syscalls, thus murdering performance.
As soon as I added caching with @functools.lru_cache above the function, pythonfuzz gained 50% in performance.
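The change really is just the decorator; the function body below is illustrative, since I don't remember coverage.py 4.5.4's exact abs_file implementation:

```python
import functools
import os


@functools.lru_cache(maxsize=None)  # the actual fix: memoise the results
def abs_file(filename):
    """Return the absolute, canonical form of `filename`."""
    # Illustrative body: the real one lives in coverage/files.py; what
    # matters is that it hits the filesystem on every uncached call.
    return os.path.abspath(os.path.realpath(os.path.expanduser(filename)))
```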
I tried to upstream this change, but unfortunately the coverage.py version used by pythonfuzz was a bit old (4.5.4), and the latest one had undergone a complete rewrite of its data storage backend, from in-memory to a sqlite database, making my change obsolete.
So I bumped to the latest available version, and performance dropped significantly.
I sent a pull-request that yields a 30% speedup when connecting to the database, but most of the fuzzing time was still spent moving data in and out of sqlite.
So I completely removed coverage.py and implemented my own tracer. While coverage.py's tracer is written in C, and thus likely delivers a fuckton of raw performance, it's left dead in the water, by a factor of four, by a couple of lines of Python! Turns out that having a lot of features, knobs, and a full database as a backend can (and does) completely obliterate performance.
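A sys.settrace-based tracer fits in a handful of lines. Here is a minimal sketch of the idea (the names are mine, not pythonfuzz's actual code): record every executed (filename, line) pair in a set, and use the set's growth as the coverage signal.

```python
import sys

_covered = set()


def _trace(frame, event, arg):
    # Called by the interpreter for every traced event; we only care
    # about "line" events, i.e. a new line of Python being executed.
    if event == "line":
        _covered.add((frame.f_code.co_filename, frame.f_lineno))
    return _trace


def start():
    sys.settrace(_trace)


def stop():
    sys.settrace(None)


def coverage():
    # A testcase is interesting if it makes this number grow.
    return len(_covered)
```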
Another minor optimization was to use recv_bytes and send_bytes instead of recv/send where possible, since the latter use pickle to serialize data. This yields a bit less than 10% on my local benchmark, by saving a pickle serialization/deserialization per cycle on both the fuzzee's and the fuzzer's side.
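To make the difference concrete, here is a small illustration of the two APIs on a multiprocessing pipe; this is a standalone example, not pythonfuzz's code:

```python
import multiprocessing

parent, child = multiprocessing.Pipe()

# send()/recv() pickle their argument on every call...
parent.send(b"testcase")
assert child.recv() == b"testcase"

# ...while send_bytes()/recv_bytes() move the raw buffer, skipping
# pickle entirely, which is all you need when passing bytes around.
parent.send_bytes(b"testcase")
assert child.recv_bytes() == b"testcase"
```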
There is still a lot of room for improvement, but the fuzzer is now spending around 16% of its time in the fuzzed function, instead of the initial ~3%.
Raw performance is nice, but what about crashes? Well, I've found a couple of bugs in Mutagen in a matter of minutes:
- Timeout while processing a musepack file
- struct.error when processing an mp4 file
- ZeroDivisionError when processing an aiff file
- Timeout while processing an mp4 file
- MemoryError when processing an asf file
- IndexError when processing an apev2/wavpack file
- struct.error when processing an id3 file
- ZeroDivisionError when processing an ogg theora file
- IndexError exception while processing a musepack file
- OverflowError exception when processing an aiff file
- struct.error when processing an asf file
- Timeout while processing an mp4 file
- IndexError while processing an ogg file
I have more cool patches for pythonfuzz locally (performance improvements, an opcode-based libdislocator equivalent, structure-aware mutators, multi-factor guidance, …), but since the project isn't super-active and I'm still waiting for a couple of my pull-requests to be merged or rejected, I'll keep them private for now. I might fork the project at some point, we'll see.