Title: Big Bird: killing sqli with graph homomorphisms
Date: 2022-06-26 15:45

Sometimes in 2017, while [blotus](https://github.com/blotus),
[bui](https://github.com/buixor) and I were brainstorming what mitigations we
wanted in [Snuffleupagus](https://github.com/jvoisin/snuffleupagus), we really
wanted to kill [SQL Injections](https://en.wikipedia.org/wiki/Sql_injection),
but it's a hard problem.

During an evening train ride back from a mission for a customer, we were idly
drinking beers and tossing ideas around on how to solve this intractable
issue. Then I remembered that in 2015, I saw TQ's "Square Peg, Round Hole:
Automatically stopping injection attacks with shape matching" at
[Berlinsides](https://berlinsides.org/index55ea.html?page_id=1635) as well as
[Database Firewall from
Scratch](https://raz0r.name/talks/database-firewall-from-scratch/) by
[raz0r](https://twitter.com/theRaz0r) and
[dnkolegov](https://twitter.com/dnkolegov), and thought that they would be a
good starting point. Except that we would do it at the PHP level instead of at
the database one, granting us more context about what is happening. So I
drafted a design document, in French, under the codename [Big
Bird](https://en.wikipedia.org/wiki/Bigbird) and this blogpost is a rough
translation of it. 


## Main idea

The crux here is to assume that only a fraction of a website's SQL
queries is malicious, and that it's possible to detect those that aren't by
comparing the [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) of the
original request against the AST with user inputs replaced with constants. Replacements
must be done in a (parallelizable) combinatorial way, for every input source:
`GET`, `POST`, `COOKIES`, … 

Doing so opens the door to second order exploits, allowing an attacker to send
substrings of requests to generate false-positives, like sending `SELECT A, B
FROM` to be passed to `SELECT A, B FROM MYTABLE WHERE A.ID = "?";`, resulting
in `CONSTANT MYTABLE WHERE A.ID = "CONSTANT"`. This can be used to arbitrary
block some requests, but on the bright side, it's not impossible to generate
false-negatives, up to
[homomorphisms](https://en.wikipedia.org/wiki/Homomorphisms) of course, but if
your application is vulnerable to data-only SQL attacks, you've got logic
issues that can't be mitigated in a generic way anyway.

A table of legitimate AST can be cached in a hashtable, with something like
`H(file_name | line_number | simplified_stacktrace_to_remove_loops_and_dups |
…)` used as keys to tie legitimate SQL requests AST to where they're actually
used/called from. This can be used to immediately identify legitimate requests,
instead of doing the (costly) "replace user data and compare" dance, if someone
sent a request with the same AST before.

##  Example

Original query: `SELECT * FROM T WHERE ID="$1";`

If `$1` is set to `1`, the query will be `SELECT * FROM T WHERE ID="1"`,
parsed as `S * F T W C="STR"` without *constantification*, and
`SELECT * FROM T WHERE ID="CST"` aka `S * F T W C="STR"` with: they have the
same AST, it's a legitimate query, we should store it into the cache.

If `$1` is set to `1" OR 1 = 1`, the query will be `SELECT * FROM T WHERE ID="1" OR 1 = 1"`,
parsed as `S * F T W C="STR" O 1 = 1"` without *constantification*, and
`SELECT * FROM T WHERE ID="CST"` aka `S * F T W C="CST"`: the two AST are
different, and `S * F T W C="STR" O 1 = 1"` has never been seen before (it's
not in the cache), meaning that this is an injection (or a false-positive).

If `$1` is set to `WHERE`, the query will be `SELECT * FROM T WHERE ID="WHERE"`,
parsed as `S * F T W C="STR"` without *constantification*, and
`SELECT * FROM T CST ID="CST"` aka `CST * F T W C="CST"`, but while the
AST are different, `S * F T W C="STR"` is present in the cache, meaning that
this isn't an injection.

## Limitations

An attacker could DoS its own process by sending a lot of variables, but isn't
able to poison the known-legitimate-AST cache. The major limitation is that if
the user's input is transformed before reaching the database, all bets are off,
but this should™ be quite rare. Another pitfall is that an attacker is able to
generate false-positives for never-seen-before legitimate queries, allowing to
block the execution of arbitrary queries. This can be mitigated by using a
crawler, or using a non-blocking mode for some time, to populate the cache.

## Current and future status

I have a rough proof-of-concept of the implementation, but never published it,
since it was quite brittle, horribly hackish and with a terrifying SQL parser
contraption. I don't plan on implementing it further in Snuffleupagus, but I'm
always happy to review and merge pull-requests if one feels inclined to send
some.

Moreover, if you think of improvements or bypasses, please do [tell
me](https://dustri.org)!

This isn't a fresh idea anyway, similar work has been presented in [2005](https://www.researchgate.net/publication/221215947_Using_parse_tree_validation_to_prevent_SQL_injection_attacks),
and has roots in Robert J. Hansen and Meredith L. Patterson's 2005 paper, [Guns and Butter: Towards Formal Axioms of Input Validation](https://www.blackhat.com/presentations/bh-usa-05/BH_US_05-Hansen-Patterson/HP2005.pdf)
<!-- https://static.miki.it/pdf/using_parse_trees.pdf -->