Paper review: Statistical and Combinatorial Analysis of the TOR Routing Protocol
Sun 17 January 2021 — download

My previous article on a paper from Éric Filiol about Tor was well-received and sparked a lot of interesting conversations, so here is a review of a similar one: Statistical and combinatorial analysis of the TOR routing protocol: structural weaknesses identified in the TOR network (local mirror), written again by people from the Laboratoire de Virologie et de Cryptologie Opérationnelles (C + V)O from the ESIEA: É. Filiol, and two of his interns: Nicolas Job and Maxence Delong (now pursuing a doctorate on the topic).

Nicolas Job was credited as being both from the (C + V)O laboratory and from the "Department of Defense, Paris, France": this is plain wrong. He was doing a 6-month internship for his ESSI degree, a certification delivered by the ANSSI consisting of 7 months of courses plus an internship. The ANSSI reports to the Secretariat-General for National Defence and Security (SGDSN), which, despite its confusing name, doesn't report to the Ministry of Armed Forces (the equivalent of the USA's DoD or the UK's MoD) at all, but to the prime minister instead. For the curious minds, the reason for such a separation in France is that the SGDSN's role, amongst others, is to define and enforce rules regarding Classified information: it wouldn't be possible for it to inspect the Army if it had to report to it.

I previously wrote the following about É. Filiol, and I'm still standing by my word:

Eric Filiol is known for pretending to have broken AES in 2002 (he didn't), and in 2003 (he still didn't), and Tor in 2011 (he didn't either), and for being the architect and designer of DAVFI, a French "new generation anti-malware solution", known for being a phenomenal (and extraordinarily expensive) source of fun.

The paper was published in the Journal of Computer Virology and Hacking Techniques in March 2020, apparently because it was presented at the 2nd International Workshop on FORmal methods for Security Engineering in 2018. It was also presented at The International Conference on Information Systems Security and Privacy 2018, and at the 13th International Conference on Cyber Warfare and Security (ICCWS 2018) the same year. É. Filiol also presented at this conference in 2017 ("Automated Intelligence Gathering Through Comparison of JPEG Images and their Thumbnails"), 2016 ("Combinatorial Optimization of Operational (Cyber) Attacks against Large-scale Critical Infrastructures: The Vertex Cover Approach"), and 2014 ("Critical Infrastructures: Where we stand today?").

All of those conferences claim to have strict review processes.


Introduction

The various computer networks such as the Internet today allow everyone to have many means of communication. However, this access is only possible if the user has an IP address. This condition therefore allows entities that have access to the network flows, to analyze it and thus to retrieve information about particular users.

"IP" stands for Internet Protocol: you can do networking without IP address, like Appletalk. Moreover, there are ways/systems to achieve decent anonymity even against an attacker who has a global view of the network, like pond for example.

Communities have been formed to defend the anonymity of each other and have developed networks whose purported purpose is to enable people to navigate without anyone being able to identify the origin and destination of the communication. These networks therefore seem to be an ideal tool for criminals and terrorists of all kinds, as they can carry out their actions with little trace.

Privacy is a fucking elementary human right. Moreover, if we want to go the bad faith way:

  • Roads enable people to transport goods without oversight, making them ideal to transport drugs, weapons and bodies.
  • Butchers wield knives all day long, making it an ideal profession for wannabe murderers.
  • Bars and pubs are loud and anonymous places, making them ideal for conspiracies.
  • Libraries are full of horrible books, making them ideal places to radicalise the youth.

The purpose of this article is therefore to study a particular network of anonymization, both for clients and services: The Onion Router (TOR for short). Its purpose is to protect the confidentiality, integrity and availability of exchanges within its organization. This network relies on a community that uses a particular network protocol.

As usual, it seems that there is a confusion about the terminology:

  • The Tor Project is a non-profit (charity) organization that maintains and develops the Tor software.
  • tor is a program to use the Tor network.
  • The Tor Network is the set of all Tor relays.

Amusingly, there is the following related paragraph on the Tor Project website:

Note: even though it originally came from an acronym, Tor is not spelled "TOR". Only the first letter is capitalized. In fact, we can usually spot people who haven't read any of our website (and have instead learned everything they know about Tor from news articles) by the fact that they spell it wrong.


The open source code is documented, but many aspects such as route construction are only briefly documented. This is contrary to the principles of security, since TOR’s security of this network is based on a considerable amount of obscurity and secrecy.

This is just plain bullshit. The main implementation is completely open-source, and people even wrote a lot of different implementations in java, erlang, haskell, go, in go again, C++ in a ~20kb binary, javascript, Python, Python in ~1200 lines of code, Python again in ~1000 lines of code, C#, rust, …

If the specification was cryptic or unclear, it's unlikely that so many people would have written this amount of implementations.

The gotor author even said:

The network has a specification document, but carefully following it will result in an implementation that does not work (this has now been corrected). This posed some challenges, but luckily the CTor source code is relatively easy to read.

Viva64, the company behind the static analysis suite PVS-Studio, known to regularly publish blogposts showcasing errors/mistakes they found in popular open-source projects, wrote:

My congratulations to the authors of the Tor project. I didn't manage to find any errors after the analysis by PVS-Studio static code analyzer. We write such words very rarely, so the authors may really be proud. They do deserve a medal for the high-quality code.

And even if there weren't any kind of documentation, the consensus format is plain text, and pretty obvious:

Consensus illustration
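
For readers who can't see the image, here is roughly what a single relay entry in the consensus looks like (the values below are made up, but the r/s/v/w/p line types are the real ones from dir-spec.txt):

    r ExampleRelay pGvuvgt82tr3Rj1MqmYj7aZPqGE AAAAAAAAAAAAAAAAAAAAAAAAAAA 2021-01-17 12:00:00 192.0.2.10 9001 0
    s Fast Guard Running Stable V2Dir Valid
    v Tor 0.4.4.6
    w Bandwidth=12000
    p reject 1-65535

One line per attribute, readable with the naked eye: hardly "obscurity and secrecy".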


Therefore, would it be possible for an attacker who controls a limited number of routers to be able to control and impact a significant part of the traffic and hence more or less would be able to damage TOR's security?

Why the "therefore"? Even if tor wasn't open-source (it is), with no documentation, nor specification (tor is heavily publicly specified), it wouldn't automatically means that an attacker controlling nodes would be able to perform attacks.

In a second part, the analysis of this problem from the point of view of graph theory and sub-problems of optimal paths and vertex cover has made it possible to identify particular reduced subsets of nodes on which a large proportion of the TOR traffic depends. Taking control of these nodes by an attacker could significantly damage the security of the TOR network.

If the goal was to find which nodes are worth compromising, that data is already public, and getting it doesn't involve any fancy mathematics.
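
For instance, a minimal sketch using Onionoo (the Tor project's public API for relay data; the tool choice is mine, not the paper's) to list the relays carrying the most consensus weight:

    import json
    import urllib.request

    # Ask Onionoo for the 10 running relays with the highest consensus weight.
    URL = ("https://onionoo.torproject.org/details"
           "?running=true&order=-consensus_weight&limit=10"
           "&fields=nickname,or_addresses,consensus_weight_fraction")

    with urllib.request.urlopen(URL) as response:
        relays = json.load(response)["relays"]

    for relay in relays:
        print(relay["nickname"],
              relay["or_addresses"][0],
              relay.get("consensus_weight_fraction", "n/a"))

No graph theory, no vertex cover: a single HTTP request.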

Description of the tor protocol

The Tor Project Inc. is a non-profit foundation located in Massachusetts. It was founded in December 2006 by seven computer scientists, including Roger Dingledine and Nick Mathewson. It is primarily responsible for software maintenance — including the Tor Browser and Tail (sic.) — for the Tor anonymous net- work.

While the Tor Project Inc. is indeed maintaining the Tor Browser, Tails is developed independently.

Today the organization is funded by donors and government agencies, including the US Department of State Bureau of Democracy and the German Federal Foreign Office (see (Delong et al., 2018) for a deep OSINT analysis of the Tor project and foundation).

Tor's sponsors are way more diverse than just two clickbaity government agencies, and the GFFO only funded Tor in 2015.

The cited paper is rubbish, and the main author isn't Delong, but Filiol: he's citing his own paper.

The Tor network consists of about 12,000 routers divided into two families, relays and bridges. The Tor Foundation publishes almost all the information in the first category, and quite nothing for the second category in order to prevent a state from denying access to the network because it does not have the IP addresses of these hidden nodes. The Tor network is organized into several levels where routers have multiple roles.

It seems that there is another terminology confusion:

  • A bridge is a non-public relay
  • A relay is a router and is a node

Four Authority levels are identified: - Nine Directory authorities and Bridge authorities. - Directory caches - Fallback directory mirrors - HSDir

It's a weird custom way to classify things, since there are no mentions of any "Authority levels" anywhere in the specification nor in any implementation that I'm aware of. It seems that there is another confusion here, since the four "authority levels" are in fact flags, and there are not four of them, but twelve main ones: "Authority", "BadExit", "Exit", "Fast", "Guard", "HSDir", "NoEdConsensus", "Stable", "StaleDesc", "Running", "Valid" and "V2Dir", as well as various additional ones, like "ReachableIPv6", "IPv6Exit", "FallbackDir", …
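
Checking which flags actually exist in a live consensus is a one-minute job; here is a quick sketch using the stem library (my tool choice, the paper doesn't say what was used):

    from collections import Counter

    from stem.descriptor.remote import get_consensus

    # Download the current consensus and count how many relays carry each flag.
    flag_counts = Counter()
    for router in get_consensus():
        flag_counts.update(router.flags)

    for flag, count in flag_counts.most_common():
        print(f"{flag}: {count} relays")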

the Guard node is an input router to the Tor net- work. It also has the role of non exit router and can therefore process internal network flows.

An exit relay can also be a guard; it's just that, since exits are scarce, they are not prioritized for the guard position.

the Bridge node (TOR Foundation, 2014d) is a router whose information is only partially published and whose purpose is to allow customers to bypass censorship.

There are no customers, only users.

As for this particular class, we have developed an automatic ex- traction procedure which is non public. At the present time, a first list of more than 2,500 bridges (on a total number of 3,000-3,500) have been extracted (Filiol et al., 2017) as well as many information about the way bridges are managed behind the scene. Recently the Tor foundation has increased the number of bridges but this does not change anything with respect to our extraction procedure.

Just like E. Filiol found a magical way to break AES, but it's non-public, pinky swear. This kind of rubbish claim has no place in a scientific paper.

The introduction point is a router (relay type) connected to a service hidden by a particular circuit and which allows a client to submit an appointment request for establishing a connection.

There is no such thing as an "appointment request" in the protocol; this looks like a bad double translation (English → French → English) of "rendezvous point".

The Tor protocol manages two types of communications, either to the Internet network (external circuits, three routers) or to its internal network (internal circuits, six routers) through which an end-to-end flow will be exchanged between a client and a server.

"Exchanging an end-to-end-flow" isn't proper English, and I guess should be understood as "an end-to-end encrypted connection will be established between a client and a server".

To build TOR routes (or circuits), clients and servers of the Tor network build routes consisting of three second level routers

No idea what a "second level router" could be.

Statistical analysis of the tor routing protocol

We essentially focuses on the public routers in the rest of the papers (external circuits). As far as bridges are concerned, it is sufficient to say that we have obtained similar results (6-node circuits) as for public routers. Other information are not public.

Pinky swear!

The Tor routing protocol relies on the consensus file which contains all the available TOR routers

All of them except bridges.

A very first analysis (see Figure 1) clearly shows that the distribution of routers is far from being uniform and exhibits a wide dispersion.

Here is the Figure 1.

Fig 1.

There are no units on either axis, there is a green line named g(x), and apparently the violet crosses are "proba_middle.dat" u 2 (most likely gnuplot's default key title when plotting column 2 of a data file, "u" being short for "using").

In the minds of the general public, the choice of each router is made randomly. This popular belief is false insofar as it implies that each node has the same probability of being chosen.

I call bullshit on this. Of course nodes don't have an equiprobable chance of being selected: stabler, bigger, older, … relays have more chance of being picked than a brand new slowpoke blinkenlight potato, if only for scalability's sake, which the paper actually agrees with later on.

Building a circuit requires the choice of three separate routers. In the case of an external route, the protocol will have to choose a guard, a middle and a exit node. To make this choice, the protocol groups routers by type. If necessary, it eliminates routers that are not specification-compliant. Then, it computes the different sums of the weights and chooses a random value $$r$$ between $$0$$ and this sum. Finally, it scrolls through the list of the type considered by summing up to $$i$$-th rank such as $$\sum_{j=0}^{j=i-1} weight_j < r$$ and $$\sum_{j=0}^{j=i} weight_j > r$$

The mathematical formula only means that the client will split the list of relays in two, but doesn't explain why.
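
To make the quoted paragraph concrete, here is a minimal sketch of the selection it describes: plain weighted sampling, where a random point r is drawn on the cumulative weight line and the relay whose interval contains it is returned (the real tor code additionally applies position-dependent bandwidth-weights and the AS, /16 and family exclusions discussed below):

    import random

    def pick_weighted(relays):
        """relays: list of (nickname, weight) tuples; returns one nickname."""
        total = sum(weight for _, weight in relays)
        r = random.uniform(0, total)
        cumulative = 0
        for nickname, weight in relays:
            cumulative += weight
            if r < cumulative:
                return nickname
        return relays[-1][0]  # floating-point safety net

    # Toy example: the big relay gets picked roughly 9,000 times more often
    # than the potato.
    relays = [("big-exit", 90_000), ("average-relay", 9_000), ("potato", 10)]
    print(pick_weighted(relays))

The "split in two" happens at the index $$i$$ where the running sum crosses $$r$$: heavier relays own larger intervals and are therefore proportionally more likely to be chosen.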

This choice is not a priori questionable since it weighs certain routers according to their stability, duration of network activity and available bandwidth. However, this weighting is already partly undermining the supposed equiprobability.

There is no "supposed equiprobability".

Moreover, there are also other factors, like the AS and the /16 subnet, because you don't want all your 3 relays to be too close to each other, for example.

Indeed, if it is possible to favor certain routers, it is therefore possible for the Tor foundation to restrict the choice of routers to particular subsets.

The Tor foundation doesn't operate any relays, let alone authorities. And of course, should 6 or more authority operators conspire together to serve a skewed view of the network, nothing would prevent them from doing so, by design, since they're the ones creating the consensus.

Theoretical Statistical Model

Since the open source code of Tor is very large and complex, we first implement the automatic extraction of all relevant data the writing of routers given by the details file and to analyze them.

"We were too lazy to actually read the source code, so we tried to make sense of the raw consensus data instead".

%% C_{middleguard} = \frac{C_{guard} \times C_{middleguard}}{C_{guard} - C_{middleguard}} %%

This one is actually hilarious:

%% \begin{aligned} C_{middleguard} &= \frac{ C_{guard} \times C_{middleguard} }{ C_{guard} - C_{middleguard} } \\ (C_{guard} - C_{middleguard}) \times C_{middleguard} &= C_{guard} \times C_{middleguard} \\ C_{guard} \times C_{middleguard} - C_{middleguard}^2 &= C_{guard} \times C_{middleguard}\\ - C_{middleguard}^2 &= 0\\ C_{middleguard} &= 0\\ \end{aligned} %%

The other formulas don't make any more sense.

$$C_{middle\_seul}$$

Nice frenglish here, "seul" meaning "alone".

Indeed, the specifications of the Tor network require that two routers on the same circuit do not belong to the same x.y.z.w/16 subnetwork (class B) as well as to the same (Johnson et al., 2013) family.

Nobody is using classful networks anymore. Also, it's weird to cite a paper from 2013 instead of referring to the current specification.

This has the effect of influencing client choices. Let us precise that a highly multi-threaded implementation enables to process any new version of the consensus file in less than 5 minutes (around 10 billions of possible routes).

This is pure nonsense: what does "processing" mean? Metabase is able to load the consensus instantly. This reads like a bragging attempt from the intern who implemented whatever processing was "needed" for the paper. And where is the number "10 billions" coming from? With ~7,000 relays and ~1,500 bridges, even after excluding relays in the same /16, in the same AS, or with the same family, the number of possible routes is way higher than 10¹⁰.
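
A back-of-the-envelope check of that last claim, counting ordered triples of distinct relays out of roughly 7,000 (the /16, AS and family exclusions only shave off a modest fraction of those):

    n = 7_000  # rough number of public relays at the time
    print(f"{n * (n - 1) * (n - 2):.2e}")  # ~3.43e+11, well above 10**10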

Our theoretical model must now be compared with what is observed within the Tor network in order to validate it or not.

It's not a theoretical model, since it is entirely based on the consensus data. Taking data, deriving some pseudo-formula from it, and then checking that the formula matches the same data is in fact an interesting approach.

In order to have a representative sample with a margin of error of 1% and a confidence level of 99%, we must consider at least 16,500 TOR routes.

Where are those numbers coming from‽ And how can taking 16,500 routes be enough to have a confidence level of 99% when there are "around 10 billions of possible routes"?

Figure 2 represents the occurrences obtained for the different types of routers when generating routes and shows that, according to the established model, they follow a power law. Therefore, the weight assigned to each router is the parameter that influences the choice made by the protocol (see Table 2).

Here is Figure 2.

Fig 2.1 Fig 2.2 Fig 2.3

It's interesting to notice that a logarithmic scale is used but, instead of labelling the axes as such, the logarithm is computed on the data itself, forming a weird variant of the log-log plot usually used in economics, where a straight line would indicate a power law: it's obviously not the case here.
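
For reference, the usual way to eyeball a power law is to put the raw values on logarithmic axes and look for a straight line. A minimal sketch (numpy/matplotlib, with made-up heavy-tailed weights since the paper's data isn't published):

    import matplotlib.pyplot as plt
    import numpy as np

    # Fake, heavy-tailed "relay weights", sorted from biggest to smallest.
    weights = np.sort(np.random.pareto(a=1.5, size=5_000))[::-1] + 1
    ranks = np.arange(1, len(weights) + 1)

    # On log-log axes a power law shows up as a straight line, while
    # log-normal or normal data visibly bends.
    plt.loglog(ranks, weights, marker=".", linestyle="none")
    plt.xlabel("relay rank")
    plt.ylabel("weight (arbitrary units)")
    plt.show()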

It is also important to look at the case of internal circuits (used to access to hidden services). In this case, the probability of the third router on the circuit cannot be established directly from the data provided by the Tor foundation. It was therefore necessary to identify the proportions of each type of router (guard, middle, exit) positioned in third position.

It's more complicated than that, since you want to avoid reusing the same relay anywhere in your 6-relay (or 3-relay, when using non-anonymous hidden services) chain when accessing a hidden service.

In addition, the study was conducted on a small but representative sample size but it is acceptable that on a large number of simulations, the results would converge to the theory.

"We only did a few tests that kinda matched, but we believe that should we have done it the right way, it should still work."

This is not how research is supposed to be done.

Combinatorial analysis of the tor routing protocol

Indeed, the previous study of routes has focused only on routers and not on the links between nodes.

This is nonsense: people have been studying tor graphs in almost every way, including routes; just search for "path" or "flow" on anonbib.

Without loss of generality, this new model deals with at external circuits only and when the input router is a guard. Additional results can be obtained by contacted the first author.

Notwithstanding the broken English, it seems that we need to contact an author to get the actual data/model used. Maybe the next step will be a DLC with a season pass to get everything that should have been included in the paper?

The rest of this part is about computing various distributions of the routes, to find that the more bandwidth a node has, the higher its probability of being picked for routes.

Discussion about the tor security

Results presented in Section 4 (especially Tables 4 and 5) prove that it is enough to target a reduced subset of nodes (only 1,463 nodes (1,217 middle, of which 1,030 guards and 246 exits) to control 50% of the traffic in Table 4) .

I wouldn't use the word "only" to talk about ~1500 nodes: we're talking about 20% of the tor network.

Thus, it would be able to remove the anonymity of users — clients and hidden services — by tracking packets end-to-end through special patterns identified or set up in the packets. It seems that the FBI operated in this way when it dismantled the silk road in 2014

This is pure fantasy.

These node subsets of higher interest are then privileged targets for DDoS (mandatorily from outside the Tor network) or coordinated targeted malware attacks. Therefore, it is interesting to carry out a security analysis of the ports and services which are opened for each of them, since they can constitute potential entry points (a vulnerability scan to look for 0-day vulnerabilities is also a required step).

If you have an RCE in tor, there is no need to target specific nodes since you can just spray. If you have an RCE in Linux, there are easier and better ways to get control of the network…

Also, I'm curious about what the authors meant by "a vulnerability scan to look for 0-day vulnerabilities".

We have used a massive scan tool designed in our lab to analyze top significant routers involved in 50% of the Tor traf- fic, as well as the first thousand ports and a few ot- her frequently used ones

I guess nmap isn't hype enough?

In addition, these ports are possible targets that can be overwhelmed by syn flooding attacks.

We're not in the '90s: SYN flooding isn't a thing anymore.

As far as bridge routers are concerned, a corrupted machine in middle position could count the number of network frames it is relaying when constructing a circuit. It could then determine its own position and hence compare the IP address of the machine right before its own position with those listed by the foundation. It would enable to determine whether the input router is a bridge or not. In order to prevent such an attack, it is envisaged to always insert a guard router between the bridge and the middle router. Thus, hidden nodes would be drowned within network users.

This attack is outlined in Roger Dingledine's Ten ways to discover Tor bridges from 2011, summarized in a blogpost from the Tor project. Also, the solution proposed here is naïve compared to the one from Dingledine's paper.

Our study tends to prove that the security of the Tor network is not optimal. Indeed, since overall security is based on the individual good practices of the router owners, the maximum level that will be reached will not exceed that of the weakest link.

This is also bullshit: the tor network is designed to accommodate a minority of malicious/weak/misconfigured/potato/… nodes. Imagine if this wasn't the case, and if the proposition was true: a single malicious node would bring "the security of the Tor network" down to zero.

It is also important to note that local security (of nodes and servers, protocol strength...) is not sufficient as soon as we deal with a worldwide infrastructure.

I don't understand where this is coming from, or even what it means.

Having a more global and higher point of view is also important. In this respect, our study has showed the Tor infrastructure too much relies actually on a reduced number of nodes.

The diversity problem has been known for years, and there is nothing novel presented here.

Amusingly, a couple of graphs are presented in the appendix of the paper, to illustrate that whatever they are showing (there are no units on them) is a power-law. Unfortunately, it looks way more like a normal distribution:

Figure 5: Inverse cumulative density functions of data compared to the fitted law. The red dotted line (resp. green and blue) describes the power law (resp. log-normal and exponential law).

Fig0 Fig1 Fig2 Fig3

Figure 6: Inverse cumulative density functions of data compared to the fitted law (1-billion top TOR routes). The red dotted line (resp. green and blue) describes the power law (resp. log-normal and exponential law).

Fig4 Fig5 Fig6 Fig7

They are respectively designated as "All OR", "OR guard", "OR middle", "OR exit", "All OR", "OR guard", "OR middle" and "OR exit".


Conclusion

Like the previous paper, this one is also written in broken English, full of silly mistakes (which could be excusable if the paper hadn't been reviewed by 3 different conferences and a journal), filled with approximations, sheer clickbaity inventions, and plain errors about Tor.