Automatic Protocol State Machine Generation
Being able to build a state machine representing a protocol using only minimal packet header information could be useful as a lightweight recogniser for protocol identification without payload, or for detecting obfuscated/encrypted protocols with few distinguishing features.
Currently this work involves building an SMTP state machine from object traces (using only object direction and size) which are clustered together using WEKA and built into a tree, then trying to minimise the tree by merging transitions that are similar. Transitions are merged based on the similarity of the object size distributions using a Kolmogorov Smirnov test.
The result of this can be treated as a non-deterministic finite state machine and used to recognise flows as being SMTP.
Generated a full dataset for Chris using all the available AMP monitors.
Looking at the network maps he produced showed up some unusual
topology that I didn't expect to see and didn't agree with, so I
looked further into what was going on with the traceroute data. It turns
out that quite a few devices in the path respond on multiple addresses, so
I spent some time investigating de-aliasing using scamper. Using
the radargun method across the address set came up with 11000 pairs of
addresses representing 500 devices. I'm not entirely convinced of the
accuracy yet so I will run it a few more times and compare results, but it
has been correct for the addresses I have manually confirmed as being
I think I've solved my problem of direct loops being created by the smtp
state machine and the new machine is at least as good as the previous ones
on my initial test trace. Will need to test it on some more traces now.
Having trouble visualising the machine though, as dot won't create a graph
using the layout algorithm that I want.
Kept looking at generating better data for Chris to use in his maps. Wrote
some simple code to try to remove destinations that are obviously
international using latency measurements and hostnames. I err on the side
of caution though so there are still a few included that are quite likely
Got William set up with live access to json weathermap data so that he can
start testing his new weathermap in a more realistic fashion.
Some different visualisations of my smtp state machine graphs show some
nodes with a huge number of links, all at various stages of the protocol
that it probably shouldn't be able to transition between. These tend to
link to another node with a large number of transitions that also links
back, so tight loops are formed with many transitions. Direct loops should
be prevented from forming (they are being checked for at least), but it's
possible that check is missing a case or the loops are being formed as
side effects of merging others states. Need to keep wading through the
step by step merging output to see how this is coming about.
Collected some new traceroute datasets for Chris containing fewer
unresponsive hops. Still quite a significant number in the middle of
the path don't reply and it seems they probably won't due to the way they
are configured. May need some heuristics to try to merge unresponsive hops
that are probably the same device.
Saw an NZNOG post about web page loading times ("user experience") and
thought it might be interesting to do a bit more work with the web
download speed data we've been collecting on a couple of AMP monitors.
Wrote a script to convert the data from the http2 test into the HTTP
Archive (HAR) format so I could use their tools to generate waterfall
graphs similar to what firebug produces. This seems to be working fine
except that data from the Waikato monitor is showing sub-millisecond
connection times to websites hosted in the US. Spent some time looking
inside libcurl to see if the problem could be there, but it turns out that
it is the fortinet on the edge of the university network acting as a web
proxy that is breaking my results.
In trying to generate a nice state machine graph to use to illustrate some
points I decided that the merged tree looked really bad - it had large
numbers of transitions between the same pairs of nodes. Generated a lot of
graphs showing the distributions of object sizes across those transitions
to confirm that they were all distinct (in general they were).
After seeing the graphs Chris made with a small number of destinations and
a single testing site, I found a few more sources of addresses that are
used within New Zealand (though not necessarily assigned to NZ
organisations) and started generating a new test data set. The new dataset
will have data for 28,000 /24s from four sources once collection
completes. Will need to have a look and see if this resolution is
Also put together a bunch of sample data and databases for William to use
with the new weathermap. It mimics the KAREN set up, so hopefully he can
use that to write scripts to deal with getting the data into the right
Spent some time picking up the smtp state machine documentation again,
trying to get some images of example trees to show the merging process.