Libprotoident is a library that performs application layer protocol identification for flows. Unlike many techniques that require capturing the entire packet payload, libprotoident uses only the first four bytes of payload sent in each direction, the size of the first payload-bearing packet in each direction, and the TCP or UDP port numbers for the flow. Libprotoident features a very simple API that is easy to use, enabling developers to quickly write code that makes use of the protocol identification rules present in the library without needing to know anything about the applications they are trying to identify.
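To illustrate the idea (a hypothetical Python sketch of the approach, not the library's actual C API — `flow_features` and `looks_like_http` are invented names), a flow can be reduced to just these features and matched against simple rules:

```python
def flow_features(first_pkt_a: bytes, first_pkt_b: bytes, sport: int, dport: int):
    """Reduce a flow to the features libprotoident-style rules operate on:
    the first four payload bytes in each direction, the payload size of the
    first payload-bearing packet in each direction, and the port numbers."""
    return {
        "payload_a": first_pkt_a[:4],
        "payload_b": first_pkt_b[:4],
        "size_a": len(first_pkt_a),
        "size_b": len(first_pkt_b),
        "sport": sport,
        "dport": dport,
    }

def looks_like_http(f):
    # An example rule in this style: HTTP requests begin with a method
    # keyword, and typically target a well-known port.
    return f["payload_a"] in (b"GET ", b"POST", b"HEAD") and f["dport"] in (80, 8080)
```

The point is that nothing beyond those few bytes and sizes is ever needed, which is what keeps the approach cheap and privacy-friendly.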
Spent a little time reviewing my old YouTube paper in preparation for discussing it in 513.
Tracked down and fixed a few outstanding bugs in my new and improved anomaly_ts. The main problem was with my algorithm for keeping a running update of the median: a rather obscure bug when inserting a new value that fell between the two values I was averaging to calculate the median was causing all sorts of problems.
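The tricky case is exactly that routing decision. A minimal two-heap running median in Python (an illustrative sketch, not the anomaly_ts implementation) shows where a value between the two middle elements has to be routed and the heaps rebalanced:

```python
import heapq

class RunningMedian:
    """Running median via a max-heap (lower half, stored negated) and a
    min-heap (upper half). A new value falling between the two current
    middle values must be routed by comparing against the lower heap's
    top, then the heaps rebalanced -- the case that is easy to get wrong."""
    def __init__(self):
        self.lo = []  # max-heap of the lower half (values negated)
        self.hi = []  # min-heap of the upper half

    def insert(self, x):
        if self.lo and x < -self.lo[0]:
            heapq.heappush(self.lo, -x)
        else:
            heapq.heappush(self.hi, x)
        # Rebalance so len(lo) == len(hi) or len(lo) == len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2.0
```

For example, after inserting 10 and 20 the median is 15.0; inserting 14 (between the two averaged values) must land it as the new single middle element.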
Added an API to ampy for querying the event database. This will hopefully allow us to add little event markers on our time series graphs. Also integrated my code for querying data for Munin time series into ampy.
Churned out a revised version of my L7 filter paper for the IEEE Workshop on Network Measurements. I have repositioned the paper as an evaluation of open-source payload-based traffic classifiers rather than a critique of L7 filter. I also spent a fair chunk of time replacing my nice pass-fail system for representing results with the exact accuracy numbers because apparently reviewers found the former confusing.
Tried to continue my work in tidying up and releasing various trace sets, but ran into some problems with my rsyncs being flooded out over the faculty network. This was quite a nuisance so we need to be more careful in future about how we move traces around (despite it not really being our fault!).
Very short week this week, but managed to get a few little things sorted.
Added a new dataparser to NNTSC for reading the RRDs used by Munin, a program that Brad is using to monitor the switches in charge of our red cables. The data in these RRDs is a lot noisier than smokeping data, so it will be interesting to see how our anomaly detection goes with that data. Also finally got the AMP data actually being exported to our anomaly detector - the glue program that converted NNTSC data into something that can be read by anomaly_ts wasn't parsing AMP records properly.
Spent a bit of time working on adding some new rules to libprotoident to identify previously unknown traffic in some traces sent to me by one of our users.
Spent Friday afternoon talking with Brian Trammell about some mutual interests, in particular passive measurement of TCP congestion window state and large-scale measurement data collection, storage and access. In terms of the latter, it looks like many of the design decisions we have reached with NNTSC are very similar to those that he had reached with mPlane (albeit mPlane is a fair bit more ambitious than what we are doing), which I think was pretty reassuring for both sides. Hopefully we will be able to collaborate more in this space, e.g. developing translation code to make our data collection compatible with mPlane.
Exporting from NNTSC is now back to a functional state and the whole event detection chain is back online. Added table and view descriptions for more complicated AMP tests: traceroute, http2 and udpstream are now all present. Hopefully we can get new AMP collecting and reporting data for these tests soon so we can test whether it actually works!
Had some user-sourced libtrace patches come in, so spent a bit of time integrating these into the source tree and testing the results. One simply cleans up the libpacketdump install directory to not create as many useless or unused files (e.g. static libraries and versioned library symlinks). The other adds support for the OpenBSD loopback DLT, which is actually a real nuisance because OpenBSD isn't entirely consistent with other OS's as to the values of some DLTs.
Helped Nathan with some TCP issues that Lightwire were seeing on a link. Was nice to have an excuse to bust out tcptrace again...
Looks like my L7 Filter paper is going to be rejected. Started thinking about ways in which it can be reworked to be more palatable, maybe present it as a comparative evaluation of open-source traffic classifiers instead.
Made a few modifications to Brendon's detectors which make them perform better across a variety of AMP time-series. In particular, the Plateau detector no longer uses a fixed percentage of the trigger buffer mean as its event threshold - instead it uses several standard deviations from the history buffer mean. Also fixed a problem where, once in an event, we treated every subsequent measurement resembling those that triggered the event as anomalous. This is a problem in cases where the "event" is actually the time series moving to a new normality: our algorithm just kept us in the event state the whole time!
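The revised threshold can be sketched like this (a minimal Python illustration assuming a simple history/trigger buffer split and an arbitrary choice of 3 standard deviations; not Brendon's actual code):

```python
import statistics

def plateau_event(history, trigger, k=3.0):
    """Flag an event when the mean of the recent trigger buffer sits more
    than k standard deviations from the mean of the history buffer. This
    replaces a fixed-percentage-of-the-mean threshold, so the sensitivity
    adapts to how noisy each individual time series is."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return abs(statistics.fmean(trigger) - mean) > k * stdev
```

The advantage over a fixed percentage is that a series with naturally high variance needs a proportionally larger excursion before an event fires.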
Once I was happy with that, got the eventing code up and running against the events reported by the anomaly detection stage. Had to make a couple of modifications to the protocol used to communicate between the two to get it working properly (there were some hard-coded entries in Brendon's database that needed a more automated way of being inserted). Tried to get the graphing / visualisation stuff going after that, but there are quite a few issues there so that may have to wait a bit.
Started looking into packaging and documenting the usage of all the tools in the chain that we've now got working. First up was Nathan's code, which is proving a bit tricky so far because a) it's python so no autotools and b) his code is rather reliant on other scripts being in certain locations relative to the script being run.
Added another protocol to libprotoident: League of Legends.
Spent a day messing around with the event detection software, mainly seeing how Brendon's detectors work with the existing AMP data. The new "is it constant" calculation seems to be working reasonably well, but there are still a lot of issues with some of the detectors. Need to spend a bit of uninterrupted time with it to really see how it all works.
Had a quick look at the latest ISP traces with libprotoident to see if there are any obvious missing protocols I can add to the library. Added one new protocol (Minecraft) and tweaked a few existing protocols.
Spent the rest of the week at NZNOG, catching up on the state of the Internets. Most of the talks were pretty interesting and it was good to meet up with a few familiar faces.
Decided to replace the PACE comparison in my L7 Filter paper with Tstat, a somewhat well-known open-source program that does traffic classification (along with a whole lot of other statistic collection). Tstat's results were disappointing - I was hoping they would be a lot better so that the ineptitude of L7 Filter would be more obvious, but I guess this does make libprotoident look even better.
Fixed a major bug in the lpicollector that was causing us to insert duplicate entries in our IP and User maps. Memory usage is way down now and our active IP counts are much more in line with expectations. Also added a special PUSH message to the protocol so that any clients will know when the collector is done sending messages for the current reporting period.
Spent a fair chunk of time refining Nathan's code to a) just work as intended, b) be more efficient and c) be more user-friendly / deployable. I've got it reading data properly from LPI, RRDs and AMP and exporting data in an appropriate format for our event detection code to be able to read.
Started toying with using the event detection code on our various inputs. Have run into some problems with the math used to determine whether a time series is relatively constant or not - this is used to determine which of our detectors should be run against the data.
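One plausible form for such a constancy test (an assumption for illustration, not the actual math in question) is a coefficient-of-variation check:

```python
import statistics

def is_roughly_constant(series, cv_threshold=0.1):
    """Decide whether a time series is 'relatively constant' by comparing
    the coefficient of variation (stddev / mean) against a threshold.
    The threshold value here is an arbitrary example. Noisy inputs like
    Munin RRD data would fail this test and be routed to detectors that
    cope better with variable series."""
    mean = statistics.fmean(series)
    if mean == 0:
        return statistics.pstdev(series) == 0
    return statistics.pstdev(series) / abs(mean) < cv_threshold
```

The hard part in practice is choosing a threshold that behaves sensibly across inputs with wildly different scales, which is roughly where my problems were.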
Got the bad news that the libprotoident paper was rejected by TMA over the weekend. A bit disappointed with the reviews - felt like they were too busy trying to find flaws with the 4-byte approach rather than recognising the results I presented that showed it to be more accurate, faster and less memory-intensive than existing OSS DPI classifiers. Regardless, it is back to the drawing board on this one - looks like it might be the libtrace paper all over again.
Spent most of my week working with Meenakshee's LPI collector. The first step was to move it out of libprotoident and into its own project, complete with trac - this meant that future libprotoident releases are not dependent on the collector being in a usable state. Added support to the collector to track the number of local IP addresses "actively" using a given protocol. This is in addition to the current counter that simply looks at the number of local IP addresses involved in flows using a given protocol - an IP receiving BitTorrent UDP traffic but not responding would not count as actively using the protocol (i.e. the new counter), but would count as having been involved in a flow for that protocol (i.e. the old counter).
After meeting with Lightwire, it was suggested that a LPI collector that could give a protocol breakdown per customer would be very useful. As a result, I added support for this to the collector. In terms of the increased workload, the actual collection process seems to manage ok, but exporting this data over the network to Nathan's database client doesn't work so well.
Added some basic transaction support to Nathan's code, so that all of the insertions from the same LPI report are now inserted using a single transaction. Ideally, though, we need to be able to create transactions that cover multiple LPI reports - perhaps by extending the LPI protocol to be able to send some sort of "PUSH" message to the client to indicate that a batch of reports is complete.
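The batching idea can be sketched as follows (illustrative Python using sqlite3 so it is self-contained; the message shapes and table are invented for the example, and the real client talks to postgresql):

```python
import sqlite3

def consume_reports(conn, messages):
    """Accumulate insertions from a stream of LPI report messages inside
    one open transaction, committing only when a "push" message signals
    that a complete batch of reports has been delivered."""
    cur = conn.cursor()
    for msg in messages:
        if msg["type"] == "report":
            cur.execute(
                "INSERT INTO lpi_stats (protocol, metric, value) VALUES (?, ?, ?)",
                (msg["protocol"], msg["metric"], msg["value"]),
            )
        elif msg["type"] == "push":
            conn.commit()  # one transaction per completed batch of reports
```

This is why the protocol-level "PUSH" marker matters: without it, the client has no way to know where a safe commit point falls.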
Went over the collector with callgrind to find bottlenecks and suboptimal code. Found a number of optimisations that I could make in the collector, such as caching the name strings and lengths for the supported protocols rather than asking libprotoident for them each time we want to use them. I also had a frustrating battle with converting my byteswap64 function into a macro - got there in the end thankfully.
Finished up the draft of my L7 Filter paper.
Just a lonely two day week while everyone else was still on holiday.
Released a new version of libtrace (3.0.16) - now Richard's ring buffer code is out amongst the wide world and hopefully our users won't find too many bugs in it.
Got back into writing my paper on L7 Filter. Most of the content is there now, although I'm not entirely convinced that the way I have structured the paper is quite right. It's much more readable the way I have it now, but it looks more like a bulleted list than a technical paper.
Meenakshee's LPI collector worked pretty well running on some trace files over the break, which was pleasing. Next step is to get it working on our newly functional ISP capture point. Tested the capture point out by running some captures over the weekend - aside from a bug in the direction tagging everything looks good, so we have at least one working capture point.
Started writing a paper on my L7 Filter results - managed to get through an introduction and background before running out of steam.
Developed a module for Nathan's data collector that connects to Meena's LPI collector, receives data records, parses them and writes appropriate entries into a postgresql database. Ran into a bit of a design flaw in Nathan's collector - streams (i.e. the identifying characteristics for a measurement) have to be pre-defined before starting the collector. This doesn't work too well with LPI, where there are 250 protocols x 10 metrics x however many monitors one is running. Even worse, the number of protocols will grow with new LPI releases and we don't want to have to stop the collector to add code describing the resulting new streams.
Managed to hack my way around Nathan's code enough to add support for adding new streams whenever a new protocol / metric / monitor combination is observed by my module. Seems to work fairly well (at the second attempt - the first one ran into horrible concurrency problems due to a shared database connection).
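The shape of that fix can be sketched like so (illustrative Python; the class and method names are invented, not Nathan's actual API — the essential points are the lock and the on-demand allocation):

```python
import threading

class StreamRegistry:
    """Allocate stream ids on demand: the first time a
    (protocol, metric, monitor) combination is observed, a new stream is
    registered under a lock so that concurrent modules cannot race and
    create duplicates -- the concurrency problem the first attempt hit
    by sharing a single database connection without synchronisation."""
    def __init__(self):
        self._lock = threading.Lock()
        self._streams = {}
        self._next_id = 1

    def stream_id(self, protocol, metric, monitor):
        key = (protocol, metric, monitor)
        with self._lock:
            if key not in self._streams:
                self._streams[key] = self._next_id
                self._next_id += 1
            return self._streams[key]
```

With this approach, a new libprotoident release adding protocols simply results in new streams appearing at runtime, with no collector restart needed.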
Tried deploying the LPI collector at our ISP box, only to find that they've been playing with their core network a lot recently and now we don't see any useful traffic :(
Managed to get native BPF socket capture exporting correctly over the RT protocol. Changed the build system to make it possible to export captures taken using a native socket interface over RT to a machine running a different OS to the capture host, e.g. capture using Linux Native, export to a FreeBSD box.
WDCap now builds and runs on both Mac OS X and FreeBSD. Also changed the way the disk output module names files, based on some code submitted by Alistair King. You now specify your output filename format using strftime-style conversion modifiers, which offers a bit more flexibility to users rather than them being stuck with our particular file naming convention.
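For example, the strftime-style convention means an output naming scheme can be expressed as an ordinary format string (a Python illustration of the idea; the format shown is just an example, not WDCap's default):

```python
import time

def output_filename(fmt, ts):
    """Expand a strftime-style filename format for a capture file rotated
    at timestamp ts (seconds since the epoch, interpreted as UTC here)."""
    return time.strftime(fmt, time.gmtime(ts))
```

So a user who wants daily directories or a different extension just changes the format string rather than patching the disk output module.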
Continued working closely with Meenakshee on the new collector. Designed a binary format for exporting our collector messages called the libprotoident collector protocol (or LPICP for short).
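A binary export protocol of this kind usually boils down to a fixed header plus a length-prefixed body. The layout below is purely hypothetical (the real LPICP wire format is defined by the collector, not here), but shows the general pattern:

```python
import struct

# Hypothetical header for illustration only: one-byte version, one-byte
# message type, two reserved bytes, four-byte body length, network order.
HDR = struct.Struct("!BBHI")

def encode_msg(version, msgtype, body: bytes) -> bytes:
    return HDR.pack(version, msgtype, 0, len(body)) + body

def decode_msg(buf: bytes):
    version, msgtype, _, length = HDR.unpack_from(buf)
    return version, msgtype, buf[HDR.size:HDR.size + length]
```

Keeping the header fixed-size and the length explicit makes the stream easy to parse incrementally on the client side.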
Finished collecting traces for most of the protocols I wanted to test with L7 Filter and collated the initial results. Wrote a blog post about it (https://secure.wand.net.nz/content/case-against-l7-filter) and started working on a paper.
L7 Filter is used as a source of ground truth in the traffic classification field because it has been around for a long time and is widely known. However, my experiences with L7 Filter had raised a few questions in my mind with regard to its accuracy. After looking online, I did not find any evidence that L7 Filter is actually an accurate or reliable traffic classifier. In this blog post, I present some preliminary results from my own investigation into the correctness (or lack thereof) of L7 Filter's classifications using packet traces containing traffic for only a single known application.
Back into the swing of things this week. Continued collecting traces of various popular Internet applications to use for validating L7 Filter. So far, L7 Filter is very disappointing - it cannot even correctly classify some basic HTTP flows and often misclassifies SSL traffic as Skype.
Worked with Meenakshee to develop a proper LPI collector that we can run on passive monitors and write live application stats to a database (ideally using Nathan's code). The new collector will use libwandevent and export its results over the network rather than via stdout. To help with this, I extracted the counter / statistic management code from the old lpi_live tool and tidied it up for more general purpose use. Updated lpi_live to use the extracted code.
Spent my spare moments looking over Richard's new ring buffer code for Linux native interfaces in libtrace. In particular, my aim has been test it in situations outside of the standard libtrace paradigm, e.g. using trace_event(), trace_copy_packet() and exporting over the RT protocol.
Alistair from CAIDA has updated libtrace and wdcap for capturing using the BSD native interface (something we never did, so the code was missing or half-assed). I've started integrating his changes back into both code-bases and will also look at the problem of decoding RT packets that were captured using a native interface that is not supported by the recipient machine, e.g. BPF packets exported to a Linux host.
The week before I left for IMC:
* Finished my draft of the libprotoident paper for TMA. Because of the broken Auckland box, I wasn't able to re-run my analysis using the more up-to-date classification software. Instead, I've just submitted a draft based on the old results, with an eye to possibly updating them should we get accepted.
* Released a new version of libprotoident including all the new protocol rules that I'd added over the past couple of weeks.
* Started working on a little project to measure exactly how hopeless L7 Filter is for traffic classification. So many papers and tools use L7 Filter as either the basis for their rules or as ground truth for validation, which I think is a very bad idea. Hoping to get a paper out of it all. The initial phase of my evaluation involves capturing traffic from a number of common Internet applications and testing whether L7 Filter can correctly identify them. So far, it has managed to get 1/3 right :)
Spent the week before last in Boston for IMC. Managed to successfully present my paper on the Copyright Amendment Act and got a fairly good reception. Also got a chance to meet a few folks and put some faces to names. Some of the presentations were interesting, but there was also a lot of stuff that I found to be less useful (social networks lol).
Libprotoident 2.0.6 has been released today.
This release adds support for 17 new protocols including Spotify, Runescape, Cryptic and Apple Facetime. The rules for a further 7 protocols have been improved.
This release also fixes a couple of bugs - in particular one where lpi_live would report erroneously high packet or byte counts.
We've also deprecated the P2P_Structure category as it was no longer serving the intended purpose due to the rise in BitTorrent file transfers over UDP that are indistinguishable from DHT traffic. All protocols that used to be P2P_Structure are now placed in the P2P category.
The full list of changes can be found in the libprotoident ChangeLog.
Short week this week.
Managed to add a couple more protocols to libprotoident: SUPL and Cryptic (an MMO game company). Spent a lot of time still trying to hunt down the particular Korean P2P application that I'm seeing a lot of in my data, but no success. Nonetheless, I've written a rule for it and added it to our set of "mystery" protocols.
Started looking over our old libprotoident technical report with an eye to submitting it for publication again. There are a few problems with this approach though: 1) OpenDPI doesn't exist anymore. A fork called nDPI lives on, but I'll need to re-run all the validation/comparison tests using nDPI. 2) nDPI uses all the same function and variable names as PACE, so these all had to be renamed to prevent horrible linking errors when building / running my comparison program, which links against both libraries. 3) The Auckland monitor that has the only copy of the full-payload traces I had used for part of the original validation is no longer responsive.
Finished up my basic analysis of the libprotoident data from last month. Wrote a blog post (that's on the front page of the website) presenting and discussing the latest results. Some pretty interesting trends are becoming apparent - the surge in HTTPS traffic and the movement towards UDP BitTorrent being the two main ones - which are begging for further investigation.
Continued looking at unknown traffic in libprotoident -- spent much of Friday investigating Korean P2P apps to try and resolve a mystery application that has a very obvious payload pattern, but had little success. Did get to watch a few Starcraft championship games though :)
Wrote and presented a practice version of my IMC talk. Got a few refinements to make but mostly I need to streamline the whole thing so I can deliver it in around 10 minutes without sounding like I'm hyped up on amphetamines.
Spent a fair chunk of my week reading over various chapters from Brad and Joe's Honour's reports, as well as Meenakshee's interim report.
In between times, continued poking at my recent libprotoident analysis looking at the "unknown" traffic. Managed to add quite a few new protocols to libprotoident as a result, including Runescape, Spotify, Fring, Roblox and FASP. Starting to think about a new release with all the protocols I've added over the past couple of weeks.
Also continued my analysis of the September LPI statistics - getting closer to producing some graphs and a blog post discussing the changes over the past year :)
Short week this week - took leave on Thursday and Friday.
Released a new version of libtrace (3.0.15) on Monday. Mostly just a few little bug and build fixes, but it had been a while since the last release. Also submitted a patch for the FreeBSD libtrace port which had been broken for a very long time.
Did a bit more refinement on my Plunge and ArimaShewhart event detectors. They're at a stage now where the number of false positives is close to none. False negatives are a bit harder to identify, of course. The next sensible step is probably to think about testing against real-time data and manually validate the events as they roll in.
Spent a day looking at the latest LPI data from a live analysis I have running on our ISP monitor. Managed to get some up-to-date stats on application usage for last September but haven't had a chance to look over it in detail yet.
I did note a bit of an increase in the amount of unknown UDP traffic, so chased up a few of the more common patterns. Have added 3 new protocols to libprotoident as a result: ZeroAccess (a trojan), VXWorks Exploit and Apple's Facetime / iMessage setup protocol.
At present, accurate traffic classification requires the use of deep
packet inspection to analyse packet payload. This requires significant
CPU and memory resources and is invasive of network user privacy. In this
paper, we propose an alternative traffic classification approach that is
lightweight and only examines the first four bytes of packet payload observed
in each direction. We have implemented our approach as an open-source library
called libprotoident, which we evaluate by comparing its performance against
existing traffic classifiers that use deep packet inspection. Our results show
that our approach offers accuracy comparable to (if not better than) tools that
have access to full packet payload, while requiring fewer processing resources.
This is simply a technical report, not a published conference or journal paper. We're hoping to publish an improved version of this paper soon, but mainly need to improve the validation process to be more convincing to external reviewers.
Managed to master the art of wavelet transforms - the problems I was having were due to mismatching the scale and wavelet values when inverting the transformation. After a lot of debugging, I was able to ensure that I could reliably transform my data and then invert it back to the same original values for any given number of nested transformations. Once that was working, I was able to get sensible results when denoising my time series.
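The invertibility property I was chasing is easy to demonstrate with the simplest wavelet. A minimal (unnormalised) Haar step in Python round-trips exactly, provided the approximation (scale) and detail (wavelet) coefficients stay correctly paired when inverting - exactly the pairing my bug was getting wrong:

```python
def haar_step(data):
    """One level of an unnormalised Haar wavelet transform: pairwise
    averages form the approximation (scale) coefficients and halved
    pairwise differences form the detail (wavelet) coefficients."""
    approx = [(a + b) / 2.0 for a, b in zip(data[0::2], data[1::2])]
    detail = [(a - b) / 2.0 for a, b in zip(data[0::2], data[1::2])]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one Haar step; each (approx, detail) pair must line up."""
    out = []
    for s, d in zip(approx, detail):
        out.extend([s + d, s - d])
    return out
```

Nesting works the same way: transform the approximation coefficients again, then invert level by level, reattaching each level's detail coefficients in order.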
Now that I had a denoised time series, I turned back to looking at forecasting techniques. Holt-Winters still wasn't a good fit for the denoised data, so I started learning about ARIMA models. Unfortunately, the test data I have doesn't really fit the basic ARIMA models, which made it difficult to get the right fit. Anyway, I now have a decent understanding of how ARIMA works in general, but need to come up with a way to use the ARIMA model in an on-line, self-updating context.
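As a toy illustration of the "on-line, self-updating" direction (one possible approach I'm assuming for the sketch, not a full ARIMA implementation), a lag-1 autoregressive coefficient can be re-estimated from running sums as each new observation arrives:

```python
class OnlineAR1:
    """Toy self-updating AR(1) forecaster: the lag-1 coefficient phi is
    the least-squares estimate sum(x[t-1]*x[t]) / sum(x[t-1]^2), kept as
    running sums so each new observation updates the fit in O(1)."""
    def __init__(self):
        self.prev = None
        self.sxy = 0.0  # running sum of x[t-1] * x[t]
        self.sxx = 0.0  # running sum of x[t-1]^2

    def update(self, x):
        if self.prev is not None:
            self.sxy += self.prev * x
            self.sxx += self.prev * self.prev
        self.prev = x

    def forecast(self):
        phi = self.sxy / self.sxx if self.sxx else 0.0
        return phi * self.prev
```

A real on-line ARIMA would also need differencing and a moving-average component, but the same running-sum trick is the core of keeping the model up to date without refitting from scratch.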
Released libprotoident 2.0.5 on Friday, mainly as something to do so I could have a break from mathematics for a bit :)