Storage of Network Monitoring and Measurement Data
I am designing and building a system to allow for storage and retrieval of large amounts of network measurement and monitoring data. I need to make a flexible system that can deal with a wide range of data, such as polled data and flow data, while being fast enough to cope with high rates of live network data. The end goal is to feed this information to an anomaly-detecting algorithm that can detect changes in the network and alert system administrators to the exact problem, as well as presenting the information as graphs via a web interface.
Another short week, due to being away on Tuesday and Wednesday.
Started writing up a decent description of the design and implementation of NNTSC, which would hopefully make for a decent blog post. It also means that the entire thing is stored somewhere other than in my head...
Revisited the eventing side of our anomaly detection process. Had a long but eventually productive discussion with Brendon about what information needs to be stored in the events database to be able to support the visualisation side. We decided that, given the NNTSC query mechanism, events should have information about the collection and stream that they belong to so that we can easily filter them based on those parameters. We used to use "source" and "destination" for this, but streams are defined using more than just a source and destination now.
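As a sketch of the sort of schema this implies (the column names here are illustrative, not the final design), each event row now carries the collection and stream it belongs to rather than a source/destination pair:

```python
# Illustrative sketch only -- not the final schema. The key point is that
# each event is tied back to its collection and stream, so the visualisation
# side can filter events using the same parameters as NNTSC queries.
from sqlalchemy import Table, Column, Integer, String, MetaData

metadata = MetaData()

events = Table('events', metadata,
    Column('event_id', Integer, primary_key=True),
    Column('collection_id', Integer, nullable=False),  # which collection (e.g. smokeping)
    Column('stream_id', Integer, nullable=False),      # which stream within that collection
    Column('timestamp', Integer, nullable=False),      # when the event occurred
    Column('severity', Integer),
    Column('description', String),
)
```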
Updated anomalyfeed, anomaly_ts and eventing to support the new info that needs to be exported all the way to the eventing program. In the process, I moved eventing into the anomaly_ts source tree (because they shared some common header files) and wrangled automake into building them properly as separate tools. Got to the stage where everything was building happily, but not running so good :(
Very short week this week, but managed to get a few little things sorted.
Added a new dataparser to NNTSC for reading the RRDs used by Munin, a program that Brad is using to monitor the switches in charge of our red cables. The data in these RRDs is a lot noisier than smokeping data, so it will be interesting to see how our anomaly detection goes with that data. Also finally got the AMP data actually being exported to our anomaly detector - the glue program that converted NNTSC data into something that can be read by anomaly_ts wasn't parsing AMP records properly.
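For reference, the core of an RRD parser using the python rrdtool bindings is fairly compact; a minimal sketch, with a made-up filename and time range:

```python
# Minimal sketch of pulling time series data out of a Munin RRD with the
# python rrdtool bindings. The file path and timestamps are placeholders.
import rrdtool

(start, end, step), ds_names, rows = rrdtool.fetch(
    '/var/lib/munin/switch-if_octets.rrd',  # hypothetical RRD path
    'AVERAGE',                              # consolidation function
    '--start', str(1370000000),
    '--end', str(1370086400))

ts = start
for row in rows:
    # each row holds one value per data source; None means no data for that bin
    print(ts, dict(zip(ds_names, row)))
    ts += step
```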
Spent a bit of time working on adding some new rules to libprotoident to identify previously unknown traffic in some traces sent to me by one of our users.
Spent Friday afternoon talking with Brian Trammell about some mutual interests, in particular passive measurement of TCP congestion window state and large-scale measurement data collection, storage and access. In terms of the latter, it looks like many of the design decisions we have reached with NNTSC are very similar to those that he had reached with mPlane (albeit mPlane is a fair bit more ambitious than what we are doing), which I think was pretty reassuring for both sides. Hopefully we will be able to collaborate more in this space, e.g. developing translation code to make our data collection compatible with mPlane.
Exporting from NNTSC is now back to a functional state and the whole event detection chain is back online. Added table and view descriptions for more complicated AMP tests; traceroute, http2 and udpstream are now all present. Hopefully we can get new AMP collecting and reporting data for these tests soon so we can test whether it actually works!
Had some user-sourced libtrace patches come in, so spent a bit of time integrating these into the source tree and testing the results. One simply cleans up the libpacketdump install directory to not create as many useless or unused files (e.g. static libraries and versioned library symlinks). The other adds support for the OpenBSD loopback DLT, which is actually a real nuisance because OpenBSD isn't entirely consistent with other OSes as to the values of some DLTs.
Helped Nathan with some TCP issues that Lightwire were seeing on a link. Was nice to have an excuse to bust out tcptrace again...
Looks like my L7 Filter paper is going to be rejected. Started thinking about ways in which it can be reworked to be more palatable, maybe present it as a comparative evaluation of open-source traffic classifiers instead.
Turns out that once again, the current design of NNTSC didn't quite meet all of the requirements for storing AMP data. The more complicated traceroute and HTTP tests needed multiple tables for storing their results, which wasn't quite going to work with the "one stream table, one data table" design I had implemented.
Managed to come up with a new design that will hopefully satisfy Brendon while still allowing for a consistent querying approach. Implemented the data collection side of this, including creating tables for the traceroute test. This was a bit trickier than planned, because SQLAlchemy doesn't natively support views and the traceroute view was rather complicated.
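The usual workaround for SQLAlchemy's lack of a first-class view construct is to attach raw DDL that runs after the underlying tables are created; a minimal sketch, with a placeholder view definition far simpler than the real traceroute one:

```python
# Sketch of creating a view through SQLAlchemy by hooking raw DDL onto
# the metadata's after_create event. Table and column names are invented.
from sqlalchemy import MetaData, DDL, event

metadata = MetaData()

create_view = DDL("""
    CREATE VIEW vw_traceroute AS
    SELECT stream_id, timestamp, path_length
    FROM data_amp_traceroute
""")

# runs once the underlying tables have been created
event.listen(metadata, 'after_create', create_view)
```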
Currently working on updating the exporting side to use id numbers to identify collections rather than names, since there is no longer any guarantee that the data will be located in a table called "data_" + module + module_subtype.
Also spent a fair bit of time reading over Meenakshee's report and covering it in red pen. Pretty happy with how it is coming together.
Short week this week, as I spent Thursday and Friday in Wellington at the cricket.
Wrote an assignment on libtrace for 513, along with some "model" answers.
Continued reading and editing Meenakshee's report.
Had a vigorous discussion with Brendon about what he needs the NNTSC export protocol to do to support his AMP graphing needs. Turns out the protocol needs a couple of new features, namely binning/aggregation and a no-more-data indicator, which I started working on adding. So far, this has mostly involved taking some of the working code from my anomaly detector feeder program, which is an NNTSC client, and turning it into an NNTSC client API.
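The API is shaping up along these lines; a hypothetical sketch, with names and signatures that are illustrative only, showing where the two new features fit:

```python
# Hypothetical sketch of the NNTSC client API -- the module, class and
# method names are invented for illustration. The two new protocol
# features appear as the binsize parameter and the no-more-data sentinel.
from nntscclient import NNTSCClient   # hypothetical module name

client = NNTSCClient('nntsc.example.org', 61234)
client.subscribe(collection=1, streams=[42],
                 start=1370000000, end=1370086400,
                 binsize=300)          # aggregate measurements into 5 minute bins

for msg in client.messages():
    if msg.type == 'NO_MORE_DATA':
        break                          # history is exhausted; live data would follow
    print(msg.data)
```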
Put out a request to our past students for their Honours reports so that they can be published on the website. Thanks to those who have responded.
Despite the limitations of current network monitoring tools, there has been little investigation into providing a viable alternative. Network operators need high-resolution data over long time periods to make informed decisions about their networks. Current solutions discard data or do not provide it in a practical format. This report explores the development of a new solution to these problems.
Added a data parser module to NNTSC to process the tunnel user count data that we got from Lightwire. Managed to get the data going all the way through to the event detection program which spat out a ton of events. Spent a bit of time combing through them manually to see whether the reported events were actually worth reporting -- in a lot of cases they weren't, so I've refined the old Plateau and Mode algorithms a bit to hopefully resolve the issues. I also employed the Plunge detector on all time series types, rather than just libprotoident data, and this was useful in reporting the most interesting behaviours in the tunnel user data (i.e. all the users disappearing).
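The gist of the Plunge detector is checking whether the latest measurement has dropped to a small fraction of the recent mean; a stripped-down sketch of the basic idea (the real detector has more state and smoothing than this):

```python
from collections import deque

class PlungeDetector:
    """Toy version of a plunge detector: flag an event when the latest
    value drops below some fraction of the recent average. The real
    detector is more careful about noise and recovery than this sketch."""

    def __init__(self, history=20, threshold=0.2):
        self.window = deque(maxlen=history)
        self.threshold = threshold

    def update(self, value):
        plunged = False
        if len(self.window) == self.window.maxlen:
            mean = sum(self.window) / len(self.window)
            # e.g. all the tunnel users disappearing at once
            plunged = mean > 0 and value < mean * self.threshold
        self.window.append(value)
        return plunged
```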
Added a couple of new features to the libtrace API. The first was the ability to ask libtrace to give you the source or destination IP address as a string. This is quite handy because normally processing IP addresses in libtrace involves messing around with sockaddrs which is not particularly n00b-friendly. The second API feature was the ability to ask libtrace to calculate the checksum at either layer 3 or 4 based on the current packet contents. This was already done (poorly) inside the tracereplay tool, but is now part of the libtrace API. This is quite useful for checksum validation or if you've modified the packet somehow (e.g. modified the IP addresses) and want to recalculate the checksum to match.
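The checksum itself is the standard ones'-complement sum from RFC 1071, used at both layer 3 (the IP header) and layer 4 (TCP/UDP, with a pseudo-header). libtrace implements it in C, but the algorithm is small enough to show in a few lines of Python for reference:

```python
def internet_checksum(data):
    """RFC 1071 ones'-complement checksum over a byte string, as used
    by IP, TCP and UDP. Python version for illustration only; the
    libtrace implementation is in C."""
    if len(data) % 2:
        data += b'\x00'                           # pad to an even length
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # sum 16-bit words
        total = (total & 0xffff) + (total >> 16)  # fold the carry back in
    return ~total & 0xffff
```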
Also spent a decent bit of time reading over chapters from Meenakshee's report and offering plenty of constructive criticism.
The NNTSC export protocol is complete now and happily exports live data to any clients that have subscribed to data streams that are being collected. Using this, I've been able to get the anomaly detection tool chain working with our SmokePing data right up to the eventing phase. Fixed a minor bug in the eventing code that would result in badly-formed event groups if the events do not strictly arrive in chronological order (which can happen if you are working with multiple streams of historical data).
Fixed a few libtrace bugs this week - the main one being trace_event being broken for int: inputs. It was just a matter of the callback function being registered inside the wrong #ifdef block but took a little while to track down.
Spent the latter part of my week tidying up my libtrace slides in preparation for a week of teaching 513 later this month.
The new NNTSC now supports the LPI collector and installs nicely. Still waiting on Brendon to get his AMP message decoding code finished to his satisfaction and that will be the next thing to get working on the data collection side.
Also started developing a new data query / export mechanism that allows clients to connect and register their interest in particular data streams to receive ongoing live data as it is collected by NNTSC. The old approach for this involved the client explicitly stating the ID numbers for the streams they wanted data for, which was pretty suboptimal because it required knowledge that should really be internal to the database.
The other problem is that, now that we have a different data table layout for each type of data stream, we need to inform clients about that structure and how to interpret the data they receive.
All of this meant that I've had to design and implement an entirely new protocol for NNTSC data export. It's a request/response based protocol - the client can request the list of collections (i.e. the different data stream types), details about a specific collection (i.e. how the stream and data tables are laid out) and the list of streams associated with a given collection. It can then subscribe to a given stream, giving a start and end time for the time period required. If the time period includes historical data, that is immediately extracted from the database and sent. If the time period includes future data, the client and stream ID are remembered so that new data can be sent to the client as it rolls into NNTSC.
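To give a flavour of it, here is a sketch with invented message codes and field layouts (the actual wire format differs):

```python
# Sketch of the flavour of the export protocol -- the message codes and
# header layout here are made up for illustration, not the real format.
import struct

HEADER = struct.Struct('!BI')   # 1 byte message type, 4 byte body length

REQ_COLLECTIONS = 1   # list all collections
REQ_SCHEMA      = 2   # describe stream/data tables for one collection
REQ_STREAMS     = 3   # list streams belonging to a collection
SUBSCRIBE       = 4   # request data for a stream over a time period

def build_subscribe(stream_id, start, end):
    # If 'end' is in the future, the server keeps the subscription open
    # and pushes new measurements to the client as they arrive.
    body = struct.pack('!III', stream_id, start, end)
    return HEADER.pack(SUBSCRIBE, len(body)) + body
```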
Currently at the stage where we're getting historical data out of the database and preparing to send it to the client, with live data being the next main task to do.
Well the year is pretty much over now. Just submitting the last few assignments.
520 Report hand-in went well. I gauged the time well and happily submitted on time. Although my word count was lower than everyone else's, I feel the reduction in words was made up for by a stronger focus and by keeping my arguments concrete. I guess I'll find out if that was wise in due course.
It's been an awesome year being part of WAND and everything that entails. Thanks to everyone involved and good luck in the future.
Been working pretty much constantly on my report for the past few weeks now. At this point I have a bit of tidying up before I write my conclusion and hand it in on Tuesday. Overall I'm pretty happy with the report. I've received really good feedback from Brendon and Scott regarding how to improve and clean up my writing. At this point I'm on schedule to finish on time.
Pretty full on with assignments this week and starting to work through writing my report. I hacked up a webpage, with a little help from Joe, so people can see our progress: http://wand.net.nz/~no15/index.html . Hoping to get most of the WAND 520 students on there too.
Had a play with pypy (a JIT compiler for Python). Should be handy for getting more performance out of my collector. Also thought about creating an importer specifically for bulk imports, as running the collector can be slow when importing existing data.
Another busy week with assignments. It's getting to that crunch time of the year, and what with the 520 final write-up I suspect it's going to be a pretty hard slog.
Attended the HPC Roadshow on Thursday along with Brad, Brendon, and Shane. We met up with Jamie there too. Some interesting talks and it's good to build knowledge about subjects outside my usual domain.
Had my honours presentation this week. Overall I was pretty happy with how things went. The feedback I received during my practice talk made a major difference to the final presentation. I'd like to thank Richard (my supervisor), Scott (from Lightwire), Tony, Brendon, Shane, and all the other WAND people who helped critique my original presentation.
Looking at my 520, I now need to find some time to finish up the final major component: the connector. Then testing and benchmarking to see what I need to do performance-wise. Making some rough notes for the final report is probably not a bad idea either.
Spent last week preparing for my 520 presentation. Practice talks went well.
Last week wasn't very productive. I was mostly tied up with lectures and revising. I did give Shane my project to start using. He quickly got the hang of how things worked and wrote a module to push data into a database. Hopefully I'll find some time to do some initial benchmarking on reading the data to see how things actually perform.
Didn't get any 520 work done last week. Pretty flat out playing catch up at the moment after my time off. Slowly getting back on top of assignments and lectures.
Haven't done a weekly report in a while due to being sick. Luckily I kept working on my 520 so I'm not horribly behind.
Firstly I made some pretty dramatic changes to the main code structure to improve performance. This seems to have paid off, improving not only the performance but also the flow of the internal code.
All the database implementation is done now. I've made all the module code as agnostic as possible so any module can be loaded without changes to the main code base. In the end I used SQLAlchemy for the database abstraction layer. The framework is incredibly powerful and makes it easy to add other features at a later date.
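Keeping the modules agnostic mostly comes down to importing them by name and only assuming a small common interface; roughly this, with invented module names and entry point:

```python
# Rough sketch of module-agnostic loading: modules are imported by name
# and only assumed to provide one common entry point. The module names
# and the 'run' hook are illustrative, not the actual interface.
import importlib

def load_modules(names):
    modules = []
    for name in names:
        mod = importlib.import_module(name)
        # every module must expose the same entry point; nothing else
        # about its internals matters to the main code base
        assert hasattr(mod, 'run'), '%s has no run() entry point' % name
        modules.append(mod)
    return modules

# e.g. load_modules(['modules.amp', 'modules.smokeping'])
```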
The forwarding of the data had to be re-engineered when I restructured the main code so this needs to be re-implemented. Hopefully that's not going to be too much of a task at this point.
Due to being sick I have a fair amount of University work to catch up on so I'm not sure how much I'll be able to get done next week.
Started pulling things together over the past few weeks. The modules are now all integrated into the main application. Some of the forking and threading code needs some TLC but things are at least functional at this point. The main problem is scalability. I'm starting a thread for each data source, which falls over when you follow a few hundred data sources. I need to be more clever about identifying resources and ensuring I'm not reading from them more than once for multiple streams. I may also require some changes/additions to the AMP API to allow me to request multiple data items in a single request.
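The fix I have in mind is to key readers on the underlying resource, so each resource gets a single thread no matter how many streams want data from it; a sketch of the idea:

```python
import threading

class ResourceRegistry:
    """Sketch of deduplicating readers: one thread per unique resource,
    fanning results out to every stream subscribed to it. Names and
    structure are illustrative, not the actual implementation."""

    def __init__(self):
        self.readers = {}     # resource id -> list of subscriber callbacks
        self.lock = threading.Lock()

    def subscribe(self, resource_id, read_fn, callback):
        with self.lock:
            if resource_id in self.readers:
                # resource is already being read -- just add a subscriber
                self.readers[resource_id].append(callback)
                return
            self.readers[resource_id] = [callback]
        t = threading.Thread(target=self._reader, args=(resource_id, read_fn))
        t.daemon = True
        t.start()

    def _reader(self, resource_id, read_fn):
        for item in read_fn():
            # copy the subscriber list so new subscribers can be added
            # safely while we iterate
            for cb in list(self.readers[resource_id]):
                cb(item)
```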
Next Brendon wants me to stream the data (from multiple sources) in chronological order. I'm restructuring the code so I can access the timestamps together, however I've been sick on and off this week which hasn't helped progress. After that I'll try and get the database implementation going.
This week I didn't manage any more on my 520. I was ill on Tuesday which pretty much put me out of action for a few days. So in the end I barely got anything done all week.
Last week was more productive, however. I managed to get a daemon going which accepts connections on a unix domain socket. At this point it expects an ID and timestamp, then feeds back all data for that source from the timestamp onwards. It then leaves the connection open and sends any new data as it arrives. I also implemented a simple command line and a way to restart individual modules without having to restart the whole daemon all the time.
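The basic shape of the daemon is a standard unix domain socket server; a cut-down sketch with placeholder names, leaving out the live-data and command line parts:

```python
# Cut-down sketch of the daemon: accept a connection, read a source ID
# and timestamp, and stream back stored data from that point onwards.
# The socket path, wire format and fetch_history hook are placeholders.
import os
import socket
import struct

SOCK_PATH = '/tmp/collector.sock'   # illustrative path

def serve(fetch_history):
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen(5)
    while True:
        conn, _ = server.accept()
        # assumes the 8 header bytes arrive in one read (fine for a sketch)
        source_id, since = struct.unpack('!II', conn.recv(8))
        for record in fetch_history(source_id, since):
            conn.sendall(record)
        # the real daemon keeps the connection open here and pushes
        # new data to the client as it arrives
```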