Brendon Jones's blog
Started looking at using topology data to generate more datapoints to help
group events. Hopefully I should be able to group events between sites
that share common paths (at this stage I'm planning on starting with the
AS path) as well as those that share sources and targets. As part of this
I added an event detector to alert on major path changes between sites and
realised that there appears to be a bug in the AMP code to determine
common paths. Spent some time trying to track it down and it looks to be
due to counting the sample time period incorrectly, which I'm now trying
to fix.
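Roughly the kind of comparison I have in mind, sketched in Python (the AMP code itself is C; the function name is mine and the ASNs are from the documentation range):

```python
def common_path_fraction(path_a, path_b):
    """Fraction of the shorter AS path that forms a shared prefix.

    A value near 1.0 suggests two measurements traverse largely the
    same route, so their events are candidates for grouping.
    """
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return shared / min(len(path_a), len(path_b))

# Two paths that diverge only at the far end score highly.
print(common_path_fraction([64496, 64497, 64498, 64499],
                           [64496, 64497, 64498, 64511]))  # 0.75
```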
Figured out the cause of the AMP data interface module crashing on newer
PHP/Apache. An incorrectly sized variable was being used in the C portion
to receive data from the PHP portion and along the way it was clobbering
something it shouldn't have. I'm sure the compiler warned about this last
time, but not in this case.
I'm still not happy with the event groupings that I'm getting, so spent
some time looking further into the literature for ideas. A little bit of
this is caused by my test data fetching program presenting information
out of chronological order (Nathan should hopefully have his version working
soon and I can use that), but a lot is still due to making bad grouping
decisions. Found a few interesting ideas about comparing events for
similarity, but most require more attributes to compare on. Thinking I
should start investigating other attributes that I can add to the events -
information about paths etc.
Wrote up a blog post about
our adventures with NDT, hopefully that will be of interest to a few
people. Might be worth looking deeper into it to see if we can track down
what was causing some of the performance issues.
Spent some time looking at various emulation network issues. update-grub
is sometimes using the wrong root partition, so running it often results
in an unbootable machine.
Last week REANNZ made an announcement launching their new New Zealand Broadband Test website. We heard a few reports of inconsistent or unexpected results when compared to expected or speedtest.net results and thought it worth a look to see how well it did in fact perform. If there were any problems then hopefully we could work with REANNZ to get them fixed and improve the experience for their users. The main part of this testing ended up involving getting the Network Diagnostic Tool (NDT) working satisfactorily in a lab environment, though our extra vantage point did help guide some improvements in the NZBT infrastructure.
Spent some more time working on building useful groups of events for
RTT/loss data. I'm trying to find a compromise between including all
events that happen about the same time and grouping only those events that
are obviously related, while allowing events to be in multiple groups
where that makes sense. Some of these issues are coming about because my
sample data extraction program doesn't guarantee strictly increasing
timestamps in the warm-up phase while fetching historical data.
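A minimal sketch of the compromise I'm aiming for (window size and event fields are made up): events are sorted up front so out-of-order input from the extraction program doesn't split groups artificially, and an event may join every group it is compatible with.

```python
from collections import namedtuple

Event = namedtuple("Event", "ts source target")

WINDOW = 300  # seconds between related events; purely illustrative

def group_events(events):
    """Group events close together in time, letting one event belong
    to multiple groups where that makes sense.

    Sorting first means non-chronological input doesn't matter.
    """
    groups = []  # each group is a list of events
    for ev in sorted(events, key=lambda e: e.ts):
        joined = False
        for g in groups:
            if ev.ts - g[-1].ts <= WINDOW and (
                    ev.source == g[-1].source or ev.target == g[-1].target):
                g.append(ev)
                joined = True
        if not joined:
            groups.append([ev])
    return groups
```

An event sharing endpoints with the most recent member of two different groups ends up in both, which is the "multiple groups" behaviour I want.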
Tidied up some error messages in the icmp test in AMP where non-echoreply
responses were being incorrectly examined for the embedded triggering
packet. It should now properly index into those packets and record the
correct error type codes.
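The shape of that fix, sketched in Python rather than the actual AMP C code: for ICMP error messages the original (triggering) packet sits after an 8 byte ICMP header, behind the embedded IP header.

```python
import struct

def parse_icmp_error(icmp):
    """Extract type/code from an ICMP error message and locate the
    embedded triggering packet (an illustration, not the AMP code).

    For errors such as destination unreachable or time exceeded, the
    original IP header starts 8 bytes into the ICMP message.
    """
    icmp_type, icmp_code = struct.unpack_from("!BB", icmp, 0)
    # Low nibble of the embedded IP header's first byte is its length
    # in 32-bit words.
    ihl = (icmp[8] & 0x0F) * 4
    embedded = icmp[8 + ihl:]  # leading bytes of the triggering packet
    return icmp_type, icmp_code, embedded
```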
Noticed that sometimes the AMP tput test was failing to run in both
directions on some nodes and tried to investigate why. Running the tests
manually works, but scheduling them through AMP often fails to get the
return path test to run. Looks like there is some sort of timing issue
where the connection takes a long time to close and this prevents it from
being re-established in the other direction (the single threaded server is
still waiting for close() to return). Have yet to figure out an answer to
this.
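One possible mitigation (not necessarily what AMP should do, just a sketch of the idea): an abortive close via SO_LINGER with a zero timeout makes close() return immediately instead of waiting for the peer to drain data, at the cost of sending a RST.

```python
import socket
import struct

def abortive_close(sock):
    """Close a TCP socket without lingering on unsent data.

    l_onoff=1, l_linger=0 turns close() into an immediate reset,
    freeing a single threaded server to accept the next connection.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()
```

Whether losing queued data like this is acceptable for a throughput test that has already finished transferring is exactly the question to answer.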
Spent some time with Shane, Brad and Jamie poking at the Network
Diagnostic Tool (NDT) used by perfsonar, mlab and as part of the
nzbt.org.nz broadband test. Some of the results we were getting weren't of
the quality we were expecting, so we put together our own little test lab
to see how it works. Our initial tests using a virtualised server couldn't
sustain gigabit speeds across the network bridge in one direction, despite
working fine in the other (and NDT performed less than half as well as
iperf). With two physical, directly connected machines we finally managed
to get the expected TCP performance, but the extra analysis that NDT
performed was still bogus - it reported network limits that were much lower
than they actually were (and lower than what the test had just observed!), RTT
values that were 500 times larger than they actually were, etc.
Updated ampcentral to build with a newer gcc, had missed this when I
rebuilt the amplet packages last week. Sent all that off to NLNOG RING
which will hopefully solve the problems they had getting it running.
Put some of my data into the web system Shane wrote for evaluating event
detection. Through that I found a couple of events being generated that
shouldn't have been, due to a previous event still being active. Spent some
time reworking the detector classes to prevent multiple events from
different instances of the same detector being triggered in close
proximity, as they add no new information.
Worked on some ldap scripts to add new voodoo users into appropriate
groups. Tested them successfully on a machine configured very nearly the
same as voodoo. Will hopefully give them a run early next week and see how
well they work in the real environment.
Expanded on some of the install instructions for ampcentral in response to
some new users having issues. There was a lot missing about the new
database configuration that needs to be done to make sites and data available via
the web interface. Also built new Ubuntu and Debian amplet packages for
them to use. Had a few issues with the SSL API changing slightly and new
gcc being much more pedantic, but it builds and runs now.
Updated the display of event groups to show all events from other sites
that are related to help give a feel for what is going on. Found and fixed
a few issues with recording the timing of event groups when starting with
a fresh database and importing historical data - they should now start at
the time of the first event in the group and end at the time of the last.
Updated my detectors to output events using the new protocol and hooked it
into the eventing tool. It now runs online and stores events in the
database rather than text files. Put together an example webpage to try to
expose most of the information from the database and see what might be
interesting to do - presenting just the useful information in a tidy way
is going to be an interesting project. It's quite cool to be able to see
all events going on to and/or from every site, but a little overwhelming
in terms of the amount of data presented.
Sat down with Jamie and Shane and went over some of the sysadmin things
that we will be taking over. Got some ldap account management scripts to
update which should help bring me up to speed with how it all works.
Spent some time reading literature around aggregating or combining event
alerts to see what the current state of the art is. Lots of information
about describing combinations of events (e.g. if A followed by B followed
by (C or D)) or filtering repeated events, but not much on combining
similar events from multiple monitors/locations. I'll look into this again
I think, but for now my simple rules based on intersecting event sources
or targets within a few minutes of each other will suffice.
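That simple rule is easy to state in code; a sketch assuming events carry timestamp, source and target fields (the field names and window are mine):

```python
FEW_MINUTES = 180  # seconds; "within a few minutes" is deliberately loose

def related(ev_a, ev_b):
    """Two events are candidates for the same group if they share a
    source or a target and occur within a few minutes of each other.
    """
    close_in_time = abs(ev_a["ts"] - ev_b["ts"]) <= FEW_MINUTES
    share_endpoint = (ev_a["source"] == ev_b["source"]
                      or ev_a["target"] == ev_b["target"])
    return close_in_time and share_endpoint
```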
Wrote most of the structure for the event aggregation program and have it
working for my test input. Event information is accepted over a socket
from any number of sources and written to a database. Lists are maintained
of similar events (in time and location) which new events can be added to.
Event group information is also written to the database and updated as the
group membership expands.
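A stripped-down sketch of that flow (the socket layer is elided and the schema, field names and window are stand-ins, not the real ones): each incoming event is stored, joins the most recent group if it is close enough in time, and the group's timespan is updated as membership expands.

```python
import json
import sqlite3

GROUP_WINDOW = 300  # seconds, illustrative

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, ts INTEGER,"
           " source TEXT, target TEXT, grp INTEGER)")
db.execute("CREATE TABLE grp (id INTEGER PRIMARY KEY,"
           " start_ts INTEGER, end_ts INTEGER)")

def accept_event(line):
    """Store one event and return the id of the group it joined."""
    ev = json.loads(line)
    row = db.execute("SELECT id, end_ts FROM grp"
                     " ORDER BY end_ts DESC LIMIT 1").fetchone()
    if row and ev["ts"] - row[1] <= GROUP_WINDOW:
        gid = row[0]
        # Group timespan expands as membership grows.
        db.execute("UPDATE grp SET end_ts = ? WHERE id = ?",
                   (ev["ts"], gid))
    else:
        gid = db.execute("INSERT INTO grp (start_ts, end_ts)"
                         " VALUES (?, ?)", (ev["ts"], ev["ts"])).lastrowid
    db.execute("INSERT INTO event (ts, source, target, grp)"
               " VALUES (?, ?, ?, ?)",
               (ev["ts"], ev["source"], ev["target"], gid))
    return gid
```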
After some discussion with Shane we've come up with a simple protocol to
communicate event information between the various detectors and the event
aggregator. I'm now in the process of updating the simple text based
protocol with the new one and getting my detectors to interoperate in a
live fashion (rather than through reading text files like the initial
version did).
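The actual wire format isn't something I'll describe here; purely for illustration, a newline-delimited JSON scheme with invented field names would look like this:

```python
import json
import time

def encode_event(source, target, metric, severity, ts=None):
    """Serialise one event as a newline-delimited JSON message.

    An invented stand-in for the real detector/aggregator protocol;
    every field name here is an assumption.
    """
    msg = {
        "ts": ts if ts is not None else int(time.time()),
        "source": source,
        "target": target,
        "metric": metric,
        "severity": severity,
    }
    return json.dumps(msg) + "\n"

def decode_event(line):
    """Parse one message back into a dict on the aggregator side."""
    return json.loads(line)
```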
Wrote some test code to try to group similar events together and then
determine whether or not to alert on the group. Simply grouping events
that happen within a couple of measurement periods of one another looks to
work well in the general case for the data I have, and increasing the
tolerances for tests to the same target or the same metric expands quite
nicely to also catch escalating problems. Using the knowledge gained from
that I started to put together a better system to process/merge events
online and write them to a database for ease of searching.
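The widened-tolerance idea in miniature (the actual window sizes need tuning, these numbers are invented):

```python
BASE_WINDOW = 2         # measurement periods between unrelated events
SAME_TARGET_WINDOW = 6  # widened when events share a target or metric

def max_gap(ev_a, ev_b):
    """How many measurement periods apart two events may be and still
    fall into the same group.

    Widening the window for the same target or metric lets a slowly
    escalating problem stay in one group.
    """
    if (ev_a["target"] == ev_b["target"]
            or ev_a["metric"] == ev_b["metric"]):
        return SAME_TARGET_WINDOW
    return BASE_WINDOW
```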
Also started to look at how different levels of reporting could work based
on number of events in a group, number of different locations involved,
reported severity levels etc. Most single events are unlikely to be
reported in any sort of urgent fashion, but will still be recorded for
historical analysis. Filtering out these means that the more serious
events can be reported in a more time critical way, to more important
people.
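A first guess at how those levels might combine, with thresholds that are pure placeholders until I can tune them against historical data:

```python
def report_level(group):
    """Map a group of events to a reporting urgency (sketch only).

    group: list of dicts with at least "source" and "severity" keys;
    both field names and thresholds are assumptions.
    """
    locations = len({ev["source"] for ev in group})
    worst = max(ev["severity"] for ev in group)
    if len(group) == 1 and worst < 3:
        return "log"     # record for history, don't alert anyone
    if locations > 1 or worst >= 4:
        return "urgent"  # multiple sites involved, or high severity
    return "notify"
```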
Got the event markers in the flot graphs working, fetching on the fly and
caching as required. Built up a few webpages to try to get a feel for the
types, numbers, timing and locations of events and have each one linked
back to graphs of AMP data with event markers, which makes exploring
the results of the event detection a bit easier.
Spent some time looking over the events in the graphs and adjusting the
algorithms slightly to try to exclude ones that aren't really events. Also
started to think about how events could be grouped to show those that
likely have a similar cause, and how it could be determined whether or not
to alert anyone about them.