AMP, the Active Measurement Project, is a system for making active network measurements. It is deployed at most universities in New Zealand and at most of the non-telco ISPs. It has a large number of built-in tests. Performance measurements from the public system are available at http://erg.cs.waikato.ac.nz
The NLANR AMP active measurement project was led by Tony McGregor. At that time AMP was the largest and most widespread active measurement system. It was designed for high performance research and education networks, especially the US Internet2 networks. It was deployed by the research and education networks in 11 countries (USA, Canada, Taiwan, Norway, Finland, Australia, Thailand, Japan, Ireland, Hungary and Korea). There were approximately 140 measurement points worldwide.
Spent some more time working on measured. Tests will now be forked and
run (currently just running touch or ping to check it works), with a timer
scheduled to kill any that run too long. Successful tests remove the timer
once they complete - catching the SIGCHLD from the test lets me do all the necessary cleanup.
Tested it briefly on an emulation machine with 1000 tests scheduled
simultaneously every 20 seconds. Led to discovering a few small bugs with
the signal handling. After fixing them it all seems to run well, as long
as the watchdog timeout for hung tests is not too short (there isn't
always enough CPU time to go around). Everything works fine with slightly
fewer tasks or a slightly longer timeout.
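The fork-and-watchdog pattern described above can be sketched roughly as follows (a Python sketch of the idea only, not the actual measured code, which is C and uses fork/exec with SIGCHLD; the `run_test` name and timeout values are illustrative):

```python
import subprocess
import sys
import threading

def run_test(cmd, timeout):
    # Fork the test as a child process (subprocess.wait() plays the
    # SIGCHLD role that the real C code handles with a signal handler).
    proc = subprocess.Popen(cmd)

    # Schedule a watchdog timer to kill the test if it runs too long.
    watchdog = threading.Timer(timeout, proc.kill)
    watchdog.start()

    # When the test exits on its own, remove the watchdog so it never fires.
    status = proc.wait()
    watchdog.cancel()
    return status

# A test that finishes in time keeps its exit status...
ok = run_test([sys.executable, "-c", "pass"], 5.0)
# ...while a hung test is killed by the watchdog and exits abnormally.
hung = run_test([sys.executable, "-c", "import time; time.sleep(30)"], 0.3)
```

The cancel-on-exit step is what makes successful tests cheap: the timer only ever fires for tests that genuinely hang.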
Had another discussion with Shane about how we should structure tests and
started fleshing out a skeleton/example test. Basing it on a similar
structure to how Maji loads its various decoders etc, with lots of shared
objects that register various properties of the test when they are loaded.
Spent some more time reading bits of honours reports before they were submitted.
Updated addressing on the KAREN AMP machines so they would continue to work
with recent network changes. In the process of doing so, discovered that
CFEngine would no longer update certain sites and spent quite a while
trying to debug it. It was failing to authenticate server keys properly,
which was fixed by forcing it to refetch the (exactly the same) key. Not
impressed that it is acting flaky over something like this.
Started work on a new implementation of AMP using some of the ideas we've
been talking about. Currently I'm working on a reimplementation of
measured using libwandevent. At this stage it can read the old format of
schedule file and creates a new timer event for each one, runs a dummy
function when the time arrives and reschedules itself afterwards.
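A minimal version of that timer loop might look like the following (a sketch of the rescheduling logic only; libwandevent's real API differs, and the class name here is invented):

```python
import heapq
import itertools
import time

class Scheduler:
    """Repeating timer events: each entry fires, runs its callback, and
    reschedules itself one period later, mirroring the dummy-function
    behaviour described above."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker for equal fire times

    def add(self, period, callback):
        fire_at = time.monotonic() + period
        heapq.heappush(self._heap, (fire_at, next(self._seq), period, callback))

    def run(self, iterations):
        for _ in range(iterations):
            fire_at, _, period, cb = heapq.heappop(self._heap)
            time.sleep(max(0.0, fire_at - time.monotonic()))
            cb()                  # run the (dummy) test function
            self.add(period, cb)  # reschedule itself afterwards

fired = []
sched = Scheduler()
sched.add(0.01, lambda: fired.append(time.monotonic()))
sched.run(3)
```

A heap keyed on fire time keeps the next-due event cheap to find even with a schedule file full of entries.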
Continued working with Nathan to get smokeping data successfully into the
event detection system. I generated some random data to fill the
historical buffers and then continued to run it over live data, which
generated a small number of plausible looking events. I'm now looking into
the scalability and resource usage of this as it seems a little higher
than it should be. Also polished the dashboard graphs slightly, changing
them to use more sensible axes and better resolution data.
Spent some time with Richard, Tony and Shane thinking about the future
direction of AMP. We've got some good ideas and have a whiteboard full of
initial planning for the work that needs to be done.
Read draft introductions to a number of 520 reports and gave some
hopefully useful feedback. Everyone seems to be on the right track so far,
looking forward to reading more.
Tried to make the generated alerts more efficient and more effective by
very slightly delaying the actual alerting - doing so means that the alert
can contain any other events that arrive immediately after the triggering
event. It also now sends me emails for certain event thresholds, but I
broke the live import of AMP data so need to fix that before I can get
more than the emails generated by my test data.
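The delayed-alert idea reduces, in essence, to folding any events that arrive within a short window of the triggering event into a single alert. A toy version of that grouping (the function name, record shapes and window length are made up for illustration; this is not the real eventing code):

```python
def batch_alerts(events, delay):
    """Group (timestamp, description) events so that one alert covers
    the triggering event plus anything arriving within `delay` seconds
    of it."""
    alerts = []
    current = []
    for ts, desc in sorted(events):
        if current and ts - current[0][0] > delay:
            alerts.append(current)  # window expired: emit the alert
            current = []
        current.append((ts, desc))
    if current:
        alerts.append(current)      # flush the final pending alert
    return alerts

# Two events one second apart share an alert; a much later one does not.
grouped = batch_alerts([(0, "loss"), (1, "latency"), (30, "path")], delay=2)
```

The trade-off is exactly the one described above: a slight delay before the first notification, in exchange for fewer and more complete alerts.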
Started trying to make the information presented in the default web
interface a bit more concise and relevant to what is going on right now.
Trying to use a few graphs to give an initial overview of the recent data
while keeping the ability to go look at everything in detail.
The AMP deployment on the NLNOG RING was mentioned during a talk at RIPE
about the RING along with screenshots and links back to WAND. The slides
look pretty good and I think it went well.
Fixed the bug in the tput test that would sometimes cause it to refuse
connections for a minute when it was meant to be re-establishing a new
test connection. It was erroneously waiting for more data when there was
no more to follow, so wouldn't continue until select timed out. Updated
the NZ mesh with all the fixes from the last couple of weeks.
Worked on the backend for the event detection web interface to use a more
flexible and secure database abstraction. Made a few small changes to the
web interface to try to hide information that wasn't always needed, but
still make it available if required.
Investigated all the historical event groupings and found a few rare cases
where it wasn't doing the right thing due to the order in which events
arrived or due to some missing common attributes. Came up with an approach
to sometimes rebuild groups as needed to minimise their number.
Fixed a bug in the AMP web API that would give incorrect traceroute data
for the last measurement bin in certain situations and was causing issues
with the path change event detection algorithms. Also, after running an
AMP client with the threading fixes for a week on some of the machines
most often affected by the bug I'm pretty confident that it's fixed.
Fixed the group membership checks using common path information to
properly group events based on all items having a shared attribute. I'm
quite happy with the contents of the new groups, they make good sense and
can help show underlying problems in intermediate networks that aren't
immediately obvious from looking at just sources/destinations.
Started putting together a database schema and web interface for a very
simple alerting system using the event groups detected.
Found what looks to be a threading bug in AMP measured that has been
troubling me for a while. Test threads check that the nametable is up to
date before running but it was possible for them to deadlock on accessing
the file. I've had a fixed version of the code running for a couple of
days on some of the machines that were most often affected and have yet to
see the problem again. Hopefully that's fixed!
Fixed a couple of small bugs in the AMP matrix tooltips that were
triggering events on child elements without the appropriate attributes, so
the tooltips weren't displaying their information.
Wrote a program to insert common path information into the event database
to use for grouping events. Testing so far with this data shows that
fewer, larger groups of events are being created. Some of the membership
is a little bit questionable, so am now in the process of having it
describe the reasoning behind creating each of the groups.
Ran some more tests on the IPv6 packet filtering in the AMP ICMP test and
it does indeed appear that the errors are due to packets arriving between
the socket being opened and the filter being applied. That makes most of
the warnings much less worrying, and I've lowered the priority on those
that I can confirm aren't an issue. While investigating this I also found
a situation where various test resources weren't being freed in the
traceroute test if they involved IPv6 addresses. Fixed that as well.
Finished updating the protocol between the different parts of the event
detection process to use the new protocol design. Also changed it from
using local unix sockets to run across the network, as our data sources
will likely be on different machines to the eventing system. Socket input
for the time series data is also now supported.
Updated the sample web scripts that display event information to work with
the new database schema to confirm that everything is still working.
Pushed out the AMP matrix changes to the NLNOG RING. Also investigated
colouring cells based on current performance vs historical performance
rather than raw latency values, which was a request they had.
Short week this week due to being in Wellington for Thursday and Friday.
While I was there I caught up with Jamie and Sam Russell at REANNZ for a
chat about AMP and perfSONAR deployments on the network. There should be a
lot of new monitors going in shortly and it would be great if we could run
both measurement platforms.
Spent some time investigating error messages that have been showing up
lately in amplet logs. It appears there is some weirdness happening with
raw icmp6 sockets receiving packets that should have been filtered out by
a socket option. Reading through the kernel source it looks like filters
are doing exactly what they should be doing and I now believe it's due to
packets arriving and being buffered in the time between the socket being
created and the filters being set.
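If that explanation is right, one mitigation is to drain anything already queued on the socket immediately after the filter is applied, before the test starts reading. Sketched here with an ordinary datagram socket pair, since raw ICMPv6 sockets need privileges; the drain logic is the point, not the socket type:

```python
import socket

def drain_stale_packets(sock):
    """Discard anything buffered on `sock` before we start reading, so
    packets that arrived before a filter took effect can't produce
    spurious warnings. Returns the number of datagrams discarded."""
    sock.setblocking(False)
    discarded = 0
    while True:
        try:
            sock.recv(4096)
            discarded += 1
        except BlockingIOError:
            break               # queue is empty
    sock.setblocking(True)
    return discarded

# Simulate packets that were queued before the filter was in place.
reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
writer.send(b"stale-1")
writer.send(b"stale-2")
stale = drain_stale_packets(reader)
```

In the real C test the equivalent would be a non-blocking recv loop run once, right after the ICMP6_FILTER socket option is set.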
Changed the tooltips in the matrix display to all be fetched via ajax
calls, so none of that data is sent to the client initially. This should
speed up page generation (no need to fetch data for the last week) and
shrink the raw page size further. Will hopefully deploy and test this on
the NLNOG RING matrix shortly.
Sat down with Shane and went over what we need to do to get our event
detection programs integrated. The protocol used between data fetching,
detection and eventing needs to be updated slightly as there is more
information needing to be shared and magic numbers from testing to be
changed into real data. Started work on updating the protocol to match
what is required and updating the database schema to match.
Integrated the path change detection into the main detection code and
updated the main code path to deal properly with the slight differences
between a wider variety of data - traceroute, latency, byte counts, etc.
Did a bit of maintenance on the NZ AMP mesh and KAREN weathermap as well -
updated some addresses and got IPv6 addresses for the Citylink and
Netspace AMPlets which is neat.
Spent some time improving the performance of the NLNOG RING AMP matrix
page - with tens of thousands of cells the page got rather large and slow.
I've culled the individual tooltips down to one reusable one, drastically
reducing the number of DOM elements on the page as well as reducing the
size of the raw HTML. It's still a monster but is almost becoming
manageable. Next step will likely be to move all the tooltip data to an
AJAX call rather than embedding it in the page.
Fixed up the getCommonPath function in AMPcentral to better fetch data
from the desired time period. The ending condition for the time period was
using an incorrect value which resulted in using much longer periods. This
now gives me correct path data through the web interface which I can use
for event detection and hopefully smarter grouping of events. Added
database support for dealing with common attributes between sources and
destinations, now need to collect the data.
Rewrote my AMP data sampling program to properly sort all data by time
rather than by source/destination pair, and to deal properly with fetching
multiple data types in a single run.
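Merging several already time-sorted per-type streams into one stream ordered purely by time can be done lazily; a sketch of the idea (the record contents are invented):

```python
import heapq

# Each data type arrives as its own time-sorted stream of
# (timestamp, record) tuples; heapq.merge interleaves them into a
# single stream sorted by timestamp without loading everything at once.
latency = [(1, "latency-a"), (4, "latency-b")]
traceroute = [(2, "trace-a"), (3, "trace-b")]

merged = list(heapq.merge(latency, traceroute, key=lambda item: item[0]))
```

This only works because each input stream is itself sorted, which is why per-pair ordering had to be fixed first.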
Started looking at using topology data to generate more datapoints to help
group events on. Hopefully should be able to group events between sites
that share common paths (at this stage I'm planning on starting with the
AS path) as well as those that share sources and targets. As part of this
added an event detector to alert on major path changes between sites and
realised that there appears to be a bug in the AMP code to determine
common paths. Spent some time trying to track it down and it looks to be
due to counting the sample time period incorrectly, which I'm now trying to fix.
Figured out the cause of the AMP data interface module crashing on newer
PHP/Apache. An incorrectly sized variable was being used in the C portion
to receive data from the PHP portion and along the way it was clobbering
something it shouldn't have. I'm sure the compiler warned about this last
time, but not in this case.
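That class of bug is easy to reproduce: write a value through a pointer sized for a wider type than the receiving variable actually is, and the neighbouring memory gets silently overwritten. An illustrative reproduction using ctypes (the struct and field names are invented, not the actual module's):

```python
import ctypes

class Reply(ctypes.Structure):
    # Two adjacent 32-bit fields, as the C side of such an interface
    # might declare them.
    _fields_ = [("length", ctypes.c_uint32), ("flags", ctypes.c_uint32)]

reply = Reply(0, 0xDEADBEEF)

# Writing 8 bytes into the 4-byte `length` field -- what happens when
# the receiving variable is declared with the wrong size -- overruns
# into `flags`, clobbering it without any error.
ctypes.memmove(ctypes.byref(reply), (1).to_bytes(8, "little"), 8)
```

In C a `-Wall` build will often flag the mismatched type, but only when the compiler can see both declarations at once, which matches the "warned last time, but not in this case" experience.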
Spent some more time working on building useful groups of events for
RTT/loss data. I'm trying to find a compromise between including all
events that happen about the same time and grouping only those events that
are obviously related, while allowing events to be in multiple groups
where that makes sense. Some of these issues are coming about because my
sample data extraction program doesn't guarantee strictly increasing
timestamps in the warm-up phase while fetching historical data.
Tidied up some error messages in the icmp test in AMP where non-echo-reply
responses were being incorrectly examined for the embedded triggering
packet. It should now properly index into those packets and record the
correct error type codes.
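ICMP error messages (destination unreachable, time exceeded, and so on) carry the original IP header of the triggering packet plus at least its first 8 bytes, so the index into the payload depends on the embedded header's length. A sketch of the required indexing (not the actual AMP code):

```python
def embedded_packet(icmp_error_payload):
    """Return the first 8 bytes of the packet that triggered an ICMP
    error. The payload begins with the original IPv4 header, whose
    length is the IHL field times four, so the index must skip a
    variable-length header (IP options included)."""
    ihl_bytes = (icmp_error_payload[0] & 0x0F) * 4
    return icmp_error_payload[ihl_bytes:ihl_bytes + 8]

# A fabricated payload: a minimal 20-byte IPv4 header (version 4,
# IHL 5) followed by the first 8 bytes of the triggering packet.
payload = bytes([0x45]) + bytes(19) + b"TRIGGER!"
first_eight = embedded_packet(payload)
```

Treating the embedded header as a fixed 20 bytes is the easy mistake here; packets with IP options make the header longer.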
Noticed that sometimes the AMP tput test was failing to run in both
directions on some nodes and tried to investigate why. Running the tests
manually works, but scheduling them through AMP often fails to get the
return path test to run. Looks like there is some sort of timing issue
where the connection takes a long time to close and this prevents it from
being re-established in the other direction (the single threaded server is
still waiting for close() to return). Have yet to figure out an answer to this.
Spent some time with Shane, Brad and Jamie poking at the Network
Diagnostic Tool (NDT) used by perfsonar, mlab and as part of the
nzbt.org.nz broadband test. Some of the results we were getting weren't of
the quality we were expecting, so we put together our own little test lab
to see how it works. Our initial tests using a virtualised server couldn't
sustain gigabit speeds across the network bridge in one direction, despite
working fine in the other (and NDT performed less than half as well as
iperf). With two physical, directly connected machines we finally managed
to get the expected TCP performance but the extra analysis that NDT
performed was still bogus - it reports network limits that are much lower
than they actually are (and lower than what the test just observed!), RTT
values that are 500 times larger than they actually are, etc.
Updated ampcentral to build with a newer gcc, had missed this when I
rebuilt the amplet packages last week. Sent all that off to NLNOG RING
which will hopefully solve the problems they had getting it running.
Put some of my data into the web system Shane wrote for evaluating event
detection. Through that I found a couple of events being generated that
shouldn't have been, due to a previous event still being active. Spent some
time reworking the detector classes to prevent multiple events from
different instances of the same detector being triggered in close
proximity, as they add no new information.
Worked on some ldap scripts to add new voodoo users into appropriate
groups. Tested them successfully on a machine configured very nearly the
same as voodoo. Will hopefully give them a run early next week and see how
well they work in the real environment.
Expanded on some of the install instructions for ampcentral in response to
some new users having issues. There was lots missing about new database
configuration that needs to be done to make sites and data available via
the web interface. Also built new Ubuntu and Debian amplet packages for
them to use. Had a few issues with the SSL API changing slightly and new
gcc being much more pedantic, but it builds and runs now.
Updated the display of event groups to show all events from other sites
that are related to help give a feel for what is going on. Found and fixed
a few issues with recording the timing of event groups when starting with
a fresh database and importing historical data - they should now start at
the time of the first event in the group and end at the time of the last one.
Fixed the http2 test in AMP to properly share the DNS cache between
simultaneous connections which means it no longer performs unnecessary
lookups for the same name. The sharing interface in libcurl actually works now.
Tried to build new amplet packages including the recent changes, but ran
into some problems with libraries when building in my lenny buildroot.
Autoconf/make is meant to build a particular binary with an extra library
that the rest don't need, but this doesn't make it through to the Makefile.
Jamie put together new RJ45-DB9 serial connectors for the emulation
network, so I created some sensible minicom configs for all the machines,
should be just as easy to use now as the old system with the Cyclades
terminal server was. Also set up udev on my linux image to force a
consistent order of the network interfaces.
Had more trouble with the emulation network than expected. Built a custom
ramdisk using proper tools (mkinitramfs etc) and a default Debian kernel
rather than using entirely custom ones. This worked fine, but after
installing an old image the machines would refuse to boot. Turns out that
the disks used to be part of a RAID and didn't have a useful MBR (and
frisbee didn't fix this). Built a master machine to create images from,
built a new Squeeze image and am now trying to convince frisbee to send
the entire image. I'm thinking a new version of frisbee may be required.
Tried to track down why the http2 test was performing DNS queries multiple
times for the same hostnames. Even with all the DNS cache sharing options
set in libcurl it will repeat requests, unless I force IPv4 only. A vague
line in a changelog looks like this might be fixed, but too recently for
the change to make it into Debian.
Finished putting together basic historical usage data for KAREN, it shows
a nice up and to the right trend.
Added some new http2 test destinations to the main AMP test schedule.
Started running them on Massey in response to a query about web
performance and in doing so found and fixed a few display bugs. Had
another look at using the logs from the test to generate waterfall graphs
of http connections (using http://www.softwareishard.com/har/viewer/) and
found a few cases where libcurl might not be behaving as expected.
Spent some time talking with Shane and planning out how we can fit
everything together in a useful fashion for the MSI project.
Started investigating the best way to aggregate measurements from the last
few years of the KAREN weathermap to look at the growth of the network.
Watched some of the streamed presentations by Josh on OpenFlow.
Got the new graphs for latency and traceroute data rolled out onto the
live ampcentral site. Updated the install scripts to properly create the
new databases for API keys etc required to access the data in the new
style. Put this and the most recent amplet tarballs online on the WAND
research software site so they are now available, should write a quick
blog post advertising them now.
Tried further to track down the cause of incorrect data sometimes being
reported in the new combined dns/icmp test. In the process fixed a couple
of small bugs in the options parsing, but the main one refuses to show up
while being observed.
Watched a couple of streamed sessions of the Future Broadband conference
that seemed interesting. A lot of businesses at the conference seem quite
cautious and risk averse with UFB, not wanting to invest in it till it has
proven itself in some way, while user advocacy groups can't wait to get
online. Could be some interesting developments here in the next while.
Built AMP on FreeBSD for the first time in quite a while and found that
only very minor changes were required to build successfully. Getting the
self checks running in Linux and FreeBSD took a bit more effort with
things like data formats having changed without the checks being updated
to match. Updated a lot of tests, contact details, documentation etc to
useful current values.
Started setting up a staging point to get the new traceroute graphs
working in ampcentral and deployed. A few changes needed to work with
newer php as well as tidying up the way some of the data formats are used.