Testing the New Zealand Broadband Test
Last week REANNZ announced the launch of their new New Zealand Broadband Test website. We heard a few reports of inconsistent or unexpected results compared with expectations or with speedtest.net, and thought it worth a look to see how well it actually performed. If there were any problems then hopefully we could work with REANNZ to get them fixed and improve the experience for their users. The main part of this testing ended up being about getting the Network Diagnostic Tool (NDT) working satisfactorily in a lab environment, though our extra vantage point did help guide some improvements in the NZBT infrastructure.
The first few times we ran the test using the web interface on a very well connected machine we saw varying results: 27.7Mbps down and 43.45Mbps up, followed by results varying between 4.86Mbps and 33.3Mbps down (upload speeds were more stable). A little erratic, but perhaps some of that could be attributed to congestion on our own shared connection. The throughput test part of NDT only runs for 10 seconds in each direction, which isn't a lot of time to recover should the connection experience loss. Depending on when loss occurs, this can drastically reduce the observed bandwidth.
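To put a rough number on that, here is a sketch of how long TCP takes to recover from a single loss. It assumes classic Reno-style recovery (window halves on loss, then grows one segment per round trip), a 1Gbps link and an illustrative 20ms RTT; none of these figures are measured values from our tests.

```python
# Rough estimate of how much a single loss event can depress a short
# throughput test. Assumes TCP Reno-style recovery: the congestion
# window halves on loss, then grows by one segment per RTT.
MSS = 1460          # bytes per segment (typical Ethernet MSS)
RTT = 0.020         # 20 ms round trip - an assumed value
LINK_BPS = 1e9      # 1 Gbps link

# Window needed to fill the link (bandwidth-delay product, in segments)
full_window = LINK_BPS * RTT / 8 / MSS

# Reno needs full_window/2 round trips to climb from half window back
# to a full window after a single loss
recovery_secs = (full_window / 2) * RTT

print(f"segments to fill the link: {full_window:.0f}")
print(f"recovery time after one loss: {recovery_secs:.1f} s")
```

With these assumed numbers the sender needs over 850 round trips to regain a full window, so recovery takes longer than the entire 10-second test; a single loss early on leaves the connection below full rate for the whole run.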
The next step was to run it on a machine connected to the REANNZ Network at 1Gbps that should have enough capacity to the test server (hosted at Victoria University, connected to WIX) to get a better result than a few tens of megabits per second. For ease of testing we also switched to the command line version of the NDT client at this point. The first test with the new setup got 68Mbps down and 55Mbps up, which was an improvement but still well below what was expected. Further tests gave similar results, varying by about 10Mbps in each direction. One interesting thing about the command line client is that it gives easier access to more detailed information about the test, and in this case it kept reporting:
The slowest link in the end-to-end path is a 100 Mbps Full duplex Fast Ethernet subnet
A 100Mbps link in the path could explain the poor performance, as congestion from multiple tests would limit the throughput anyone could achieve. We later found out that there was indeed a 100Mbps SFP in the path (it has since been replaced). At the time, though, we were hesitant to believe what NDT was telling us because a lot of its other information didn't look correct, so our next step was to verify NDT in the lab.
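As a back-of-envelope check, fair sharing of a 100Mbps bottleneck among a handful of concurrent tests lands in the same range as the results we were seeing. The concurrent test counts here are assumptions, not observed numbers:

```python
# Illustrative only: fair sharing of a 100 Mbps bottleneck link among
# concurrent tests. The counts are assumptions, not measured values.
BOTTLENECK_MBPS = 100

shares = {n: BOTTLENECK_MBPS / n for n in (1, 2, 3)}
for n, share in shares.items():
    print(f"{n} concurrent test(s): ~{share:.0f} Mbps each")
```

Two or three simultaneous tests would cap each one at roughly 33-50Mbps, the same ballpark as the download speeds we saw.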
NDT in a Virtualised Lab Environment
The Web100 kernel patches required to run NDT make it a nuisance to deploy, so we used a KVM instance running a perfSONAR live CD. With a physical machine directly connected at 1Gbps as our client, running the command line version of NDT, results improved again. At 140Mbps download and 460Mbps upload it was still well shy of the 900+ Mbps we were expecting, but much more than we had got previously. While tuning TCP on the client (which didn't improve anything; all the buffers etc. were already big enough) we also changed the virtual NIC that KVM was using to virtio and moved from a tap to a bridge interface. This boosted NDT to almost 900Mbps upload and 200Mbps download. Getting better! But why the imbalance?
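The buffer check amounts to comparing the socket buffer ceiling with the bandwidth-delay product: a connection can only fill the pipe if its buffer covers the bytes in flight. A minimal sketch, where the RTT is an assumed lab figure and the sysctl path is the standard Linux one:

```python
# Sketch of the buffer check done while tuning TCP on the client: a
# socket buffer must cover the bandwidth-delay product to fill the link.
def min_buffer_bytes(link_bps: float, rtt_secs: float) -> int:
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return int(link_bps * rtt_secs / 8)

needed = min_buffer_bytes(1e9, 0.02)   # 1 Gbps at an assumed 20 ms RTT

# On a Linux host the receive buffer ceiling lives in this sysctl
try:
    with open("/proc/sys/net/core/rmem_max") as f:
        rmem_max = int(f.read())
    verdict = "OK" if rmem_max >= needed else "too small"
    print(f"need {needed} bytes, rmem_max is {rmem_max}: {verdict}")
except FileNotFoundError:
    print(f"need {needed} bytes (not a Linux host, skipping sysctl check)")
```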
Worried that there might be something broken in our test setup, we fell back to trusty iperf to test the connection. After a bit of iperf testing the upload speed looked great at 966Mbps, but download was still lagging behind at 740Mbps. Something must be wrong, but it was odd that it affected iperf much less than NDT, and why only in one direction? To remove one more variable we tested between the virtual machine and its host, not using any physical network, and still got the same results. None of the machines involved appeared to be struggling with the load of creating, sending and receiving packets - is there something about the network bridge that limits throughput? The results across 10 test runs for iperf and NDT (both client to server (c2s) and server to client (s2c)) are shown in Figure 1. Both achieve higher throughput from the client to the server, but iperf is generally higher and more consistent than NDT.
One other worrying thing here was that the extra information NDT reported about the quality of the link was clearly bogus. It varied between estimating our link speed at 45Mbps (despite pushing much more data through the link) all the way up to 10Gbps. It claimed that throughput was limited by the size of the send buffer, then immediately afterwards described the maximum possible throughput based on the buffers as 10 times what was achieved. The NAT detection complained about address modification by a middlebox because it tries comparing hostnames with IP addresses and fails to get an exact match.
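That hostname-versus-address mismatch is easy to reproduce. A small sketch of the failure mode (the hostname is a made-up example; only localhost is actually resolved here):

```python
# A naive middlebox check that compares the name the client dialled
# against the address the server saw can never match textually, even
# with no NAT involved. "ndt.example.org" is a made-up hostname.
import socket

advertised = "ndt.example.org"                       # what the client used
seen_by_server = socket.gethostbyname("localhost")   # a dotted-quad string

print(advertised == seen_by_server)      # False: hostname vs IP address
# Resolving both sides before comparing is the fair check
print(socket.gethostbyname("localhost") == seen_by_server)  # True
```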
NDT in a Physical Lab Environment
To get a fair test we had to use physical machines for both the client and the server. Again the machines were directly connected at 1Gbps and a quick test with iperf confirmed that they were getting very nearly that in both directions (~960Mbps). Running NDT across the same link got 941Mbps in both directions - finally symmetrical, consistent across multiple runs and approaching maximum possible throughput! So it appears NDT can work at gigabit speeds and give almost the expected results. In an effort to trip it up we introduced cross-traffic at varying levels but it still performed well and was not unduly affected.
The extra web100 information that NDT reports should still be taken with a grain of salt, however: the estimated round trip time was off by more than a factor of 500, the reported theoretical network limits were half of what was actually achieved, and the reported host buffer limits were less than the observed throughput.
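One quick way to catch that last contradiction is to bound throughput by buffer size over RTT: if a reported buffer limit cannot even carry the measured rate, the report refutes itself. A sketch with illustrative numbers (the 256KB buffer and 20ms RTT are hypothetical; 941Mbps is the rate from our physical lab test):

```python
# Sanity check for buffer-limit claims: the throughput a socket buffer
# permits is bounded by buffer / RTT, so a reported limit below the
# measured rate is self-contradictory. Buffer and RTT are hypothetical.
def buffer_limited_bps(buffer_bytes: int, rtt_secs: float) -> float:
    """Upper bound on throughput for a given socket buffer and RTT."""
    return buffer_bytes * 8 / rtt_secs

observed_bps = 941e6                                   # physical lab result
reported_limit = buffer_limited_bps(256 * 1024, 0.02)  # hypothetical report

# If the tool claims a limit below what was measured, distrust the report
print("report consistent" if reported_limit >= observed_bps
      else "report contradicts measurement")
```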
- In a perfect environment, NDT reports throughput that is close to what should be theoretically possible.
- The extra web100 information reported by NDT was almost always incorrect and should not be relied on.
- NDT running inside KVM doesn't perform as well as iperf does. Neither performs well when the virtual host is generating the data, but NDT is much worse. What do they do differently (smaller writes?) to get such different performance? And what could we do to our KVM set up to improve overall performance?
- In terms of predictive power, nothing seems to stand in the way of NDT giving accurate results for available TCP bandwidth.
- Tests to www.nzbt.org.nz are looking a lot more stable and predictable now that they aren't limited by a 100Mbps link.