This is a multi-part series
A good development team will already have a battery of tests in place to ensure their code is correct and can be refactored safely. But if performance is a feature, it’s just as critical to have good performance tests in place. There are multiple issues to consider when writing performance tests.
You may not be testing what you think you’re testing! Modern computers and programming environments introduce non-obvious complexity. Take some time to educate yourself about
We wrote a simple RPC protocol to work across WebSockets for a project we worked on. None of the existing ones had nice support for the kinds of operations we felt were important. It was (and is) simple, and it used the lovely Reactive Extensions to handle multi-threading simply (no bugs found yet), but we wanted to make sure it was fast enough in real life. We can run all the benchmarks we want against some of the internals (by mocking the transport layer, for instance), but this doesn’t tell us how things work against a real network card.
So we send a bunch of ping/pong messages back and forth between the app server and a browser and figure out the average round trip time. Perhaps it’s obvious, but we also run a TCP “ping” at about the same time and subtract this average time from our benchmark. Since we’re working across a TCP connection, we’re never going to get any faster than this.
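The subtraction itself is simple arithmetic. Here is a minimal sketch in Python, assuming the round-trip samples have already been collected; the function name and the sample numbers are made up for illustration and are not taken from our actual benchmark.

```python
from statistics import mean


def adjusted_round_trip_ms(rpc_samples_ms, baseline_samples_ms):
    """Average RPC round trip minus the average raw network round trip."""
    if not rpc_samples_ms or not baseline_samples_ms:
        raise ValueError("need at least one sample of each kind")
    return mean(rpc_samples_ms) - mean(baseline_samples_ms)


# Made-up sample numbers, for illustration only.
rpc = [12.0, 11.5, 12.5, 12.0]        # ping/pong through the RPC layer
tcp_ping = [10.0, 10.5, 9.5, 10.0]    # bare TCP round trips taken at about the same time
overhead = adjusted_round_trip_ms(rpc, tcp_ping)
print(f"protocol overhead: {overhead:.2f} ms")
```

What is left after the subtraction is the part of the round trip that the protocol and our code are responsible for, which is the only part we can actually improve.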
We’re working on benchmarking visualization of some relatively large datasets. But to look at a big dataset, first you have to load it. It’s not fair to penalize the rendering package for the sins of our data format, or for the speed of a network disk. So we have one set of benchmarks to look at data loading, then another for rendering, assuming all the data is already in memory. In truth, a final solution may be able to do better than the simple sum of these numbers by overlapping rendering and I/O, but measuring that looks a lot more complex, so we start with something simple.
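As a rough sketch of keeping the two phases apart, the Python below times each phase on its own. The `load_dataset` and `render` functions are hypothetical stand-ins, not our real loader or rendering package.

```python
import time


def time_phase(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def load_dataset(n):
    # Hypothetical stand-in for reading a large dataset from disk or network.
    return list(range(n))


def render(data):
    # Hypothetical stand-in for rendering data that is already in memory.
    return sum(x * x for x in data)


data, load_seconds = time_phase(load_dataset, 100_000)
_, render_seconds = time_phase(render, data)
print(f"load: {load_seconds:.4f}s  render: {render_seconds:.4f}s")
```

Reporting the two numbers separately makes it clear which side a regression came from.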
Machines have precise timers on them these days, yes. And sometimes microseconds matter. (Surprisingly often in some of the places we work.) But your CPU is a noisy place; we’ll get to that in a minute. Measuring one call to your function may be like trying to fit someone for a suit as he jogs past you. Run it a few thousand times. Or don’t even measure just your function: measure the larger service call it’s in aid of. Measure at least a few milliseconds, if not longer.
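One way to follow this advice, sketched in Python: keep calling the function in batches until a minimum amount of wall-clock time has passed, then divide. The helper name, the 5ms floor, and the batch size are assumptions chosen for illustration.

```python
import time


def time_per_call(fn, min_seconds=0.005, batch=1000):
    """Call fn in batches until at least min_seconds have elapsed.

    Returns (seconds_per_call, total_calls). The figure includes the
    overhead of the Python loop itself, so treat it as an upper bound.
    """
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        for _ in range(batch):
            fn()
        calls += batch
        elapsed = time.perf_counter() - start
    return elapsed / calls, calls


per_call, calls = time_per_call(lambda: sum(range(100)))
print(f"{per_call * 1e6:.2f} microseconds per call over {calls} calls")
```

Python’s standard `timeit` module does something similar and is usually the better choice for quick experiments.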
Your team is full of testing zealots, and every commit to your CI server runs hundreds or thousands of great unit tests for correctness. In fact, a lot of those tests probably run on the same machine that’s doing your build. In parallel. As far as correctness goes, that’s fine. But that test you wrote after carefully considering caching, the storage hierarchy, JIT pauses, and everything else? Now you’re trying to measure a teaspoon of water in a thunderstorm. That build server is probably opening and closing hundreds of files a second, dealing with thousands of network packets, and spinning CPUs like mad.
You need to find a way to mark your performance tests as such. (Consider, for instance, an NUnit Category.) And then run them in a quiet place so they have some repeatability. You’ll probably have to set aside their own machine. (Physical, if you can get it. Who knows what else is going on on that oversubscribed virtual machine.) And they’ll have to run one at a time. With careful bounds. Set up a script so that a network proxy always simulates 5% packet loss, 250ms round-trip time, and 5Mbps bandwidth. Always run it on a 2.5GHz CPU. Configure the machine to only use 4GB of RAM. Whatever it takes. Otherwise you’re just measuring everything else going on in the system.
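The marking idea carries over to other test frameworks. As a rough sketch using Python’s standard `unittest` module (not NUnit), performance tests can be gated behind an environment variable that is only set on the dedicated machine. The variable name `RUN_PERF_TESTS` and the test class are invented for illustration.

```python
import os
import unittest

# Hypothetical switch: set RUN_PERF_TESTS=1 only on the quiet, dedicated machine.
RUN_PERF = os.environ.get("RUN_PERF_TESTS") == "1"


@unittest.skipUnless(RUN_PERF, "performance tests only run on the dedicated machine")
class PingPongPerformanceTest(unittest.TestCase):
    def test_round_trip(self):
        # Placeholder body; a real test would time the operation
        # and record the result rather than assert on it.
        self.assertTrue(True)
```

On the ordinary build server the class is reported as skipped, so the correctness tests still run in parallel without disturbing any timing.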
Even with the best of preparation, it’s common for the same test on the same data to yield performance numbers that differ by large amounts; 10-100% is common, especially on virtualized hardware. It may also be unreasonably difficult to always test on the same hardware. Instead of having a performance test “fail” because it’s too slow, log one or more figures into a central database, along with the machine configuration and information about the dataset being tested. Then monitor this database for any test that jumps out of a band you determine empirically, perhaps one or two standard deviations.
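A minimal sketch of this log-and-monitor approach in Python follows. It uses an in-memory SQLite database as a stand-in for the central database, and the table layout, function names, and two-standard-deviation default are all assumptions made for illustration.

```python
import sqlite3
from statistics import mean, stdev


def open_results_db(path=":memory:"):
    """Open (or create) the results store; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "test TEXT, machine TEXT, dataset TEXT, seconds REAL)"
    )
    return db


def record(db, test, machine, dataset, seconds):
    """Log one measurement instead of passing or failing the test."""
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (test, machine, dataset, seconds),
    )
    db.commit()


def outside_band(db, test, machine, dataset, seconds, width=2.0):
    """True if seconds is more than `width` standard deviations away
    from the mean of earlier runs with the same machine and dataset."""
    rows = db.execute(
        "SELECT seconds FROM results "
        "WHERE test = ? AND machine = ? AND dataset = ?",
        (test, machine, dataset),
    ).fetchall()
    history = [row[0] for row in rows]
    if len(history) < 2:
        return False  # not enough history to define a band yet
    return abs(seconds - mean(history)) > width * stdev(history)
```

A monitoring job would call `outside_band` with each new measurement before recording it, and raise an alert rather than break the build when it returns true.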
Depending on what in your system you’re measuring, you may want to look at more than just an average. If the median query time for a typical user request is 10ms, but 5% of requests take 4000ms, this might not be okay. In this case, you may want to calculate a distribution of results, say the P10/P50/P90 of the last 10 days of results (plus a higher percentile such as P99 when the slow tail is small), and judge based on that.
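To make this concrete, here is a small Python sketch using the standard `statistics` module. The data is made up: 95 requests at 10ms and 5 requests at 4000ms. The function name and the choice of percentiles are assumptions for illustration.

```python
from statistics import mean, quantiles


def summarize(samples_ms):
    """Return the mean and a few percentiles of a list of timings."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": mean(samples_ms),
        "p10": cuts[9],
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }


# Made-up timings: 95 fast requests and 5 very slow ones.
samples = [10] * 95 + [4000] * 5
summary = summarize(samples)
print(summary)
```

For this sample the mean comes out at 209.5ms, a figure no individual request actually experienced, while P50 and P90 both read 10ms and only P99 shows the 4000ms tail. Which percentiles to track depends on how rare the slow cases are.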