This is part of a multi-part series.
A good development team will already have a battery of tests in place to ensure their code is correct and can be refactored safely. But if performance is a feature, it’s critical to have good performance tests in place as well. There are several issues to consider when writing performance tests:
Benchmarking Isn’t Always Obvious
You may not be testing what you think you’re testing! Modern computers and programming environments introduce non-obvious complexity. Take some time to educate yourself about:
- Compilation pauses. If you’re working in a language with a Just-in-Time compiler (JIT), make sure you’re not measuring the time it takes to compile the function you’re calling. This can take milliseconds, or even seconds, if it loads very large libraries the first time it runs. Since many benchmarks run a small test several times to get an average result, one effective technique is simply to only start measuring after the first 1 or 2 iterations have finished.
- Garbage Collection pauses. Garbage Collection is a fact of life, and it’s irresponsible to ignore it in benchmarks. At least consider whether you want to measure it or not, and take steps to quantify it, or force it, or prohibit it, during your benchmark, as appropriate.
- Caching. Because the memory hierarchy is so complex in modern machines, you can find yourself chasing ghosts during benchmarking. It’s no good measuring the time to load and parse a file in a loop. The first iteration will pay the price for one or more disk seeks, disk loads, buffer copies, and the like. Later iterations will look almost free as the operating system hands you the same buffers again and again, without consulting the disk. If this is what you’re counting on, then good for you. But that’s not always so. Measure the jobs separately. Defeat per-library caches (like HDF5) by running multiple processes. Turn off caching in the OS for a while. But make sure you know it’s there.
- Cache and memory bus effects. Your CPU can do much faster work on datasets that fit in its L1 or L2 cache than it can against data that’s found in main memory. Benchmarking against an artificially small or large dataset can be very misleading, even by orders of magnitude.
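The warm-up and garbage-collection points above can be combined into a small harness. Here is a minimal sketch in Python (the `bench` helper and its parameter names are our own, not from any benchmarking library); it discards the first few iterations and starts each recorded run from a known GC state:

```python
import gc
import time

def bench(fn, iterations=1000, warmup=2, collect_gc=True):
    """Time fn() per iteration, discarding the first `warmup` runs so
    compilation and cold-cache costs don't skew the recorded numbers."""
    timings = []
    for i in range(iterations + warmup):
        if collect_gc:
            gc.collect()  # start each run from a known GC state
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # only record post-warmup iterations
            timings.append(elapsed)
    return timings

# Example: time a cheap function; the warmup runs are excluded.
results = bench(lambda: sum(range(1000)), iterations=100, warmup=2)
print(len(results))  # 100 recorded timings, warmup discarded
```

Whether you call `gc.collect()` (to exclude collection cost) or leave the collector alone (to include it) is exactly the decision the bullet above asks you to make deliberately.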
Don’t Measure Someone Else’s Problem
We wrote a simple RPC protocol to work across WebSockets for a project we worked on. None of the existing ones had nice support for the kinds of operations we felt were important. It was (and is) simple, and used the lovely Reactive Extensions to handle multi-threading simply (no bugs found yet), but we wanted to make sure it was fast enough in real life. We can run all the benchmarks we want against some of the internals (by mocking the transport layer, for instance), but this doesn’t tell us how things work against a real network card.
So we send a bunch of ping/pong messages back and forth between the app server and a browser and figure out the average round trip time. Perhaps it’s obvious, but we also run a TCP “ping” at about the same time and subtract this average time from our benchmark. Since we’re working across a TCP connection, we’re never going to get any faster than this.
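The baseline-subtraction step is simple arithmetic, but it’s worth writing down. A sketch with hypothetical numbers (the function name and the sample values are ours, for illustration only):

```python
from statistics import mean

def adjusted_rtt(app_rtts_ms, ping_rtts_ms):
    """Subtract the raw TCP round-trip baseline from the measured
    application round trips, isolating our protocol's own overhead."""
    return mean(app_rtts_ms) - mean(ping_rtts_ms)

# Hypothetical: ~2.5 ms average app round trip, ~1.0 ms raw TCP ping
overhead = adjusted_rtt([2.4, 2.5, 2.6], [0.9, 1.0, 1.1])
print(round(overhead, 2))  # 1.5 -> ms attributable to our protocol stack
```

The point is that the 1.0 ms the network itself costs isn’t our problem to optimize; only the remainder is.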
We’re working on benchmarking visualization of some relatively large datasets. But to look at a big dataset, first you have to load it. It’s no fair penalizing the rendering package for the sins of our data format, or the speed of a network disk. So we have one set of benchmarks to look at data loading, then another for rendering, assuming all the data is already in memory. In truth, a final solution may be able to do better than the simple sum of these numbers, by overlapping rendering and I/O, but measuring that involves a lot more complexity, so we start with something simple.
Measure For Long Enough To Notice
Machines have precise timers on them these days, yes. And sometimes microseconds matter. (Surprisingly often in some of the places we work.) But your CPU is a noisy place — we’ll get to that in a minute. Measuring one call to your function is like trying to fit someone for a suit as they jog past you. Run it a few thousand times. Or don’t even measure just your function — measure the larger service call it’s in aid of. Measure at least a few milliseconds, if not longer.
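One common way to put this into practice is to repeat the call until a minimum wall-clock duration has elapsed, then report the average per call. A minimal sketch (the helper name and the 5 ms threshold are our own choices):

```python
import time

def time_per_call(fn, min_duration_s=0.005):
    """Repeat fn() until at least min_duration_s has elapsed, then
    report the average time per call. Far less noisy than timing a
    single invocation against a microsecond-resolution clock."""
    calls = 0
    start = time.perf_counter()
    while True:
        fn()
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration_s:
            return elapsed / calls

per_call = time_per_call(lambda: sum(range(100)))
print(per_call > 0)  # True
```

This is essentially what tools like Python’s `timeit` module do for you; use the standard tool where one exists.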
Measure in a Quiet Place
Your team is full of testing zealots, and every commit to your CI server runs hundreds or thousands of great unit tests for correctness. In fact, a lot of those tests probably run on the same machine that’s doing your build. In parallel. As far as correctness goes, that’s fine. But that test you wrote after carefully considering caching, the storage hierarchy, JIT pauses, and everything else? Now you’re trying to measure a teaspoon of water in a thunderstorm. That build server is probably opening and closing hundreds of files a second, dealing with thousands of network packets, and spinning CPUs like mad.
You need to find a way to mark your performance tests as such. (Consider, for instance, an NUnit Category.) And then run them in a quiet place so they have some repeatability. You’ll probably have to set aside a dedicated machine for them. (Physical, if you can get it. Who knows what else is going on on that oversubscribed virtual machine.) And they’ll have to run one at a time. With careful bounds. Set up a script so that a network proxy always simulates 5% packet loss, 250ms round-trip time, and 5Mbps bandwidth. Always run it on a 2.5GHz CPU. Configure the machine to only use 4GB of RAM. Whatever it takes. Otherwise you’re just measuring everything else going on in the system.
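In the .NET world that marking might be an NUnit Category; in Python, pytest’s marker mechanism plays the same role. A sketch of the idea (the `performance` marker name is our own convention, not a pytest built-in):

```python
import pytest

# Registered in pytest.ini so the runner knows the marker:
#   [pytest]
#   markers =
#       performance: long-running benchmarks; run on a quiet machine
@pytest.mark.performance
def test_query_latency():
    ...  # the carefully isolated benchmark body goes here

# CI can then select or exclude them:
#   pytest -m performance        (only the benchmarks)
#   pytest -m "not performance"  (everything else, on the busy build box)
```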
Performance is a Probability Distribution
Even with the best of preparation, it’s common for the same test on the same data to yield performance numbers that differ widely — 10-100% is common, especially on virtualized hardware. It may also be unreasonably difficult to always test on the same hardware. Instead of having a performance test “fail” because it’s too slow, log one or more figures into a central database, along with the machine configuration and information about the dataset being tested. Then monitor this database for any test that jumps out of a band you determine empirically — perhaps one or two standard deviations.
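The band check itself is a few lines. A sketch, assuming you can pull the recent history for a given test/machine/dataset combination out of that database (the function name and sample numbers are ours):

```python
from statistics import mean, stdev

def out_of_band(history_ms, new_result_ms, n_sigma=2.0):
    """Flag a result that falls outside mean +/- n_sigma standard
    deviations of this test's recorded history."""
    mu, sigma = mean(history_ms), stdev(history_ms)
    return abs(new_result_ms - mu) > n_sigma * sigma

history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
print(out_of_band(history, 10.1))  # False: within the band
print(out_of_band(history, 14.0))  # True: investigate this run
```

A flagged run isn’t automatically a regression — it’s a prompt to go look, which is the right posture given how noisy the underlying numbers are.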
Depending on which part of your system you’re measuring, you may want to look at more than just an average number. If your average query time for a typical user request is 10ms, but 5% of them take 4000ms, this might not be okay. In this case, you may want to calculate a distribution of results, say the P10/P50/P90 of the last 10 days of results, and judge based on that.
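The 10ms-average / 4000ms-tail scenario above is easy to reproduce. A small dependency-free percentile sketch (nearest-rank style; the helper and the synthetic data are ours for illustration):

```python
def percentile(samples, p):
    """p-th percentile by nearest rank: small, dependency-free sketch."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical query times: 95% fast, a slow 5% tail
times_ms = [10] * 95 + [4000] * 5
p10, p50, p90 = (percentile(times_ms, p) for p in (10, 50, 90))
print(p10, p50, p90)  # 10 10 10 -> the tail is invisible up to P90
print(percentile(times_ms, 99))  # 4000 -> only the high percentiles see it
```

The average of that dataset is about 209ms, which describes no request anyone actually made; the percentiles tell the real story, so pick the ones that match the experience you care about.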