Little Performance Explorations: C++
Sebastian Good

This the third post in a series, which started with a description of the problem, and continued with an F# benchmark.

After achieving mediocre results trying my benchmark in F#, I figured it was time to write some C++ to see how fast it could be without doing anything really exotic. Don’t get me wrong — the F# code was quick to write, and performed within about a factor of 5 of what I thought could be done without extraordinary effort. In a lot of applications, that’s when you stop. But, once you start benchmarking, it’s hard to stop. You never know when the next tweak will earn you 10%, or 50%. It’s like a slot machine. But I digress.

Writing the benchmark — considering the heart of it is purloined from C — is trivial.

[sourcecode language="c"] #include <iostream> #include <fstream> #include <chrono> #include <arpa/inet.h> using namespace std; typedef std::chrono::high_resolution_clock Clock; float toIeee(uint32_t ibm) { ibm = ntohl(ibm); uint32_t fr; /* fraction */ int32_t exp; /* exponent */ int32_t sgn; /* sign */ /* split into sign, exponent, and fraction */ fr = ibm; sgn = (int32_t)(fr >> 31); /* save sign */ fr <<= 1; /* shift sign out */ exp = (int32_t)(fr >> 25); /* save exponent */ fr <<= 7; /* shift exponent out */ if (fr == 0) { /* short-circuit for zero */ exp = 0; goto done; } /* adjust exponent from base 16 offset 64 radix point before first digit to base 2 offset 127 radix point after first digit */ /* (exp - 64) * 4 + 127 - 1 == exp * 4 - 256 + 126 == (exp << 2) - 130 */ exp = (exp << 2) - 130; /* (re)normalize */ while (fr < 0x80000000) { /* 3 times max for normalized input */ --exp; fr <<= 1; } if (exp <= 0) { /* underflow */ if (exp < -24) /* complete underflow - return properly signed zero */ fr = 0; else /* partial underflow - return denormalized number */ fr >>= -exp; exp = 0; } else if (exp >= 255) { /* overflow - return infinity */ fr = 0; exp = 255; } else { /* just a plain old number - remove the assumed high bit */ fr <<= 1; } done: /* put the pieces back together and return it */ uint32_t ieee = ((uint32_t)(fr >> 9)) | ((uint32_t)(exp << 23)) | ((uint32_t)(sgn << 31)); float result; memcpy(&result, &ieee, sizeof(float)); return result; } int main(int argc, char **argv) {  size_t fileHeaderLength = 3200 + 400;  size_t nTraces = 64860;  size_t headerLength = 240;  size_t nSamples = 1501;  size_t traceLength = headerLength + (nSamples * 4);  size_t fileLength = fileHeaderLength + (nTraces * traceLength);  float* samples = new float[nSamples];  for(size_t s = 0; s < nSamples; ++s) { samples[s] = 0.0; }  size_t bufferLength = fileHeaderLength + (nTraces * traceLength);  uint8_t* buffer = new uint8_t[bufferLength];  ifstream s("filt_mig.sgy", ifstream::binary);  s.read(reinterpret_cast<char*>(buffer), bufferLength);  auto t1 = Clock::now();  for (size_t n = 0; n < nTraces; n++)  {    size_t traceOffset = fileHeaderLength + (n * traceLength);    for (size_t s = 0; s < nSamples; ++s)    {      uint32_t ibmSample = *(reinterpret_cast<uint32_t*>(buffer + traceOffset + s*4));      samples[s] += toIeee(ibmSample);    }  }  auto t2 = Clock::now();  // if you don't use samples, the optimizer  // is smart enough to not run it...  cout << samples[4] << endl;  cout << chrono::duration_cast<chrono::microseconds>(t2 - t1).count() / 1000.0 << "ms" << endl;  delete[] buffer;  delete[] samples; } [/sourcecode]

Compile it at maximum optimization and run it.

[sourcecode language="bash"] $ clang++ -std=c++11 -O3 bench.cpp $ ./a.out 463.474ms [/sourcecode]

OK, that’s a nice result! We’re only 40% slower than our totally arbitrary target of 1/10th of memory bus speed. And 1.6 times faster than the F# code we had to brute force. I can imagine that some more effort — or an even cleverer compiler like the Intel ones — would yield better results. One day perhaps it’ll be worth twisting the code around to achieve some sort of vectorization. But it was pretty satisfying that relatively little effort was expended to achieve it.

Had my benchmark involved something with a little structure, like complex numbers, it would have gotten interesting quickly. Define my own structures? Array of structures? Structures of arrays? Copy constructors? Move constructors? Immutable structures? Those choice get verbose, fast, in C++. Having written enough C++ code, I’m aware I managed to pick a benchmark that kept things really simple. No contortions were required to get good performance. And the nastiness of violating type safety to convert our magic integer bit pattern into a float? That turns out to be surprisingly interesting, as detailed pretty well in this StackOverflow post. The standard-compliant way to do it so as not to introduce type punning issues, is with memcpy, which fortunately compilers understand well enough to generate good assembly for.

So, we have a safe high level language (F#) coming in at 750ms or so — with some frustrating work — and C++ coming it at 460ms, with basically no effort.

Up next: Julia.

Postscript

Once you start benchmarking, it’s hard to stop. A little additional work a few days after I originally wrote this article resulted in a better time yet. The work was easy enough to do: loop unrolling, a few short circuits. And using gcc 5.0 got an even better result than clang 5.0 on this OS X Mavericks machine. Final runtime was extremely respectable. Intel compilers didn’t do any better than LLVM in this case, interestingly.

[sourcecode language="bash"]
$ ./a.out 
388.668ms 
[/sourcecode]
RECENT POSTS FROM
THIS AUTHOR