Little Performance Explorations: ISPC
Sebastian Good

(This post is fifth in a series, starting with a statement of the problem, and continuing with implementations in F#C++, and Julia)

After having such fun with Julia, it still seemed a shame that their compiler wasn’t able to automatically vectorize our code — even when we begged it to with the @simd macro. How much work would it take to use a SPMD compiler? In fact it was a post on the julia-users mailing list that led me to the Intel SPMD Program Compiler (ISPC). It’s a bit CUDA-like… extensions to C that encourage vectorized code and take care of low-level details for you. How hard would it be to get a speedup on our silly little benchmark?

The documentation for ISPC is short and sweet. It doesn’t take but a moment to read it. I converted the heart of my code to be ISPC compliant in an hour, and spent another hour messing with options to get the fastest I could. I suspect a fire-breathing vector programming wizard could do better, but it was promising for such a short amount of work. The best part was that the ISPCcompiler generates simple C-callable library code for immediate linkage into your larger program.

The first thing we have to do is separate the bit we’re going to vectorize from the rest of the program. For our benchmark, that means the bit that converts the traces from IBM format to IEEE format. We could do the whole file in one function, or each trace. I decided to start with just managing one trace at a time. ISPC creates object files linkable with a extern "C" calling convention, so we just arrange for our C++ driver to say this

[sourcecode language="C"] extern "C" {  extern void convert_samples(size_t nSamples, float samples[], uint32_t traceWords[]); } // ... // inside the main benchmark // ... uint32_t* traceWords = reinterpret_cast<uint32_t*>(buffer + fileHeaderLength + headerLength); for(size_t nTrace = 0; nTrace < nTraces; ++nTrace) {  convert_samples(nSamples, samples, traceWords);  traceWords += traceLength / sizeof(uint32_t); } [/sourcecode]

(Note: I reused the trick from Julia of just looking at our traces as sequences of uint32_tvalues, rather than a byte stream, to avoid lots of casting. It made the base C++ benchmark a little faster on my machine, at about 440ms.)

The ispc code looks very similar to the original C, but there are a few notable changes. First, ISPC is not happy with most uses of goto, so I had to remove it as I did with the Julia and F# versions of this code by lifting the relevant portion into a function. (Interestingly, making this change in the C++ version slowed it down.) Secondly, there is no ISPC standard library like ntohlto swap the endianness of our input stream (so far as I could find), so I had to add that with bit twiddling operations. (I bet with a Cherry Coke and a few hours of learning a lot more about shuffle instructions, one could do better.) As a very minor point, ISPC defines its own types for known-integer sizes, going with int16, int, and int64, instead of <stdint.h>‘s int32_t, etc.

The largest change was marking variables as uniform — telling ispc that their value does not change during the parallel execution of the function. You can read up on what this means. Recall that the objective is to get a body of code to run across every lane of the vector registers in your CPU at once, e.g. with every + in your code turning into 8 simultaneous additions. uniform helps provide guarantees to the compiler about which of your variables will be the same across all those lanes, and which will be changing.

With those changes figured out the new benchmark core looks like this.

[sourcecode language="C"] typedef int int32_t; typedef unsigned int uint32_t; uint32_t bswap(uint32_t u) {  return ((u&amp;0xFF)&lt;&lt;24) | (((u&gt;&gt;8)&amp;0xFF)&lt;&lt;16) | (((u&gt;&gt;16)&amp;0xFF)&lt;&lt;8) | (((u&gt;&gt;24)&amp;0xFF)&lt;&lt;0); } // need to avoid the goto float ieeeOfPieces(uint32_t fr, int32_t ex, int32_t sgn) {  uint32_t ieee = ((uint32_t)(fr &gt;&gt; 9)) | ((uint32_t)(ex &lt;&lt; 23)) | ((uint32_t)(sgn &lt;&lt; 31));  // this is a great ispc built-in; elegant like Julia's reinterpret  return floatbits(ieee); } float toIeee(uint32_t ibm) {  ibm = bswap(ibm);  uint32_t fr; /* fraction */  int32_t exp; /* exponent */  int32_t sgn; /* sign */  /* split into sign, exponent, and fraction */  fr = ibm;  sgn = (int32_t)(fr &gt;&gt; 31); /* save sign */  fr &lt;&lt;= 1; /* shift sign out */  exp = (int32_t)(fr &gt;&gt; 25); /* save exponent */  fr &lt;&lt;= 7; /* shift exponent out */  if (fr == 0)  { /* short-circuit for zero */    exp = 0;    return ieeeOfPieces(fr, exp, sgn);  }  /* adjust exponent from base 16 offset 64 radix point before first digit to base 2 offset 127 radix point after first digit */  /* (exp - 64) * 4 + 127 - 1 == exp * 4 - 256 + 126 == (exp &lt;&lt; 2) - 130 */  exp = (exp &lt;&lt; 2) - 130;  /* (re)normalize */  while (fr &lt; 0x80000000)  { /* 3 times max for normalized input */    --exp;    fr &lt;&lt;= 1;  }   if (exp &lt;= 0)   { /* underflow */    if (exp &lt; -24) /* complete underflow - return properly signed zero */      fr = 0;    else /* partial underflow - return denormalized number */      fr &gt;&gt;= -exp;      exp = 0;  }  else if (exp &gt;= 255)  { /* overflow - return infinity */    fr = 0;    exp = 255;  }  else  { /* just a plain old number - remove the assumed high bit */    fr &lt;&lt;= 1;  }  /* put the pieces back together and return it */  return ieeeOfPieces(fr, exp, sgn); } export void convert_samples(  uniform size_t nSamples,  uniform float samples[],  uniform uint32_t traceWords[]) {  for (size_t s = 0; s &lt; nSamples; ++nSamples)  {    samples[s] += toIeee(traceWords[s]);  } } [/sourcecode]

So how does it do?

[sourcecode language="bash"] $ ./bench_ispc 2699.15ms [/sourcecode]

Oh. Terrible compared to the 440ms we’re getting from clang++. That’s odd, since they’re both based on llvm and we’ve used roughly the same code. As usual, it’s a good idea to pay attention to the compiler warnings:

[sourcecode language="bash"] $ make bench_ispc clang++ -O3 -std=c++11 -Wall  -c bench.cpp ispc -O3 convert_samples.ispc -o convert_samples.ispc.o Warning: No --target specified on command-line. Using default system target "avx1.1-i32x8". convert_samples.ispc:76:3: Warning:  Undefined behavior: all program instances are writing to the same location!  samples[s] += toIeee(traceWords[s]); convert_samples.ispc:76:3: Performance Warning: Scatter required to store value.  samples[s] += toIeee(traceWords[s]); clang++ -O3 -std=c++11 -Wall  bench.o convert_samples.ispc.o -o bench_ispc [/sourcecode]

We’re violating the assumptions of the compiler by writing to sample[s] like this. There’s no telling what sort of bizarre instructions it’s putting together. The naive copy/paste is wrong. You need to understand which of your variables are being vectorized and which aren’t. ISPC has a special construct for looping: the foreach loop. There are a lot of knobs to tweak for more complex code, but for our for-loop across all the samples in a trace, it couldn’t be simpler.

[sourcecode language="C"] export void convert_samples(uniform size_t nSamples, uniform float samples[], uniform uint32_t traceWords[]) {  foreach (s = 0 ... nSamples)  {    samples[s] += toIeee(traceWords[s]);  } } [/sourcecode]

It compiles with no warnings, and executes…

[sourcecode language="bash"] $ ./bench_ispc 240.17ms [/sourcecode]

… a LOT faster. Almost twice as fast as the C++ version. Well, that was easy? Not content to stop there, I made a small tour through the performance guide to see if there were any hints I was missing without dramatically re-plumbing the program.

For some reason, the first thing I tried was mucking with the width of the vectors (ispc calls them gangs) that the compiler would use. By default it generates code for 8 32-bit wide values (avx1.1-i32x8), as this is the actual hardware width of the registers. In some cases, oversubscribing the lanes can result in faster code. I tried avx1.1-i32x16 and got a noticeable 10% bump, bringing us to about twice as fast as the C++ code.

[sourcecode language="bash"] $ ./bench_ispc 215.09ms [/sourcecode]

They say this can happen when your code doesn’t produce a lot of register pressure and so oversubscribing can make use of the extra registers on the chip. This seems reasonable: the original toIeee can probably be scheduled into 4 or 5 registers. Other code may not work quite the same. Or perhaps it just slurps from memory more efficiently. Without a more detailed profiler, I’m just guessing.

Then I included inline keywords (which made no difference) and converted my if and whilestatements to cif and cwhile statements (which did). Wait, what? Recall that when executing code in the vector/SPMD environment, you’re best off if all your lanes take the same path. If your code is data dependent, e.g. only double numbers which are odd, then you force the CPU to do different things in different lanes, which comes at an overhead. The CPU can still manage this through the use of a mask, which allows the vector processor to selectively execute your operation (e.g. an add) in some lanes and not others.

It turns out ispc will generate slightly more efficient code if you can promise it that most of the time all lanes in the vector will make the same choice at an if statement. That’s impossible for the compiler to guess, and I guess branch prediction in vector mask units is not a high art yet, so you the programmer get to tell it. It turns out that for seismic trace data being converted this way, most of those units will indeed be making the same choices at the same places. Look at the ifand while statements in that code: they have to do with distinguishing zeros and non-number floats (e.g. infinities). Well behaved seismic shouldn’t have infinities and NaNs and what have you, so we can assume those tests will essentially never pass. Zero shows up a lot in seismic, but in streaks, where an entire trace or section of a trace is blanked out. Inside real trace data it’s pretty rare: sample values are always dancing around zero but rarely hit it exactly on the head. So by suggesting to ispc that these branches usually take the same path, we get a nice speed bump of about 10%.

A few more bits of tinkering not worth getting into here — unrolling a loop, changing the algorithm slightly to return plain zeros instead of signed zeros, using a 2 dimensional foreach loop over all traces and all samples, vectorizing the common case that all samples are well-formed zeros — got us another 40%, down to a very respectable.

[sourcecode language="bash"] $ ./bench_ispc 130.499ms [/sourcecode]

Amazing. Yes, I spent half an hour fiddling with some optimizations, but nothing out of the ordinary for anyone trying to optimize a hot loop. ISPC never really got in the way, and I had vectorized speedups the whole way.

With Julia making it so easy to call C code, it seems like Julia plus ispc might be a highly practical method to getting potentially abstract, yet extremely efficient numeric code generated. (That is, until the Julia geniuses figure out how to enable this sort of programming using disgustingly intelligent macros, or LLVM can do the same thing.)

The downsides of this approach are the need to add a whole new compiler toolchain to your environment (though binary versions for major platforms are easily downloadable) and to muck with your core loops to make them ISPC friendly. And since some compilation options are very specific to your algorithm (e.g. 16 lanes instead of 8!), you may need to compile each of your ‘tight loops’ separately to get full advantage. You may need to do it multiple times for multiple platforms!

Conclusions

Squeezing maximum performance out of your CPU is a dark art. I’m still fairly confident someone smarter than me could get this code to run a bit faster, as we’re still running 3.8 times slower than the maximum memory bandwidth. (cat filt_mig.sgy > /dev/null runs in 66ms, or half of memory bandwidth.)

But the tools for getting this performance are increasingly accessible. I was disappointed in F# and the .NET JIT at not being able to achieve the performance of LLVM for simple numeric code, and impressed at Julia for matching it. Granted, .NET was only twice as slow or so as C++ and that might be good enough for a lot of our code. And yes, when Julia was slow it was *much* slower. It’s not clear cut. Writing the C++ and ispc was easy, but only because we were in the core number-crunching parts of the code — memory management and multi-threading were nowhere to be found. We never touched multi-core programming — all of this happened on one core. Higher level languages can make that sort of parallelism a little easier.

But to me, the biggest takeaway was to help set some intuition about what’s possible. Speed doesn’t always matter, but I think every programmer should know that surfing through 440MB of data and doing a few dozen math operations per integer can take 130ms on a single core. And that’s without exotic data structures or fancy tricks. High level languages and libraries like the TPL make it too easy to say “oh, it’s too slow, let’s add some threads”. First do a level set and figure out how far you are from theoretical maxima and ask whether you can use your CPU more efficiently. You might be amazed at the result.

p.s.

ISPC has a simple plug-and-play tasking model for using multiple cores really easily. Another 30 minutes of coding to break my conversion routine into taskable pieces resulted in all 4 cores being used, thereby saturating my memory bus. Yes, it’s still imperative C-like coding, but it was easy. (Yes, I also spent 30 minutes finding nasty tricks to get rid of all the if statements in my code. It turns out there’s a way if you have a way of counting leading zeros in a word, an intrinsic on Intel chips!. But I digress.

At this point, we’re no longer playing fair with the other examples in this series — we didn’t go multi-core on those — but that’s kind of the point. For small, tight loops, ISPC is so easy to use it’s not playing fair! The final result?

[sourcecode language="bash"] $ ./bench_ispc 27.729ms [/sourcecode]

That’s 25% faster than I thought the memory bus could sustain. A pleasant surprise. Not every loop is this ‘easy’ to optimize — this one was embarrassingly parallel and involved essentially all integer math. But I’ll admit that when I started this exercise I didn’t think near half a gigabyte of data could be processed in under 30ms.

Sometimes I think, “well, Java/.NET/whatever is only 2 times slower than C++”. And that’s fine for most of your program. In fact, most of your program is probably pointer chasing and calling 3rd party libraries anyway. For various reasons, your Java program might be faster than an equivalent C++ program. But if your .NET or JVM program is 8 times slower than vectorized code? And your naive C++ code is 4 times slower? For your inner loops, can you really afford anything else?

RECENT POSTS FROM
THIS AUTHOR