Little Performance Explorations: F#
Sebastian Good

This post is the second in a series, starting here.

To read a SEG-Y file full of seismic traces and do anything interesting with it, you usually have to convert IBM/370 floating point numbers to IEEE representation. As an interesting exercise, I decided to write the conversion in several languages to see how easy it was to achieve good performance. (Quite arbitrarily, I decided “good” meant “as fast as a naive implementation in C++”.)

I’m working with the 32-bit IBM floating point format which looks roughly like this

IBM.Float32

And will convert it to the 32-bit IEEE floating point format, which is similar.

IEEE.Float32

.

There are a few things to worry about on the way, but fortunately the Amazing Electric Internet has preserved a lovely piece of volunteer C code that explains how it’s done. Basically you pull out the sign, exponent and fractional parts (svn, exp, fr in the code below), fuss with them a bit to deal with the fact they have different conventions and sizes, and put them back together.

To make sure we don’t confuse our IBM and IEEE floats, I added a type

[sourcecode language="fsharp"] type IbmFloat32 =  struct    val Bits: uint32  end  new (b:uint32) = { Bits = b } [/sourcecode]

I’ve never been very satisfied with just a type alias like type IbmFloat32 = uint32 as they don’t really stop you from hurting yourself. I know enough to make sure this type is a value type so I’m not heap allocating hundreds of millions of them.

With that out of the way, here is a simple F# transliteration of the translation from IBM to IEEE, getting rid of a goto by defining a local function, but keeping the mutable implementation largely intact. (The ability of a functional language like F# to admit mutable local variables by exception is in this case a great strength: I don’t have to turn the nice algorithm I’ve already found on its head to use it.)

[sourcecode language="fsharp"] let ieeeOfPieces fr exp sgn =  uint32(fr >>> 9) ||| uint32(exp <<< 23) ||| sgn  |> ToFloat32 let ToIeee (ibm:IbmFloat32) : float32 =  let mutable fr = ibm.Bits  let sgn = fr &&& 0x80000000u // save sign  fr <- fr <<< 1 // shift sign out  let mutable exp = int(fr >>> 25) // save exponent  fr <- fr <<< 7 // shift exponent out  if fr = 0u then // short-circuit, return properly signed zero     ieeeOfPieces 0u 0 sgn  else    // adjust exponent from base 16 offset 64 radix point    // before first digit to base 2 offset 127 radix point    // after first digit    // (exp - 64) * 4 + 127 - 1    let mutable exp = (exp - 64) * 4 + 127 - 1    // (re)normalize, 3 times max for normalized input */    while fr < 0x80000000u do exp <- exp - 1; fr <- fr <<< 1    if exp <= 0 then // underflow      let fr =        if exp < -24 then          0u // complete underflow, return properly signed zero        else          fr >>> -exp // partial underflow, return denormalized number      ieeeOfPieces fr 0 sgn    elif exp >= 255 then // overflow - return infinity      ieeeOfPieces 0u 255 sgn    else // just a plain old number - remove the assumed high bit      ieeeOfPieces (fr <<< 1) exp sgn [/sourcecode]

The benchmark driver itself is boring; just load a sample SEGY file into memory and time how long it takes to process all the samples out of the file. (We don’t want to time disk I/O.) The benchmark contains some hardcoded values which we’d normally obtain by parsing file headers, not something that’s interesting in this context. The file I converted is freely available, and is about 405MB.

[sourcecode language="fsharp"] let b v shift = (int v) <<< (8 * shift) let readBigEndianInt32 (a:byte[]) i =  b a.[i] 3 |||  b a.[i+1] 2 |||  b a.[i+2] 1 |||  b a.[i+3] 0  |> uint32 let time f =  let sw = System.Diagnostics.Stopwatch()  sw.Start()  f()  sw.Stop()  sw.Elapsed.TotalMilliseconds let bench () =  let fileHeaderLength = 3200 + 400 // SEG-Y text and binary headers  let nTraces = 64860 // number of traces in the file  let headerLength = 240 // each SEGY trace has a 240 byte header at the beginning, which we ignore  let nSamples = 1501 // Each trace has 1501 4 byte samples, each an IBM floating point number  let traceLength = headerLength + (nSamples * 4) // = 6244 length in bytes of each trace, including its header  let fileLength = fileHeaderLength + (nTraces * traceLength)  let buffer = File.ReadAllBytes "filt_mig.sgy"  let mapi' f (a:'T[]) =    for i in 0..a.Length-1 do a.[i] <- f i a.[i]    a  let convert() =    seq { for n in 0 .. nTraces-1 -> fileHeaderLength + n * traceLength + headerLength }    |> Seq.fold      (fun (samples:float32[]) traceOffset ->        samples |> mapi' (fun i s -> s + ToIeee(IbmFloat32(readBigEndianInt32 buffer (traceOffset + i*4)))))      (Array.zeroCreate nSamples)    |> ignore  printf "%f ms" (time convert)  0 [/sourcecode]

Gathering statistics is naturally expressed as a fold — we take a zero trace (Array.zeroCreate nSamples) and fold all the traces into it by summing them. While I didn’t want to stray too far into premature optimization, it seemed better to fold the sum in place instead of generating a new array on each trace, so I whipped up a quick function — mapi' — to map the array in place.

You might have written it differently, but it’s a fairly natural way to write an accumulation in F#. This isn’t exactly a complicated algorithm, but it’s easy to throw together, aided by the type system. Besides, most of the work should really be in the IEEE conversion.

About which, a few notes. F# is a type safe language. There’s no clear way to convert a 32-bit pattern from the uint32 we’ve been manipulating as to the float32 we want it to be. The bright lights at Microsoft saw fit to give us a BitConverter.DoubleToInt64Bits method (and vice versa), but not BitConverter.SingleToInt32Bits. So I wrote my own with an ugly hack.

[sourcecode language="fsharp"] open System.Runtime.InteropServices [<StructLayout(LayoutKind.Explicit)>] type IntFloat =  struct    [<FieldOffset(0)>] val mutable f: float32    [<FieldOffset(0)>] val mutable i: uint32  end let ToFloat32 (i:uint32) = let mutable x = IntFloat() in x.i <- i; x.f [/sourcecode]

How did it do? Well, with .NET there are several compiler options. I won’t get into the details, but it turned out the one that gave me the fastest possible results after optimizations I’ll outline below was mono, running with the llvm code generation option. Though the standard .NET x64 CLR was just about as good.

Without any optimization work, I got

[sourcecode language="bash"] $ mono --llvm FSharp.exe 2355.363700 ms [/sourcecode]

That’s pretty disappointing compared to our nominal target of 320ms (1/10th of the memory bus speed). We’re off by a factor of seven.

As a side note, it’s a shame I had to write a time function in the first place. It’s not hard, but it seems like most environments provide one.

And as another side note, yes, this benchmark violates one of the rules of benchmarking in a JIT language — it doesn’t execute all the code once minimally to make sure functions are compiled. That probably exaggerates the cost of the function by a few milliseconds. But given the time taken, that’s very much in the noise.

The first thing I tried was an F# trick meant primarily to allow polymorphic code to be dispatched without polymorphic overhead: inline methods. (In fact contrasting F# inline methods, C++ templates, and Julia multiple dispatch would make for a fascinating blog series. For about 9 people. But I digress.) But on the theory that I should encourage the compiler to inline some of these methods, I started scattering inline keywords about on the major functions: ToFloat32, ieeeOfPieces, and ToIeee. (You can imagine how simple this benchmark should be when it’s machine code… inline enough stuff and it’ll mostly cancel away into simple while loops!)

[sourcecode language="bash"] $ mono --llvm FSharp.exe 1825.163100 ms [/sourcecode]

25% better. That’s not much to write home about. It’s also disappointing to have to scatter inline about, as it’s not meant for performance, but for static polymorphism.

Disappointing. Where could we go from here? We could get access to a math library with strong support for vector operations — that would help with the sample folding, adding whole 1500 element arrays at once. But now we’re rearranging our thought process to accommodate the compiler. How about we just get rid of all the fancy folding and just do the dumb, brute force forloop in the heart of the code (leaving the inlines in place).

[sourcecode language="fsharp"] let convert() =  for n in 0 .. nTraces-1 do    let traceOffset = fileHeaderLength + n * traceLength + headerLength    for s in 0 .. nSamples-1 do      samples.[s] <- sampl [/sourcecode][sourcecode language="bash"] $ mono --llvm FSharp.exe 748.274600 ms [/sourcecode]

Hey, that’s much better! Now we’re just off our target by a little over a factor of 2. And there are plenty of people who would argue it’s easier to read, too.

But it’s not an encouraging result. The Seq.fold and mapi' business is a little over-fancy in this case, but it’s the sort of level of abstraction you’re typically after in F#. If the compiler can’t reduce the first code I wrote down to the later code, and it costs me a factor of 2.2, it really destroys my incentive to use these abstractions in more complicated situations. I’ll always be wondering when I’ll suddenly lose the next factor of two.

With enough work, I’m sure that higher level constructs could be coaxed into similar performance. In fact, I’d love for some F# experts to help critique this simple-minded attempt. Would this mean F# is still a good place for us to be, using high level constructs for program coordination, I/O and the like, but able to drop into an imperative mode (in the same language!) for code that’s performance critical?

That depends on how fast the C++ version is. We’ll find out next week.

Spoiler: it’s faster.

A response

A few people in the F# community took some interest in this post. First Ryan Riley wondered what was going on, then Lincoln Atkinson took a look and reproduced similar results. The nice thing was he identified the source of much of the surprising slowness as excess memory allocation. I’ll hypothesize this could be blamed on the F# compiler itself, for allocating closure records instead of making calls to static methods, or on the JIT, for not doing an escape analysis on these closures. Either way, once you make your code a little more imperative, you get something like the result I showed above.

There is still a substantial gap between this result and the result we’ll find with other environments, and my guess is this is a code generation issue, i.e. instruction selection and scheduling, but I’m not an expert in this area either!

RECENT POSTS FROM
THIS AUTHOR