Little Performance Explorations: Julia
Sebastian Good

(This post is the third in a series, starting with a description of the problem, and continuing with F# and C++ benchmarks of the same program.)

Julia is a new language based on LISP, built with scientific computing explicitly in mind. Perhaps because of this, I felt right at home from the start. I tried to do some idiomatic programming for this version of the benchmark, but because Julia is new, and because I’m new to it, it’s hard to know whether I succeeded or not.

The first Julia-specific feature I felt I had to take advantage of was the ability to define bits types, that is, custom types based on a bit-field of ‘arbitrary’ length. (‘Arbitrary’ means multiple of 8, but the manual suggests this might be changed in future.) When working with a new floating point representation, this seems perfect. I can keep my types straight. In other languages you may have to remember whether your uint32 was actually a uint32 or an IBM float. After writing this:

[sourcecode language="R"] bitstype 32 IbmFloat32 [/sourcecode]

Julia will keep me from confusing a Uint32 with a IbmFloat32.

Now how do I create one in the first place? Here again Julia charms, recognizing that while being type-safe is important, sometimes you just have to do some bit manipulation. Unlike F# which made us define an awkward overlapping union, or C++ (and C#) which made us abuse pointer dereferencing, Julia cuts right to the point and gives us a function called reinterpret, which does just what it says: reinterprets a block of memory as a different type.

Julia also embraces multiple dispatch, allowing you to define a function over and over and over again, with the appropriate specializations being figured out at run time — or compile time if the compiler is smart enough. The idiomatic way to allow conversions from one type to another is just to define another version of the function convert that takes and returns the right arguments! (The standard library ships with over 100 already.) So the core of our benchmark code looks really simple. With its functional mindset, the Julia function looks nearly exactly like the F# function.

(forgive the source code presentation below; our blog engine doesn’t have a built-in syntax formatter for Julia code.)

[sourcecode language="R"] import Base.convert function convert(::Type{Float32}, ibm::IbmFloat32)  ieeeOfPieces(fr, exp, sgn2) = reinterpret(Float32, uint32(fr >>> 9 | exp << 23 | sgn))  local fr = ntoh(reinterpret(Uint32, ibm))  local sgn = fr & 0x8000000 # save sign  fr <<= 1 # shift sign out  local exp= fr >>> 25 # save exponent  fr <<= 7 # shift exponent out  if fr == 0 # short-circuit for zero    ieeeOfPieces(0, 0, sgn)  else    # adjust exponent from base 16 offset 64 radix point before first digit to base 2 offset 127 radix point after first digit    # (exp - 64) * 4 + 127 - 1    exp = (exp - 64) * 4 + 127 - 1    while (fr < 0x80000000) != 0 # (re)normalize, 3 times max for normalized input      exp -= 1      fr <<= 1    end    if exp <= 0      # complete underflow, return properly signed zero      # OR partial underflow, return denormalized number      fr = exp < -24 ? 0 : (fr >>> -exp)      ieeeOfPieces(fr, 0, sgn)    elseif exp >= 255 then # overflow - return infinity      ieeeOfPieces(0, 255, sgn)    else # just a plain old number - remove the assumed high bit      ieeeOfPieces(fr << 1, exp, sgn)    end  end end [/sourcecode]

Note there are very few type definitions, except where we’re explicitly reinterpreting data, or specifying which datatypes we’re converting between. This is how LISP works — everything is dynamically typed. And they make it fast how? We’ll see.

Here’s the benchmark driver — just done at top level in the REPL. It didn’t take too much reading of Julia documentation to see that while we could write this benchmark using the functional style of our first F# attempt, they’d much rather we write ‘devectorized’ code, i.e. lots of for loops. No problem.

[sourcecode language="R"] const bpath = "filt_mig.sgy" const fileHeaderLength = 3200 + 400 # SEG-Y text and binary headers const nTraces = 64860 # number of traces in the file const headerLength = 240 # each SEGY trace has a 240 byte header at the beginning, which we ignore const nSamples = 1501 # Each trace has 1501 4 byte samples, each an IBM floating point number const traceLength = headerLength + (nSamples * 4) # = 6244 length in bytes of each trace, including its header const fileLength = fileHeaderLength + (nTraces * traceLength) function readRaw(path)  local f = open(path)  buf = read(f, IbmFloat32, div(fileLength, 4))  close(f)  buf end buf = readRaw(bpath) function bench(buf)  samples = zeros(Float32, nSamples)  for i = 1:nTraces    traceOffset = div(3200 + 400 + (i * 240), 4)    for s = 1:nSamples      samples[s] += toIeee(buf[traceOffset + s])    end  end  samples end [/sourcecode]

Easy to read, easy to compile. How’s the execution time? Now this is fun. Julia recognizes that hammering away on benchmarks isn’t a marginal activity. It’s a core obsession of scientific programming (well, after you get the answer right, of course.) Julia ships with a @time macro that gives you some basics.

[sourcecode language="bash"] julia> @time(bench(buf)) elapsed time: 75.174229856 seconds (58272190556 bytes allocated, 39.09% gc time) [/sourcecode]

Brutally, impressively, shockingly slow. Any more adverbs we can think of? Dismayingly? What in the world is going on? The biggest hint to me is that this thing allocated nearly 58GB of memory, and spent 40% of its time garbage collecting it. It doesn’t take too much deep thinking to realize that a decent implementation shouldn’t have to — once the file has been loaded — allocate more than nSamples * 4 bytes of memory. For this dataset, that’s 6004 bytes.

Had the authors of Julia not emphasized so strongly that they could achieve C++ speeds, I would be sorely tempted to just walk away here! But I’m new to the language, it’s new, and they claim we can do much better.

I have no idea what led me to wonder about this, but it turns out the inline function ieeeOfPieces is trouble. A good compiler will certainly recognize that this function captures no variable from its lexical context, so is effectively just a private static function, but apparently Julia doesn’t quite get this right yet. (There’s an open issue about this.) Let’s pull this to the top level and see what happens.

[sourcecode language="R"] ieeeOfPieces(fr, exp, sgn) = reinterpret(Float32, uint32(fr >>> 9 | exp << 23 | sgn)) function convert(::Type{Float32}, ibm::IbmFloat32)  # instead of where it was here  local fr = ntoh(reinterpret(Uint32, ibm))  local sgn = fr & 0x8000000 # save sign  # ... [/sourcecode]


[sourcecode language="bash"] julia> @time(bench(buf)) elapsed time: 2.56220917 seconds (1013340728 bytes allocated, 23.21% gc time) [/sourcecode]

Wow. They weren’t kidding. OK, easy thing to fix, a little annoying in terms of programming hygiene. Scientific programming — especially when aided by a lovely REPL like Julia’s — is all about creating lots of little functions. When you make the one big function that uses them, it’s nice to declare them inline so they don’t clutter your namespace. Let’s spot them that one since they’re just getting started.

2.6 seconds is still slower than our too-expensive F# attempt, and it’s 8 times slower than our made-up target of 325ms. Not impressive. But we really haven’t spent any time helping Julia out with the types of our code. We just slung a bunch of code together, and the documentation is very clear that they can only generate super efficient machine code when they have enough information to figure out types statically. (With this in mind, perhaps 2.6 seconds is pretty impressive!)

But where did Julia lose the type thread in our code? I can think of at least two things we’re doing that can be confusing in the world of integer arithmetic. We’re working with 32 bit quantities when the native word size is 64 bits. Perhaps there are automatic promotions and demotions between Int32 and Int64, and the runtime has to box these values to keep track. Also, bit twiddling is notorious for making it hard to figure out unsigned versus signed arithmetic. Are we also wandering around between Uint32 and Int32?

I ran into these issues getting the F# program to type check properly, as F# doesn’t automatically promote integer types, requiring lots of explicit int32 and uint32 operators. I could have cheated and just put all those explicit type annotations in the Julia program, but I wanted to see how Julia felt like helping us. After all, it’s not really statically compiled. I don’t get fun squiggles denoting type mistakes as I type.

The primary culprit — as in many language that allow polymorphism — is variables being boxed because they can contain values of multiple types. Julia documentation writers call this “type instability”. It can occur easily when two successive mutations or two branches of an if statement assign the same variable multiple types of values. We have a few such places in our code. One tool that can help is the TypeCheck module. I went whole hog and used the code_typed function which tells us all the types involved in our expression. It’s a daunting output (several dozen long lines), but the important answer is in the beginning.

[sourcecode language="R"] julia> code_typed(toIeee) 1-element Array{Expr,1}: :($(Expr(:lambda, {:ibm}, {{:sgn,:exp,:fr,:_var0,:_var3,:_var1,:_var2,:_var4,:_var5,:_var6,:_var7},{ {:ibm,IbmFloat32,0}, {:sgn,Uint32,18}, {:exp,Any,2}, {:fr,Any,2}, {:_var0,(ASCIIString,Type{Uint32}),0}, {:_var3,(ASCIIString,Type{Uint32}),0},{:_var1,(ASCIIString,Type{Float32}),0},{:_var2,(ASCIIString,Type{Float32}),0},{:_var4,(ASCIIString,Type{Float32}),0},{:_var5,(ASCIIString,Type{Float32}),0},{:_var6,(ASCIIString,Type{Float32}),0},{:_var7,(ASCIIString,Type{Float32}),0}},{}}, :(begin  # none, line 2: [/sourcecode]

Long story short: see that while the variable sgn is constrained to be Uint32, exp and fr are not constrained at all. They are marked as Any, which means there will be lots of boxing and unboxing depending on which branches are taken in our conversion function.

One excellent example is in this piece of code, which appears entirely harmless

[sourcecode language="R"] fr = exp < -24 ? 0 : (fr >>> -exp) [/sourcecode]

The problem is that if fr is a Uint32, so too will be (fr >>> -exp), but literal 0 is a Int64! In most statically typed languages, this would trigger a compile error, or a silent an automatic shortening (since 0 can be shortened safely). But in Julia it means fr is now doomed to hold boxed values so we can tell whether we have a Uint32 or a Int64. That’s really wasteful. The solution is simple: make both sides of the if statement return the same type.

[sourcecode language="R"] fr = exp < -24 ? zero(Uint32) : (fr >>> -exp) [/sourcecode]

And so it goes for a few other places in the code. In fact, it’s much easier to constrain your variables at the point of declaration up front, as these type instabilities will be caught at run time. (Even better would have been to make the variables immutable, but that’s a story for another day, since we inherited a mutable algorithm.) See the ::Type annotations below. We know that frand sgn should be unsigned, and exp should be signed.

[sourcecode language="R"] local fr::Uint32 = ntoh(reinterpret(Uint32, ibm)) local sgn::Uint32 = fr & 0x8000000 fr <<= 1 # shift sign out local exp::Int32 = int32(fr >>> 25) [/sourcecode]

After adding these annotations to a few key places in the code to get rid of “type instability”… we got a better benchmark.

[sourcecode language="bash"] julia> @time(bench(buf)) elapsed time: 0.483769418 seconds (6176 bytes allocated) [/sourcecode]

AHA!. No garbage collection, and within 4% of our C++ time. That’s amazing. We took a dynamically typed LISP, added 4 or 5 type annotations, and got essentially the same code as a top-flight C++ compiler. (Side note: I should update the Julia REPL or the the Light Table Julia plugin to highlight syntax when there are questionable types found.)

How difficult was it to achieve C++ performance? The lovely answer was: not long. I was new to the language, so it took a bit of bumbling around. And the tooling could certainly be slightly more mature: expressions introducing “type instability” should be easier to find. But it was pretty straightforward and I got excellent results. (I even tried a version without the IbmFloat32 type to see if I got a better benchmark without this distraction, but I didn’t. The abstraction cost me nothing!)

Like F#, I’m slightly uneasy, wondering just when I might get bitten, as the cost of inefficiently-typed code is hard to predict. But the environment was delightful, the primitives sensible, and the fact that I could achieve C++ speed that looked like simpler C++ code is a great comfort. I’ll be curious to see what happens when I build the next algorithm and haven’t built a strongly-typed version of it first. Will it be as easy to find the not-quite-typed-right parts? We’ll see, but for now, Julia stays on the short list because it delivers rapid development, C++ speeds, and a big built-in mathematically focused library.