Often when optimizing simulation software, we run into a classic roadblock: single precision floating point is not accurate enough, so we must use double precision.
Of course, there are many instances in real physical problems where a solution genuinely requires double precision computation. However, we should be careful not to fall into a trap: the performance benefits of 32-bit over 64-bit datatypes are significant.
Particularly on GPUs (Graphics Processing Units), where memory and bandwidth come at a higher premium, 32-bit operations are much more desirable (see also machine learning's recent push toward 16-bit operations).
But we do have one fascinating edge on our side when using GPU calculations: they are often more accurate than CPU calculations.
No, I don't mean that the GPU has a more accurate FPU (Floating Point Unit). Rather, because of the massive number of GPU threads and the correspondingly small units of work, GPU threads more often combine numbers of comparable magnitude and are thus more accurate.
Classical Sum of an Array
Take, as a simple example, summing the elements of an array. A classic sequential CPU loop looks something like this:
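In Python-flavored pseudocode, with a 32-bit accumulator, the idea is roughly the following (the function name and structure here are illustrative):

```python
import numpy as np

def sequential_sum(values):
    """Naive left-to-right accumulation in 32-bit precision."""
    total = np.float32(0.0)
    for v in values:
        # Every element is folded into one ever-growing running total;
        # once the total is large, the element's low-order bits are lost.
        total += np.float32(v)
    return total
```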

As the sum grows larger, so does the error. This is a simple consequence of floating point numbers being represented with a fixed number of bits: once the running total is much larger than the next element, the element's low-order bits are rounded away.
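For example, once the running total exceeds the granularity of the 32-bit format, adding a small element has no effect at all:

```python
import numpy as np

big = np.float32(16_777_216.0)       # 2**24: float32 can no longer resolve a +1 here
print(big + np.float32(1.0) == big)  # prints True -- the added 1.0 is rounded away
```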
GPU Sum of an Array
For a GPU, we must write our code to use as many threads as possible. At a very high level, our code may look something like the code below. For a real implementation, one may look here.
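The sketch below is a serial Python stand-in for that idea: each pass of the loop corresponds to one parallel step in which every addition would be carried out by its own GPU thread (a real kernel would be written in CUDA; the function name here is illustrative).

```python
import numpy as np

def tree_sum(values):
    """Pairwise (tree-style) reduction in 32-bit precision."""
    vals = np.asarray(values, dtype=np.float32).copy()
    n = len(vals)
    while n > 1:
        half = n // 2
        # One parallel step: add the back half onto the front half,
        # leaving any odd middle element for a later pass.
        # On a GPU, each of these additions runs in a separate thread.
        vals[:half] += vals[n - half:n]
        n -= half
    return vals[0]
```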

Visually, the GPU algorithm looks something like this:

Due to the tree-like nature of the reduction, the operands sent to the FPU tend to have comparable magnitudes, and the sum therefore accumulates less error.
Practical Example using Python
Here I will use numpy and pycuda to briefly demonstrate the point. First, looking at sum:
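Below is a minimal sketch of the comparison, assuming pycuda is installed and a CUDA device is available; pycuda's gpuarray.sum performs a parallel, tree-style reduction on the device. The exact values, such as those shown below, depend on the random data.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
import pycuda.gpuarray as gpuarray

n = 10_000_000
a32 = np.random.rand(n).astype(np.float32)

# Reference: sum the same 32-bit data with 64-bit accumulation.
true_sum = a32.astype(np.float64).sum()

cpu_sum = a32.sum()                                  # 32-bit CPU reduction
gpu_sum = gpuarray.sum(gpuarray.to_gpu(a32)).get()   # 32-bit GPU reduction

print("True sum:", true_sum)
print("CPU 32bit precision:", cpu_sum, "error:", abs(true_sum - cpu_sum))
print("GPU 32bit precision:", gpu_sum, "error:", abs(true_sum - gpu_sum))
```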

True sum: 4999471.5791
CPU 32bit precision: 4.99947e+06 error: 0.579096186906
GPU 32bit precision: 4999471.5 error: 0.0790961869061

Let’s look at another practical example, performing the dot product:
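Again, a minimal sketch under the same assumptions (pycuda available; gpuarray.dot reduces the elementwise products on the device):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- initializes the CUDA context
import pycuda.gpuarray as gpuarray

n = 10_000_000
a32 = np.random.rand(n).astype(np.float32)
b32 = np.random.rand(n).astype(np.float32)

# Reference: same 32-bit data, accumulated in 64-bit precision.
true_dot = np.dot(a32.astype(np.float64), b32.astype(np.float64))

cpu_dot = np.dot(a32, b32)                          # 32-bit CPU dot product
a_gpu, b_gpu = gpuarray.to_gpu(a32), gpuarray.to_gpu(b32)
gpu_dot = gpuarray.dot(a_gpu, b_gpu).get()          # 32-bit GPU dot product

print("True dot product:", true_dot)
print("CPU 32bit precision:", cpu_dot, "error:", abs(true_dot - cpu_dot))
print("GPU 32bit precision:", gpu_dot, "error:", abs(true_dot - gpu_dot))
```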

True dot product: 3332515.05789
CPU 32bit precision: 3.33231e+06 error: 200.557887515
GPU 32bit precision: 3332515.0 error: 0.0578875150532

Conclusions
Of course, there are many other variables at play in a real-world scenario: library implementation, compiler flags, and so on. Perhaps, in this simple example, you will find the magnitude of the error too small to be exciting.
Yet the GPU solution is “closer to double”. The pattern behind the GPU implementation, thousands of threads performing tree-style reductions, repeats in real-world code and often leads to more accurate results. In some cases, the difference is enough to drop the double precision requirement, to the benefit of performance.
Does your problem require some serious number crunching? Send Ryan an email!