A Quick Note on GPU Accuracy and Double Precision

Simulations are often run with double precision to reduce error - at the cost of performance.

Often when optimizing simulation software, we run into a classic roadblock: single precision floating point is not accurate enough, so we must use double precision.

Of course, there are many real physical problems where a solution does require double precision computation. However, we should be careful not to fall into a trap - the performance benefit of using 32-bit rather than 64-bit datatypes is significant.

Particularly on GPUs (Graphics Processing Units), where memory and bandwidth charge a higher premium, 32-bit operations are much more desirable (see also machine learning's recent push toward 16-bit operations).

But we do have one fascinating edge on our side when using GPU calculations: they are often more accurate than CPU calculations.

No, I don't mean that the GPU has a more accurate FPU (Floating Point Unit).  Instead, because of the massive number of GPU threads and the correspondingly small units of work, GPU threads more often combine numbers of comparable magnitude and thus produce more accurate results.

Classical Sum of an Array

Take as a simple example summing the elements of an array. A classic sequential CPU loop looks something like this:
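Here is a minimal sketch of such a loop in Python, using numpy's float32 so that every addition rounds to 32 bits:

```python
import numpy as np

def sequential_sum(values):
    """Classic sequential reduction: one running 32-bit accumulator."""
    total = np.float32(0.0)
    for v in values:
        # Each addition rounds back to 32 bits, so once the total dwarfs
        # the next element, that element's low-order bits are lost.
        total = np.float32(total + np.float32(v))
    return total
```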

As the running sum grows much larger than the individual elements, each addition discards more of the next element's low-order bits, so the error grows with the sum.  This is a simple consequence of how floating point numbers are represented with a fixed number of bits.
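A quick, hypothetical two-liner shows the effect at the magnitudes used later in this post:

```python
import numpy as np

# Near 5e6, adjacent float32 values are 0.5 apart, so adding 0.1
# changes nothing at all:
print(np.float32(5_000_000.0) + np.float32(0.1))  # prints 5000000.0
```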

GPU Sum of an Array

For a GPU, we must write our code to use as many threads as possible.  At a very high level, our code may look something like the code below.  For a real implementation, one may look here.
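As a stand-in for real kernel code, here is the tree-style reduction sketched in plain Python; each pass sums pairs of elements, the way one GPU thread per pair would, and halves the working set:

```python
import numpy as np

def tree_sum(values):
    """Pairwise (tree) reduction in 32-bit precision.

    Each pass mimics one parallel step: "thread" i adds element i
    and element i + half, halving the array until one value remains.
    """
    work = np.array(values, dtype=np.float32)  # private working copy
    n = len(work)
    while n > 1:
        half = n // 2
        work[:half] += work[half:2 * half]  # all pairs summed "in parallel"
        if n % 2:                           # an odd element carries forward
            work[half] = work[n - 1]
            half += 1
        n = half
    return work[0]
```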

Visually, the GPU algorithm forms a binary tree: each level sums pairs of partial sums from the level below, so an array of n elements is reduced in roughly log2(n) parallel steps.

Due to this tree-like structure, the operands sent to the FPU tend to have comparable magnitudes, and the sum therefore accumulates less rounding error.

Practical Example using Python

Here I will use numpy and pycuda to briefly demonstrate the point.  First, looking at sum:
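Something along these lines reproduces the comparison (the 10-million-element uniform random array is my assumption, sized to match the totals below; gpuarray.sum performs a tree reduction on the device, and the float64 sum serves as the reference):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
import pycuda.gpuarray as gpuarray

data = np.random.rand(10_000_000).astype(np.float32)

true_sum = data.astype(np.float64).sum()             # 64-bit reference value
cpu_sum = data.sum()                                 # 32-bit CPU sum
gpu_sum = gpuarray.sum(gpuarray.to_gpu(data)).get()  # 32-bit GPU tree reduction

print("True sum:", true_sum)
print("CPU 32bit precision:", cpu_sum, "error:", abs(true_sum - cpu_sum))
print("GPU 32bit precision:", gpu_sum, "error:", abs(true_sum - gpu_sum))
```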

```
True sum: 4999471.5791
CPU 32bit precision: 4.99947e+06 error: 0.579096186906
GPU 32bit precision: 4999471.5 error: 0.0790961869061
```

Let’s look at another practical example, performing the dot product:
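A matching sketch for the dot product (here I dot the array with itself, an assumption that fits the magnitude of the result below; gpuarray.dot computes and reduces the products on the device):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
import pycuda.gpuarray as gpuarray

data = np.random.rand(10_000_000).astype(np.float32)
d_data = gpuarray.to_gpu(data)

true_dot = np.dot(data.astype(np.float64), data.astype(np.float64))  # 64-bit reference
cpu_dot = np.dot(data, data)                  # 32-bit CPU dot product
gpu_dot = gpuarray.dot(d_data, d_data).get()  # 32-bit GPU dot product, tree-reduced

print("True dot product:", true_dot)
print("CPU 32bit precision:", cpu_dot, "error:", abs(true_dot - cpu_dot))
print("GPU 32bit precision:", gpu_dot, "error:", abs(true_dot - gpu_dot))
```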

```
True dot product: 3332515.05789
CPU 32bit precision: 3.33231e+06 error: 200.557887515
GPU 32bit precision: 3332515.0 error: 0.0578875150532
```

Conclusions

Of course, there are many other variables at play in a real-world scenario - library implementation, compiler flags, and so on.  In this simple example, you may also find the magnitude of the error too small to be exciting.

Yet the GPU solution is “closer to double”.  The pattern of the GPU implementation - thousands of threads performing tree-style reductions - repeats in the real world and often leads to more accurate results.  In some cases, the difference is enough to drop the double precision requirement, to the benefit of performance.

Does your problem require some serious number crunching? Send Ryan an email!

Ryan Brady

March 27, 2018
