There’s an old standby that tells us:
A supercomputer is a device for turning compute-bound problems into I/O-bound problems.
If that’s the case, then perhaps concepts from freshman chemistry are useful when looking at HPC problems. In chemistry, the rate-limiting step is the step that constrains the rest of the reaction. We are looking at a problem these days where we would like to compute a 12MB extract from 600GB of data, and if we had our druthers, we’d do it in 100ms. There are lots of pieces of this pipeline, but it’s worth considering just the raw bandwidth required, as we would consider flux in a chemical reaction. For us, that’s 6 TB s⁻¹. Not a lot of devices can push that many bytes per second. Wikipedia has a handy list of device bandwidths we can look at to figure out just what kinds of things might get close. One device they don’t list there is GPU RAM, but there’s an entry over at Video Card with some details about graphics card RAM, which has pretty fabulous bandwidth.
The short answer is that only RAM gets close to that kind of bandwidth. I didn’t mention that the 600GB is a subset of about 5TB of data we’re going to keep live in RAM, and we can’t afford that much GPU RAM. It appears the maximum theoretical bandwidth of main system RAM is on the order of 20 GB s⁻¹. That suggests that if we really have to look at all 600GB of the data, and it’s all in RAM, then we still need a minimum of three hundred RAM buses to have a hope of answering our question in that time period, and that’s if it’s perfectly parallelizable!
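The back-of-envelope arithmetic is simple enough to write down. A minimal sketch in Python, using only the figures above (600GB scanned within a 100ms budget, and roughly 20 GB per second per RAM bus):

```python
# Back-of-envelope: required bandwidth, and RAM buses needed to supply it.
DATA_BYTES = 600e9   # 600GB we have to scan
DEADLINE_S = 0.100   # 100ms answer budget
BUS_BW     = 20e9    # ~20 GB/s theoretical max for one main-RAM bus

required_bw = DATA_BYTES / DEADLINE_S   # bytes per second we must sustain
buses = required_bw / BUS_BW            # buses needed, assuming perfect parallelism

print(f"required bandwidth: {required_bw / 1e12:.0f} TB/s")  # prints 6 TB/s
print(f"RAM buses needed:   {buses:.0f}")                    # prints 300
```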
There are probably other important rate-limiting steps, like the speed of the floating point units, or vector processing power, but this is at least a useful number to get on the table for our problem. There are also ways to make better use of that bandwidth, such as compressing data in RAM. What if we could compress 10x? Or use a heuristic that only needs to look at 50% of the data? These are potential improvements, but the basic number is an excellent sanity check.
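Those what-ifs fall straight out of the same arithmetic. A small sketch, where the 10x compression ratio and the 50% sampling fraction are just the hypotheticals above, not measured numbers:

```python
# How compression and sampling scale the bus count (same figures as before).
DATA_BYTES = 600e9   # 600GB to scan
DEADLINE_S = 0.100   # 100ms budget
BUS_BW     = 20e9    # ~20 GB/s per RAM bus

def buses_needed(data_bytes, compression=1.0, fraction=1.0):
    """RAM buses needed to scan `fraction` of the data, held in RAM at
    `compression`-fold compression, within the deadline."""
    effective = data_bytes * fraction / compression
    return effective / DEADLINE_S / BUS_BW

print(f"{buses_needed(DATA_BYTES):.0f}")                  # raw scan: prints 300
print(f"{buses_needed(DATA_BYTES, compression=10):.0f}")  # 10x compression: prints 30
print(f"{buses_needed(DATA_BYTES, fraction=0.5):.0f}")    # heuristic, half the data: prints 150
```

Either lever divides the bus count directly, which is why they’re worth chasing even though neither changes the basic rate-limiting picture.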