Recently I ran into a rather interesting deadlock in an HPC development project. Tracking down the issue required a bunch of interesting tools, but I’ll save some of the story for next time.
After logging into a deadlocked node, I fired up the first line of defense – GDB. OK, let’s get a backtrace of where all my threads are:
thread apply all bt
2220 threads! Something must be going wrong with the third-party scheduler. But the operating system can handle even more concurrent threads, so although I can’t condone spawning a few thousand threads in a process, I can’t point my finger at the scheduler and go on vacation.
The real problem here – how am I supposed to dig through a few thousand stack traces?
A typical GDB stack trace contains a set of frames, where each frame consists of a function address followed by debugging info such as the function name, arguments and line number, depending on how much information is available to the debugger. In a lot of HPC code, the same functions tend to get called by many different threads with different arguments, so the function addresses and call stacks tend to be quite similar. Thus, we can use the sequence of function addresses in each backtrace as an identifier, group the threads whose backtraces match, and reduce the total number of backtraces we really need to read.
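To make the idea concrete, here is a minimal sketch of that grouping in Python. It is an illustration under stated assumptions, not the script used later in this post: the names `THREAD_RE`, `FRAME_RE`, `group_backtraces` and `print_summary` are hypothetical, and it assumes the log has the usual `thread apply all bt` layout, with a `Thread N (...)` header followed by `#0`, `#1`, ... frame lines. GDB sometimes prints a frame without an address, so the sketch falls back to the function name in that case.

```python
import re
from collections import defaultdict

# Assumed layout of a thread header: "Thread 12 (Thread 0x... (LWP 345)):"
THREAD_RE = re.compile(r"^Thread (\d+) ")
# Assumed layout of a frame: "#1  0x0000000000400700 in worker (arg=0x1) at w.c:42"
# The address is optional; GDB may omit it for some frames.
FRAME_RE = re.compile(r"^#\d+\s+(?:(0x[0-9a-fA-F]+) in )?(\S+)")


def group_backtraces(lines):
    """Map each distinct call-stack signature to the thread ids sharing it."""
    stacks = []  # one (thread_id, [frame keys]) entry per thread, in log order
    for line in lines:
        thread = THREAD_RE.match(line)
        frame = FRAME_RE.match(line)
        if thread:
            stacks.append((int(thread.group(1)), []))
        elif frame and stacks:
            # Prefer the function address; fall back to the function name.
            stacks[-1][1].append(frame.group(1) or frame.group(2))

    groups = defaultdict(list)
    for thread_id, keys in stacks:
        groups[tuple(keys)].append(thread_id)
    return groups


def print_summary(groups):
    """Print one line per distinct call stack, most common first."""
    ordered = sorted(groups.items(), key=lambda item: -len(item[1]))
    for keys, thread_ids in ordered:
        print(f"{len(thread_ids)} thread(s): {' <- '.join(keys)}")
```

Passing an open log file to `group_backtraces` and the result to `print_summary` would print one line per distinct call stack. Only the addresses (or names) are compared, so threads that differ only in their arguments collapse into the same group.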
First, let’s dump all of the backtraces to a log file using GDB’s logging. I’m on host 12 of 25, so I’ll log to “ghosts12.log”:
set logging file ghosts12.log
set logging on
thread apply all bt
Next, I’ll use a Python script I wrote to parse through the log file:
With that, I get the far more pleasant result:
Ah! Now that’s much more manageable. I can probably find something useful if I only need to read through 27 callstacks.
Stay tuned for the next chapter!