Thus, we have reduced the problem of out-of-core tree-building to the problem of out-of-core sorting. We use a bucket-sort, in which we first sort into buckets that are themselves many pages in length, but small enough to fit in memory. The buckets are then paged back in, and an in-core quicksort algorithm is applied to each bucket. Rather than implement sorting as a separate, atomic, stand-alone procedure, we integrate the ``bucketizing'' phase into the position-update from the ``previous'' timestep. Thus, when we first learn the new position of a particle, we immediately select which bucket it belongs in and place it there directly, rather than first copying it to disk, only to read it back later, select a bucket, and ship it out again.
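The two-phase scheme above can be sketched as follows. This is an illustrative sketch only, not the paper's actual implementation: the function and parameter names are invented, in-memory lists stand in for disk-resident pages of particles, and Python's built-in sort stands in for the in-core quicksort.

```python
def advance_and_bucketize(particles, step, key, bucket_bounds):
    """Phase 1 (fused with the position update): apply the timestep to each
    particle and route it directly into its destination bucket, avoiding the
    extra write-then-read round trip described in the text.

    bucket_bounds is an ascending list of key upper bounds; keys >= the last
    bound fall into a final overflow bucket."""
    buckets = [[] for _ in range(len(bucket_bounds) + 1)]
    for p in particles:
        q = step(p)  # position update from the "previous" timestep
        # Select the bucket as soon as the new position is known.
        b = next((i for i, hi in enumerate(bucket_bounds) if key(q) < hi),
                 len(bucket_bounds))
        buckets[b].append(q)  # in the real code: append to the bucket's pages on disk
    return buckets

def sort_buckets(buckets, key):
    """Phase 2: page each bucket back in, sort it in core, and concatenate.
    Because the buckets partition the key range, the concatenation of the
    sorted buckets is the globally sorted particle list."""
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket, key=key))  # stands in for in-core quicksort
    return out
```

For example, with an identity `step` and `key`, `sort_buckets(advance_and_bucketize(ps, step, key, bounds), key)` returns the particles in globally sorted key order while having scanned the full particle list only twice.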
There appears to be considerable tension between optimization for out-of-core performance (and perhaps to a lesser extent, cache performance) and generally accepted standards of good programming practice. Good program design stresses modularity and encapsulation, with each software component having a clear, limited interface and behavior. Applied naively, these ideas can lead to many more page-swaps (or cache misses) than necessary. For example, a standard high-level program design for an N-body code would probably separate sorting, tree-building, force computation, diagnostic accumulation and time integration into separate, independent modules. Each of these modules requires a complete scan of the list of particles, so the entire data set would be paged in and out many times. To recover performance, one combines the time integration, diagnostic accumulation and force computation into a single pass. More dramatically, the sort is split into two phases, as described above. This leaves only two cycles of paging, which may result in a threefold speedup if the program is bandwidth-limited. But it also results in a code where modularity is far less apparent. Sorting, in particular, no longer exists as a clear, separable component, so it is more difficult to apply the results of other research efforts in parallel sorting.
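The pass-fusion idea can be made concrete with a small sketch. This is a hedged illustration, not code from the paper: `force`, `integrate`, and `accumulate` are hypothetical callbacks standing in for the three modules, and the point is only that all three run on a particle while it is resident in memory, so the particle list is scanned (and hence paged in) once rather than three times.

```python
def fused_pass(particles, force, integrate, accumulate):
    """One scan of the particle list: for each particle, compute its force,
    advance it in time, and fold it into the running diagnostics, before
    moving on to the next particle.  A naive modular design would instead
    make three full scans, tripling the paging traffic."""
    diagnostics = 0.0
    updated = []
    for p in particles:
        f = force(p)                  # force computation
        p = integrate(p, f)           # time integration
        diagnostics += accumulate(p)  # diagnostic accumulation
        updated.append(p)
    return updated, diagnostics
```

The price paid for this fusion is exactly the loss of modularity described above: the three concerns share one loop body and can no longer be swapped out independently.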