Thus, we have reduced the problem of out-of-core tree-building to the problem of out-of-core sorting. We use a bucket-sort, in which we first sort into buckets that are themselves many pages in length, but small enough to fit in memory. The buckets are then paged back in, and an in-core quicksort algorithm is applied to each bucket. Rather than implement sorting as a separate, atomic, stand-alone procedure, we integrate the ``bucketizing'' phase into the position-update from the ``previous'' timestep. Thus, when we first learn the new position of a particle, we immediately select which bucket it belongs in and place it there directly, rather than first copying it to disk, only to read it back later, select a bucket, and ship it out again.
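The two-phase scheme above can be sketched as follows. This is an illustrative sketch only, not the paper's actual implementation: the function and parameter names are invented, in-memory lists stand in for disk-resident pages of particles, and Python's built-in sort stands in for the in-core quicksort.

```python
def advance_and_bucketize(particles, step, key, bucket_bounds):
    """Phase 1 (fused with the position update): apply the timestep to each
    particle and route it directly into its destination bucket, avoiding the
    extra write-then-read round trip described in the text.

    bucket_bounds is an ascending list of key upper bounds; keys >= the last
    bound fall into a final overflow bucket."""
    buckets = [[] for _ in range(len(bucket_bounds) + 1)]
    for p in particles:
        q = step(p)  # position update from the "previous" timestep
        # Select the bucket as soon as the new position is known.
        b = next((i for i, hi in enumerate(bucket_bounds) if key(q) < hi),
                 len(bucket_bounds))
        buckets[b].append(q)  # in the real code: append to the bucket's pages on disk
    return buckets

def sort_buckets(buckets, key):
    """Phase 2: page each bucket back in, sort it in core, and concatenate.
    Because the buckets partition the key range, the concatenation of the
    sorted buckets is the globally sorted particle list."""
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket, key=key))  # stands in for in-core quicksort
    return out
```

For example, with an identity `step` and `key`, `sort_buckets(advance_and_bucketize(ps, step, key, bounds), key)` returns the particles in globally sorted key order while having scanned the full particle list only twice.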
There appears to be considerable tension between optimization for out-of-core performance (and perhaps to a lesser extent, cache performance) and generally accepted standards of good programming practice. Good program design stresses modularity and encapsulation, with each software component having a clear, limited interface and behavior. Applied naively, these ideas can lead to many more page-swaps (or cache misses) than necessary. For example, a standard high-level program design for an N-body code would probably separate sorting, tree-building, force computation, diagnostic accumulation and time integration into separate, independent modules. Each of these modules requires a complete scan of the list of particles, so the entire data set would be paged in and out many times. To recover performance, one combines the time integration, diagnostic accumulation and force computation into a single pass. More dramatically, the sort is split into two phases, as described above. This leaves only two cycles of paging, which may result in a threefold speedup if the program is bandwidth-limited. But it also results in a code where modularity is far less apparent. Sorting, in particular, no longer exists as a clear, separable component, so it is more difficult to apply the results of other research efforts in parallel sorting.
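The pass-fusion idea can be made concrete with a small sketch. This is a hedged illustration, not code from the paper: `force`, `integrate`, and `accumulate` are hypothetical callbacks standing in for the three modules, and the point is only that all three run on a particle while it is resident in memory, so the particle list is scanned (and hence paged in) once rather than three times.

```python
def fused_pass(particles, force, integrate, accumulate):
    """One scan of the particle list: for each particle, compute its force,
    advance it in time, and fold it into the running diagnostics, before
    moving on to the next particle.  A naive modular design would instead
    make three full scans, tripling the paging traffic."""
    diagnostics = 0.0
    updated = []
    for p in particles:
        f = force(p)                  # force computation
        p = integrate(p, f)           # time integration
        diagnostics += accumulate(p)  # diagnostic accumulation
        updated.append(p)
    return updated, diagnostics
```

The price paid for this fusion is exactly the loss of modularity described above: the three concerns share one loop body and can no longer be swapped out independently.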