In fact, we have implemented a parallel out-of-core treecode primarily through a small extension of the uni-processor paging abstraction. In addition to a page number, the identifier passed to, e.g., PgRef also contains a processor number. The implementation of PgRef and its companions distinguishes between local pages, which are accessed with read and write, and remote pages, which are accessed via message exchange with a ``server'' on the remote processor. Coherence is not a problem: once the tree is built (a purely local operation involving writing data), all further access is read-only. We must, however, be careful to purge all cached remote pages between timesteps. The ``server'' is simply an occasional poll for page requests during force evaluation. It is remarkable that this simple, blocking, synchronous approach succeeds. At the outset, we assumed that multiple servers would have to run asynchronously in separate threads, but performance is entirely adequate without such sophisticated infrastructure.
With this minor extension to the underlying paging system, the basic tree-traversal, force-evaluation and time-integration algorithms were implemented completely unchanged. Some additional code was added to the tree-build phase so that processors can build local trees independently and then merge those sib_groups whose spatial domains extend over multiple processors. Parallel data-decomposition code is also required, to assign processors to regions of space and to move bodies appropriately, but both of these components are essentially identical to their counterparts in in-core parallel codes.