As presented in Section 4, the IsoDen method sorts particles in order of decreasing density, and then examines them in that order to generate a list of halos. At first glance, this procedure is inherently sequential, since changing the order of operations would significantly alter the result. Thus, we are forced to create an alternative formulation that achieves exactly the same result, but that allows for a mostly parallel implementation.
The initial step is the same. We calculate densities and linking
lengths for individual particles. This calculation uses the same
techniques, and is implemented using the same libraries, as the
density calculation required by the Smoothed Particle Hydrodynamics
method [13]. By using the parallel tree libraries,
it parallelizes immediately.
The library handles all data structure
manipulation and explicit communication on message passing parallel
architectures.
The library was originally designed to implement parallel ``Fast'' summation
algorithms, e.g. [1],
for the computation of gravitational interactions in
$O(N \log N)$ time, but the data
structures are far more general.
The library has
been ported to a wide variety of systems including message-passing
and shared-memory machines, as well as networks of workstations and
uniprocessor systems. In particular, the results presented here
for the FOF and IsoDen methods were obtained with library implementations
which managed all parallelism and data distribution on
single and multi-processor SPARC workstations,
a 32-node CM-5 at the Australian National University
and a 512-node Intel Paragon at Caltech.
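To make the density-estimation step concrete, the following is a minimal Python sketch of a kernel density estimate of this general kind. The kernel W is a simple normalized stand-in, not the kernel of equation 2, and the neighbor list is assumed to come from the tree-based neighbor search described below; all names here are illustrative, not the library's API.

    import math

    def density_estimate(i, pos, mass, h, nbrs):
        # Kernel density at particle i: the sum of kernel-weighted
        # neighbor masses.  nbrs holds the indices of all particles
        # within the smoothing length h[i] of particle i (including i).
        def W(r, hi):
            # Stand-in kernel: (1 - r/h)^3 inside r < h, normalized so
            # that its integral over three dimensions is unity.
            q = r / hi
            if q >= 1.0:
                return 0.0
            return 15.0 / (math.pi * hi ** 3) * (1.0 - q) ** 3
        return sum(mass[j] * W(math.dist(pos[i], pos[j]), h[i])
                   for j in nbrs)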
A crucial feature of the density-estimation problem is that the particle positions and masses remain fixed throughout the algorithm, and thus may be freely copied between processors without concern for coherence. In fact, the library addresses a slightly more general case which occurs in dynamical simulations -- particle positions may be updated only after an implicit or explicit barrier synchronization. That is, forces (or densities) are computed under the assumption that all data is static. A barrier is reached, after which the positions may be altered by the dynamics, followed by another barrier indicating that all positions have been updated, another force calculation, and so on.
The parallel tree library distributes the data so that a particular
result, e.g., a density estimate for a particular particle,
is computed by only one processor. The assignment of particles to
processors must satisfy two competing goals: the processing load must
be balanced, and the necessary
interprocessor communication must be minimized.
Load balancing of highly irregular
data sets is achieved by sorting particle positions along a
Peano-Hilbert curve (see Figure 7a), and then chopping the
curve into equal pieces. This has the effect of achieving
load balance while maintaining spatial locality (and thereby minimizing
communication overhead).
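The decomposition can be sketched as follows. This illustrative Python uses a Morton (Z-order) key in place of the Peano-Hilbert key actually used by the library; both preserve spatial locality, and the chopping of the sorted curve is identical. The function names and data layout are assumptions made for the sketch.

    def morton_key(ix, iy, iz, bits=10):
        # Interleave the bits of the integer coordinates into one key.
        key = 0
        for b in range(bits):
            key |= ((ix >> b) & 1) << (3 * b)
            key |= ((iy >> b) & 1) << (3 * b + 1)
            key |= ((iz >> b) & 1) << (3 * b + 2)
        return key

    def decompose(positions, nproc, bits=10):
        # positions: (x, y, z) tuples scaled to the unit cube.
        # Sort along the curve, then chop into nproc equal pieces.
        scale = (1 << bits) - 1
        keyed = sorted(
            (morton_key(int(x * scale), int(y * scale), int(z * scale),
                        bits), i)
            for i, (x, y, z) in enumerate(positions))
        chunk = -(-len(keyed) // nproc)      # ceiling division
        owner = [0] * len(positions)
        for rank in range(nproc):
            for _, i in keyed[rank * chunk:(rank + 1) * chunk]:
                owner[i] = rank
        return owner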
Neighbor-finding (and hence evaluation of equation 2)
is done by constructing an adaptive oct-tree with individual
particles at the leaves and internal nodes corresponding to cubical regions
of space. The tree is traversed once for each particle in time proportional
to $\log N$, so that all neighbors are found in $O(N \log N)$
time.
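The pruning logic behind this traversal can be illustrated with a small Python sketch (the names and structure are ours, not the library's; it assumes particles have distinct positions):

    class Cell:
        def __init__(self, center, half):
            self.center, self.half = center, half   # cube center, half-width
            self.children, self.particle = [], None

    def build(points, center=(0.5, 0.5, 0.5), half=0.5):
        # points: list of (index, (x, y, z)) pairs inside the cube.
        cell = Cell(center, half)
        if len(points) == 1:
            cell.particle = points[0]
            return cell
        buckets = {}
        for p in points:
            octant = tuple(int(p[1][d] >= center[d]) for d in range(3))
            buckets.setdefault(octant, []).append(p)
        for octant, sub in buckets.items():
            c = tuple(center[d] + (octant[d] - 0.5) * half for d in range(3))
            cell.children.append(build(sub, c, half / 2))
        return cell

    def neighbors(cell, q, h, out):
        # Distance from q to the nearest point of the cell's cube; if
        # the whole cube is farther than h, the branch is pruned.
        d2 = sum(max(abs(q[d] - cell.center[d]) - cell.half, 0.0) ** 2
                 for d in range(3))
        if d2 > h * h:
            return
        if cell.particle is not None:
            i, p = cell.particle
            if sum((q[d] - p[d]) ** 2 for d in range(3)) <= h * h:
                out.append(i)
            return
        for child in cell.children:
            neighbors(child, q, h, out)

Calling neighbors(root, pos[i], h[i], out) once per particle yields the neighbor sets used by the density estimate sketched above.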
In parallel, the tree is constructed with ``remote'' pointers to
cells that are stored on other processors. When a particular traversal
encounters such a pointer, a communication is initiated requesting the
data from the remote processor. The substantial latency of such requests
is overcome by a) bundling requests together into larger blocks before
actually beginning interprocessor communication and b)
working on another branch of the tree while the request is being
processed. Furthermore, once requested, the contents of a ``remote'' pointer
are stored locally, so that subsequent visits to the same cell do not incur
the cost of interprocessor communication. This explains the value
of spatial locality in the assignment of particles to processors.
The neighbor-sets of nearby particles have a great deal of overlap, so
that the communication cost of obtaining the neighbors from remote processors
is amortized over all the local particles that are neighbors of the remote
ones.
All of the rather substantial bookkeeping involved is handled by the
parallel tree library.
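The flavor of this latency-hiding machinery can be conveyed with a small sketch. This is an assumed interface written for illustration only; the library's actual mechanisms are internal to it.

    class RemoteCellCache:
        def __init__(self, fetch_many):
            self.cache = {}               # cell id -> cell data
            self.pending = set()          # requested but not yet received
            self.fetch_many = fetch_many  # bulk-fetch callback (one message)

        def get(self, cell_id):
            # Return the cell if cached; otherwise queue the request and
            # return None so the caller can work on another branch.
            if cell_id in self.cache:
                return self.cache[cell_id]
            self.pending.add(cell_id)
            return None

        def flush(self):
            # One bundled communication for all outstanding requests.
            if self.pending:
                self.cache.update(self.fetch_many(sorted(self.pending)))
                self.pending.clear()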
Upon completion of the density estimation phase,
every particle has a list
of other particles to which it is linked.
Storing these lists for all particles at once would
significantly increase memory requirements.
Therefore, we traverse the tree multiple times rather than exploiting
the ``obvious'' optimization of recording explicit lists of
linked particles.
For each particle,
we identify the highest density neighbor (HDN), i.e., the
highest density member of the set of particles to which the given particle
is directly linked.
Some particles are their own HDN. These are precisely the
central particles
which define the tentative halos in the sequential
formulation, and we give each of them a unique halo number.
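A minimal sketch of the HDN step, assuming rho[i] is the density estimate of particle i and links[i] lists the particles to which it is directly linked:

    def highest_density_neighbors(rho, links):
        n = len(rho)
        hdn = list(range(n))        # default: each particle is its own HDN
        for i in range(n):
            best = max(links[i], key=lambda j: rho[j], default=i)
            if rho[best] > rho[i]:
                hdn[i] = best
        # Self-HDN particles are local density maxima: the central
        # particles of tentative halos, each given a unique halo number.
        halo_of = {i: k for k, i in
                   enumerate(j for j in range(n) if hdn[j] == j)}
        return hdn, halo_of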
Now we scan the list of particles until we have assigned a
preliminary halo number to each particle. For each particle,
we check its HDN. If its HDN has a preliminary halo number assigned,
then the particle inherits that preliminary halo number. If the
particle's HDN is not yet assigned, we revisit the particle on
the next iteration. Note that the graph defined by the neighbor
relationship is highly interconnected (every particle is linked to
many others), so only a few iterations are required to propagate
the preliminary halo numbers to the entire system. Some
interprocessor communication is required at the start of each
iteration to propagate new information about particles that have been
assigned halo numbers on the previous iteration.
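In serial form, the propagation loop looks like the sketch below; termination is guaranteed because every HDN chain ascends in density and ends at a self-HDN seed. In the parallel code, the communication noted in the comment occurs between sweeps.

    def propagate(hdn, halo_of):
        assigned = dict(halo_of)    # particle -> preliminary halo number
        n = len(hdn)
        while len(assigned) < n:
            for i in range(n):
                if i not in assigned and hdn[i] in assigned:
                    assigned[i] = assigned[hdn[i]]
            # (parallel version: exchange the assignments made this sweep)
        return assigned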
Next, we examine the set of particles and links again, and for each
pair of preliminary halo numbers, we identify the highest-density
particle that is assigned to one of the two halos, and is linked to a
higher density particle that is assigned to the other halo.
The particles identified in this way are the
overlap particles from the sequential formulation.
Note that most pairs of halos are not linked at all, so this step
requires space proportional only to the number of linked halo pairs.
It is easily accomplished by visiting all of the particle-particle links,
and maintaining a sparse data structure that records the highest-density
particle so far encountered that links each pair of halos.
In parallel, when all processors are done with their
own data, these local data structures are combined to produce a global
structure.
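In sketch form, with the same assumed rho, links, and preliminary halo arrays as above:

    def find_overlaps(rho, links, halo):
        # (halo_a, halo_b) -> (density, particle) for the densest
        # particle seen so far that links the two halos.
        overlaps = {}
        for i in range(len(rho)):
            for j in links[i]:
                if halo[i] == halo[j] or rho[j] <= rho[i]:
                    continue        # need a link to a denser particle
                pair = (min(halo[i], halo[j]), max(halo[i], halo[j]))
                if pair not in overlaps or rho[i] > overlaps[pair][0]:
                    overlaps[pair] = (rho[i], i)
        return overlaps

Each processor builds such a dictionary over its own links; merging the per-processor dictionaries with the same maximum-density rule yields the global structure.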
Now it is possible to construct the tree of halos, where the leaves correspond to central regions of isolated halos and the internal nodes correspond to composite halos that meet at overlap particles. One sorts the overlap particles in order of decreasing density, and defines a new composite halo for each overlap. As composite halos are created, the overlap points merge, so that if A and B are merged into M, and both overlapped with C, then M also overlaps with C, at the maximum of the two previous overlaps A-C and B-C. Noise-suppression criteria can be implemented here as well, i.e., preliminary halos may be regarded as tentative until their density exceeds a statistical threshold over the background density, or until they pass the evaporative test. Construction of the tree of halos is, unfortunately, inherently sequential, because the overlaps must be treated in order of decreasing density. It is far faster than a naive implementation of the formulation in Section 4, though, because we have identified the overlap particles in parallel, and only the tree construction is sequential. Furthermore, the tree construction uses only data related to the overlaps; the much larger particle data set can be ignored while the logical structure of the tree of halos is built from the maximum-density overlaps. In practice, it is acceptably fast, even on highly parallel systems. See Section 6 for details of the scalability.
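The construction amounts to a pass over the sorted overlaps with a union-find structure tracking which composite each preliminary halo currently belongs to, as in this sketch (layout conventions are ours):

    def build_halo_tree(n_halos, overlaps):
        # Leaves are numbered 0..n_halos-1; internal node k gets the id
        # n_halos + k.  tree[k] = (left child, right child, density).
        parent = list(range(n_halos))   # union-find over preliminary halos
        node = list(range(n_halos))     # current tree node of each root
        tree = []

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        for (rho_ov, _), (ha, hb) in sorted(
                ((v, k) for k, v in overlaps.items()), reverse=True):
            ra, rb = find(ha), find(hb)
            if ra == rb:
                continue                # already merged by a denser overlap
            tree.append((node[ra], node[rb], rho_ov))
            parent[rb] = ra
            node[ra] = n_halos + len(tree) - 1
        return tree

Because the overlaps are visited in order of decreasing density, a pair whose halos already share a composite can simply be skipped; this realizes the maximum-of-overlaps merging rule without ever updating the overlap table.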
Once the tree of halos has been constructed, with density levels indicating where halos merge into one another, it is a simple matter to distribute the tree to the processors in a parallel system, and then assign each particle to a final, genuine halo in parallel. Starting from the leaf given by its preliminary halo number, each particle ascends the tree of halos (toward the root) until the overlap density is less than the particle's own density. The particle is assigned to the composite halo defined by this location in the tree.
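A sketch of this per-particle ascent, using the tree layout of the construction sketch above:

    def final_halo(p_rho, leaf, tree, n_halos):
        # parent_of[x] = (parent node, density at which x merged into it)
        parent_of = {}
        for k, (left, right, rho_ov) in enumerate(tree):
            parent_of[left] = parent_of[right] = (n_halos + k, rho_ov)
        node = leaf
        while node in parent_of:
            up, rho_ov = parent_of[node]
            if rho_ov < p_rho:
                break                   # denser than the saddle: stay here
            node = up                   # below the saddle: join the composite
        return node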
Friends of friends can be implemented almost as a special case of IsoDen, using constant kernel and linking lengths, and using the number of friends as a measure of density. In the FOF case all halo overlaps result in the overlapping halos being merged together, and the substructure before the overlap is forgotten.
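Under those substitutions the merging step degenerates to plain connected components, e.g. via a union-find over the friend links (a sketch, assuming links[i] holds particle i's friends within the fixed linking length):

    def fof_groups(n, links):
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        for i in range(n):
            for j in links[i]:          # friends within the fixed length
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri     # merge; substructure is forgotten
        return [find(i) for i in range(n)]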