The first step is to formulate the algorithms and data structures in a way suitable for out-of-core evaluation. We are used to thinking in terms of floating point operations: counting them, pipelining them, finding ways to eliminate them altogether. The validity of this mindset is questionable in the context of modern computer systems where uncached memory accesses can be an order of magnitude more expensive than floating point operations. It could be disastrous in an out-of-core code where one million ``extra'' flops might be no more expensive than a single disk access. We must turn our attention primarily to concerns related to data movement, with operation count considerably less important.
Two concerns drive the design of out-of-core algorithms. Latency tolerance and bandwidth economy. The first implies that whenever a datum is required from the disk, we must be prepared to wait hundreds of thousands of cycles for it to be delivered. The second says that whenever we have gone to the trouble of moving a datum from disk to memory, we must use it enough times to amortize the cost of having done so.
We have chosen a simple approach to latency tolerance. We simply ensure that all transfers to and from the disk are sufficiently large that latency is irrelevant. The conventional formula for the time, t, to service a request is:
where is the latency (10ms for a typical commodity disk), b is
the bandwidth (1.5MB/s for the same disk) and s is the size of the
request. If we ensure that all transfers satisfy
, (15kB
for our typical case), then latency will not be a dominant factor in
the overall performance. Of course, we have now made the bandwidth
problem somewhat harder. We must design an algorithm that makes
optimal use of the available bandwidth but which is restricted to
transferring data in chunks no smaller than 15kB.
Multi-threading is an alternative approach to latency hiding
that allows the processor to do something else
while waiting for requests to be processed. Multi-threading is
essential if it is not practical to make large requests as we propose
to do, but effective multi-threading often relies on special hardware
features, and we have not explored its use in treecodes.
To address the bandwidth problem, we must design our data structures and algorithms so that whenever a datum is moved from a disk to memory, we get the maximum possible use of it before returning it to the disk or disposing of it. We must also minimize unnecessary copies, i.e., situations in which data that has not been modified in memory is copied back to disk anyway because the system wasn't sure that it was safe, or because ``dirty'' and ``clean'' data are interspersed in a single page.
In summary, the following rules guide the implementation our out-of-core treecode: