Re: Sorting Improvements for 8.4

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sorting Improvements for 8.4
Date: 2007-12-03 18:32:29
Message-ID: 1196706749.22428.465.camel@dogma.ljc.laika.com
Lists: pgsql-hackers

On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:
> So Algorithm F is somewhat orthogonal to what I've proposed, though
> maybe still interesting. What we do now is fairly close, but you should
> look at the code in tuplesort.c and logtape.c to see how well it
> matches. That might lead to an increase in the limit of the number of
> concurrent runs mergeable at any one time.
>

tuplesort.c:

* When merging runs, we use a heap containing just the frontmost tuple from
* each source run; we repeatedly output the smallest tuple and insert the
* next tuple from its source tape (if any). When the heap empties, the merge
* is complete. The basic merge algorithm thus needs very little memory ---
* only M tuples for an M-way merge, and M is constrained to a small number.
* However, we can still make good use of our full workMem allocation by
* pre-reading additional tuples from each source tape. Without prereading,
* our access pattern to the temporary file would be very erratic; on average
* we'd read one block from each of M source tapes during the same time that
* we're writing M blocks to the output tape, so there is no sequentiality of
* access at all, defeating the read-ahead methods used by most Unix kernels.
* Worse, the output tape gets written into a very random sequence of blocks
* of the temp file, ensuring that things will be even worse when it comes
* time to read that tape. A straightforward merge pass thus ends up doing a
* lot of waiting for disk seeks. We can improve matters by prereading from
* each source tape sequentially, loading about workMem/M bytes from each tape
* in turn. Then we run the merge algorithm, writing but not reading until
* one of the preloaded tuple series runs out. Then we switch back to preread
* mode, fill memory again, and repeat. This approach helps to localize both
* read and write accesses.
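
To make sure I'm reading that comment right, here is a toy sketch of the merge
strategy it describes (not the actual tuplesort.c code): an M-way merge that
keeps only the frontmost item of each run in memory, with each run preread in
fixed-size chunks so reads from the temp file stay sequential. Runs are
simulated by sorted int arrays, a linear scan over the M frontmost items
stands in for the heap, and CHUNK plays the role of workMem/M.

#include <stdbool.h>
#include <stdio.h>

#define NRUNS  3
#define CHUNK  4                /* plays the role of workMem / M per run */

typedef struct
{
    const int  *data;           /* the whole sorted run, i.e. the "tape" */
    int         len;
    int         fetched;        /* items preread from the tape so far */
    int         buf[CHUNK];     /* preread buffer */
    int         nbuf;           /* valid items in buf */
    int         pos;            /* next unread item in buf */
} Run;

/* Sequentially preread the next CHUNK items from one run. */
static void
preread(Run *r)
{
    r->nbuf = r->pos = 0;
    while (r->nbuf < CHUNK && r->fetched < r->len)
        r->buf[r->nbuf++] = r->data[r->fetched++];
}

/* Fetch the next item from a run, prereading when its buffer empties. */
static bool
next_item(Run *r, int *out)
{
    if (r->pos >= r->nbuf)
    {
        preread(r);
        if (r->nbuf == 0)
            return false;       /* run exhausted */
    }
    *out = r->buf[r->pos++];
    return true;
}

int
main(void)
{
    static const int r0[] = {1, 4, 9, 12, 20, 21};
    static const int r1[] = {2, 3, 10, 15};
    static const int r2[] = {5, 6, 7, 8, 30};
    Run         runs[NRUNS] = {
        {r0, 6, 0, {0}, 0, 0},
        {r1, 4, 0, {0}, 0, 0},
        {r2, 5, 0, {0}, 0, 0},
    };
    int         front[NRUNS];
    bool        valid[NRUNS];

    /* Prime the merge with the frontmost item from each run. */
    for (int i = 0; i < NRUNS; i++)
        valid[i] = next_item(&runs[i], &front[i]);

    for (;;)
    {
        int         best = -1;

        for (int i = 0; i < NRUNS; i++)
            if (valid[i] && (best < 0 || front[i] < front[best]))
                best = i;
        if (best < 0)
            break;              /* all runs exhausted: merge complete */

        printf("%d\n", front[best]);    /* "write to the output tape" */
        valid[best] = next_item(&runs[best], &front[best]);
    }
    return 0;
}

The point of the CHUNK-sized buffers is just that each refill touches one tape
sequentially, rather than bouncing a block at a time between all M tapes.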

The idea of prefetching, as I understand it, is that we don't blindly
preread workMem/M bytes from each of the M tapes; instead we forecast
which tape we will need tuples from next and preread from that one.

If I understand correctly, we just keep track of the maximum key in the
last block read from each run, and then always preread next from the run
whose last-read block has the lowest maximum.
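
If that reading is right, a minimal standalone sketch of the rule would look
something like this (RunState and forecast_next_run are hypothetical names;
nothing like this exists in tuplesort.c today): remember the largest key in
the chunk most recently preread from each run, and when a buffer can be
refilled, preread from the run whose remembered maximum is smallest, since
its preloaded tuples will be consumed first.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool        exhausted;      /* no more blocks left on this tape */
    int         last_max;       /* largest key in the last block preread */
} RunState;

/* Pick the run to preread from next; -1 means every tape is exhausted. */
static int
forecast_next_run(const RunState *runs, int nruns)
{
    int         best = -1;

    for (int i = 0; i < nruns; i++)
    {
        if (runs[i].exhausted)
            continue;
        if (best < 0 || runs[i].last_max < runs[best].last_max)
            best = i;
    }
    return best;
}

int
main(void)
{
    /*
     * Run 1's last preread block tops out at 17, lower than the other
     * runs' 42 and 99, so its buffer will drain first: forecast run 1.
     */
    RunState    runs[] = {{false, 42}, {false, 17}, {false, 99}};

    printf("preread next from run %d\n", forecast_next_run(runs, 3));
    return 0;
}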

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully into something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Regards,
Jeff Davis
