Merge Joins require us to potentially Mark and Restore positions in the
tuples arriving from executor sub-nodes.
This currently means that if the tuples arrive from a Sort node, as they
often do in an MJ, the sort node will be instructed to prepare a random
access version of the sort result. That requires a full final merge of
the output, so as to allow rewinding the input when a Restore operation
An MJ doesn't actually need random access, it just needs to be able to
rewind. The question is: how far does it need to rewind? In many cases,
the Restore operation moves back a small number of tuples, with a unique
inner scan requiring a rewind of just one tuple.
It would certainly be cheaper, in most cases, for the Sort node to
maintain a variable size rewind buffer, where the history of prior
tuples is truncated each time we perform a Mark operation. This could be
implemented as a modified Tuplestore that could then be trimmed down
each time a Mark operation took place. If the tuplestore has not yet
spilled to disk this could be a trivial operation.
Doing that would almost completely remove the overhead of the final
merge step in the sort. The final merge often doubles elapsed time in
cases where the sort is larger than work_mem, which it often is.
Implementing the variable mark/restore buffer as a dumb Tuplestore would
mean that the space usage of the Sort could in worst case go as high as
x2 total space. The worst case is where the inner scan is all a single
value. The best case is where the inner scan is sufficiently unique over
all its values that it never writes back to disk at all.
So a further refinement of this idea would be to simply defer the final
merge operation for the sort until the history required for the Mark
operation exceeded, say, 10% of the sort size. That would then be
sufficient to improve performance for most common cases, without risking
massive space overflow for large and highly non-unique data. There's no
problem with running the final merge slightly later than before;
everything's still there to allow it. Reusing space in the tuplestore is
also straightforward since that's exactly what the final merge already
does, so some rework of that code should be sufficient.
This is a separate, but related idea of being able to avoid
mark/restores completely when the outer scan is provably unique.
I'm not intending to implement this idea just yet, but it seemed worth
recording since it occurred to me - and discussing it as a TODO item.
pgsql-hackers by date
|Next:||From: Gregory Stark||Date: 2007-01-03 19:17:22|
|Subject: Re: Operator class group proposal|
|Previous:||From: Tom Lane||Date: 2007-01-03 18:17:32|
|Subject: Re: Operator class group proposal |