Re: Mark/Restore and avoiding RandomAccess sorts

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-06 22:06:36
Message-ID: 200701062206.l06M6av04563@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


I saw no replies to this.

---------------------------------------------------------------------------

Simon Riggs wrote:
> Merge Joins require us to potentially Mark and Restore positions in the
> tuples arriving from executor sub-nodes.
>
> This currently means that if the tuples arrive from a Sort node, as they
> often do in an MJ, the sort node will be instructed to prepare a random
> access version of the sort result. That requires a full final merge of
> the output, so as to allow rewinding the input when a Restore operation
> is called.
>
> An MJ doesn't actually need random access, it just needs to be able to
> rewind. The question is: how far does it need to rewind? In many cases,
> the Restore operation moves back a small number of tuples, with a unique
> inner scan requiring a rewind of just one tuple.
>
> It would certainly be cheaper, in most cases, for the Sort node to
> maintain a variable size rewind buffer, where the history of prior
> tuples is truncated each time we perform a Mark operation. This could be
> implemented as a modified Tuplestore that could then be trimmed down
> each time a Mark operation took place. If the tuplestore has not yet
> spilled to disk this could be a trivial operation.
>
> Doing that would almost completely remove the overhead of the final
> merge step in the sort. The final merge often doubles elapsed time in
> cases where the sort is larger than work_mem, which it often is.
>
> Implementing the variable mark/restore buffer as a dumb Tuplestore would
> mean that the space usage of the Sort could in worst case go as high as
> x2 total space. The worst case is where the inner scan is all a single
> value. The best case is where the inner scan is sufficiently unique over
> all its values that it never writes back to disk at all.
>
> So a further refinement of this idea would be to simply defer the final
> merge operation for the sort until the history required for the Mark
> operation exceeded, say, 10% of the sort size. That would then be
> sufficient to improve performance for most common cases, without risking
> massive space overflow for large and highly non-unique data. There's no
> problem with running the final merge slightly later than before;
> everything's still there to allow it. Reusing space in the tuplestore is
> also straightforward since that's exactly what the final merge already
> does, so some rework of that code should be sufficient.
>
> This is a separate, but related idea of being able to avoid
> mark/restores completely when the outer scan is provably unique.
>
> I'm not intending to implement this idea just yet, but it seemed worth
> recording since it occurred to me - and discussing it as a TODO item.
>
> Comments?
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2007-01-06 22:18:33 Re: pg_ctl options
Previous Message Bruce Momjian 2007-01-06 22:01:57 Re: InitPostgres and flatfiles question