Re: WIP: further sorting speedup

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: pgsql-patches(at)postgreSQL(dot)org, "Tim Kordas" <tkordas(at)greenplum(dot)com>
Subject: Re: WIP: further sorting speedup
Date: 2006-02-20 03:47:43
Message-ID: C01E7BDF.1CD66%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Cool!

We¹ll test this sometime soon and get back to you. We¹re kind of jammed
this week, hopefully we¹ll get some time.

So you know, we¹ve done some more work on the external sort to remove the
³tape² abstraction from the code, which makes a significant improvement.
This involved removing both the Knuth tapes, and the logtape.c codepath.
The result is a reasonable improvement in performance (tens of percent), and
a dramatic reduction in the amount of code.

Since we¹re looking for a 4-fold improvement based on comparisons to ³other
commercial databases², we feel we¹re not done yet. Our next step (before we
got jammed getting our latest MPP release out) was to implement these:
* Locate the cause for the excessive time in heap_getattr (you just did it)
* Implement something other than replacement selection for creating runs to
optimize cache use

- Luke

On 2/19/06 6:40 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> After applying Simon's recent sort patch, I was doing some profiling and
> noticed that sorting spends an unreasonably large fraction of its time
> extracting datums from tuples (heap_getattr or index_getattr). The
> attached patch does something about this by pulling out the leading sort
> column of a tuple when it is received by the sort code or re-read from a
> "tape". This increases the space needed by 8 or 12 bytes (depending on
> sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as
> the on-disk representation goes. The effort needed to extract the datum
> at this point is well repaid because the tuple will normally undergo
> multiple comparisons while it remains in memory. In some quick tests
> the patch seemed to make for a significant speedup, on the order of 30%,
> despite increasing the number of runs emitted because of the smaller
> available memory.
>
> The choice to pull out just the leading column, rather than all columns,
> is driven by concerns of (a) code complexity and (b) memory space.
> Having the extra columns pre-extracted wouldn't buy anything anyway
> in the common case where the leading key determines the result of
> a comparison.
>
> This is still WIP because it leaks memory intra-query (I need to fix it
> to clean up palloc'd space better). I thought I'd post it now in case
> anyone wants to try some measurements for their own favorite test cases.
> In particular it would be interesting to see what happens for a
> multi-column sort with lots of duplicated keys in the first column,
> which is the case where the least advantage would be gained.
>
> Comments?
>
> regards, tom lane
>
>

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2006-02-20 04:19:02 Re: WIP: further sorting speedup
Previous Message Neil Conway 2006-02-20 03:32:24 Re: ScanDirections