Quick Links

Re: WIP: further sorting speedup

From:	"Luke Lonergan" <llonergan(at)greenplum(dot)com>
To:	pgsql-patches(at)postgreSQL(dot)org, "Tim Kordas" <tkordas(at)greenplum(dot)com>
Subject:	Re: WIP: further sorting speedup
Date:	2006-02-20 03:47:43
Message-ID:	C01E7BDF.1CD66%llonergan@greenplum.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-patches

Cool!

We¹ll test this sometime soon and get back to you. We¹re kind of jammed
this week, hopefully we¹ll get some time.

So you know, we¹ve done some more work on the external sort to remove the
³tape² abstraction from the code, which makes a significant improvement.
This involved removing both the Knuth tapes, and the logtape.c codepath.
The result is a reasonable improvement in performance (tens of percent), and
a dramatic reduction in the amount of code.

Since we¹re looking for a 4-fold improvement based on comparisons to ³other
commercial databases², we feel we¹re not done yet. Our next step (before we
got jammed getting our latest MPP release out) was to implement these:
* Locate the cause for the excessive time in heap_getattr (you just did it)
* Implement something other than replacement selection for creating runs to
optimize cache use

- Luke

On 2/19/06 6:40 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> After applying Simon's recent sort patch, I was doing some profiling and
> noticed that sorting spends an unreasonably large fraction of its time
> extracting datums from tuples (heap_getattr or index_getattr). The
> attached patch does something about this by pulling out the leading sort
> column of a tuple when it is received by the sort code or re-read from a
> "tape". This increases the space needed by 8 or 12 bytes (depending on
> sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as
> the on-disk representation goes. The effort needed to extract the datum
> at this point is well repaid because the tuple will normally undergo
> multiple comparisons while it remains in memory. In some quick tests
> the patch seemed to make for a significant speedup, on the order of 30%,
> despite increasing the number of runs emitted because of the smaller
> available memory.
>
> The choice to pull out just the leading column, rather than all columns,
> is driven by concerns of (a) code complexity and (b) memory space.
> Having the extra columns pre-extracted wouldn't buy anything anyway
> in the common case where the leading key determines the result of
> a comparison.
>
> This is still WIP because it leaks memory intra-query (I need to fix it
> to clean up palloc'd space better). I thought I'd post it now in case
> anyone wants to try some measurements for their own favorite test cases.
> In particular it would be interesting to see what happens for a
> multi-column sort with lots of duplicated keys in the first column,
> which is the case where the least advantage would be gained.
>
> Comments?
>
> regards, tom lane
>
>

In response to

WIP: further sorting speedup at 2006-02-20 02:40:46 from Tom Lane

Responses

Re: WIP: further sorting speedup at 2006-02-20 04:19:02 from Tom Lane

Browse pgsql-patches by date

	From	Date	Subject
Next Message	Tom Lane	2006-02-20 04:19:02	Re: WIP: further sorting speedup
Previous Message	Neil Conway	2006-02-20 03:32:24	Re: ScanDirections