Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeremy Harris <jgh(at)wizmail(dot)org>
Subject: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date: 2015-07-30 10:47:46
Message-ID: CANP8+jLbSnxLT+ab90Hyj6gYDu45bZfZ4j33XEEnUM7BZpw7VA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30 July 2015 at 08:00, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> Hmm. You don't really need to merge the in-memory array into the tape, as
> you know that all the tuples in the in-memory must come after the tuples
> already on the tape. You can just return all the tuples from the tape
> first, and then all the tuples from the array.
>

Agreed

This is a good optimization for the common case where tuples are mostly
already in order. We could increase the usefulness of this by making UPDATE
pick blocks that are close to a tuple's original block, rather than putting
them near the end of a relation.

> So here's a shorter/different explanation of this optimization: When it's
> time to perform the sort, instead of draining the in-memory heap one tuple
> at a time to the last tape, you sort the heap with quicksort, and pretend
> that the sorted heap belongs to the last tape, after all the other tuples
> in the tape.
>
> Some questions/thoughts on that:
>
> Isn't that optimization applicable even when you have multiple runs?
> Quicksorting the heap and keeping it as an array in memory is surely always
> faster than heapsorting and pushing it to the tape.
>

It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-07-30 11:09:54 Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Previous Message Heikki Linnakangas 2015-07-30 10:44:37 Re: pg_dump quietly ignore missing tables - is it bug?