Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-12-21 12:04:07
Message-ID: 6f918bd0-2a8c-c8c0-664f-78b25135c033@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single
LogicalTapeSet are allocated from the same BufFile. The logical tape's
on-disk format contains block numbers, to point to the next/prev block
of the tape [1], and they're assumed to refer to the same file. That
allows reusing space efficiently during the merge. After you have read
the first block from tapes A, B and C, you can immediately reuse those
three blocks for output tape D.

Now, if you read multiple tapes, from different LogicalTapeSet, hence
backed by different BufFiles, you cannot reuse the space from those
different tapes for a single output tape, because the on-disk format
doesn't allow referring to blocks in other files. You could reuse the
space of *one* of the input tapes, by placing the output tape in the
same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of
just block number, in the pointers in the logical tapes. Then you could
spread one logical tape across multiple files. Probably not worth it in
practice, though.

[1] As the code stands, there are no next/prev pointers, but a tree of
"indirect" blocks. But I'm planning to change that to simpler next/prev
pointers, in
https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-12-21 12:09:56 Re: Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
Previous Message Joel Jacobson 2016-12-21 11:53:17 Re: SET NOT NULL [NOT VALID / CONCURRENTLY]?