Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-12-21 14:00:21
Message-ID: CA+TgmoZAKjjkTk=N=opk1-yr7ySzbzKTz333Oj8LY70Neiqywg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see. Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSet, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you
mention, to support filename (or more likely a proxy for filename) +
block number in the logical tape pointers. Then, tapes can be
transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a
shared BufFile so that multiple processes can be reading and writing
blocks in the same BufFile. Then, extend the logical tape
infrastruture so that we also have the notion of a shared LogicalTape.
This means that things like ltsGetFreeBlock() need to be re-engineered
to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2). Also, (2) might provoke
contention while writing tapes that is otherwise completely
unnecessary. It seems silly to have multiple backends fighting over
the same end-of-file pointer for the same file when they could just
write to different files instead.

Another tangentially-related problem I just realized is that we need
to somehow handle the issues that tqueue.c does when transferring
tuples between backends -- most of the time there's no problem, but if
anonymous record types are involved then tuples require "remapping".
It's probably harder to provoke a failure in the tuplesort case than
with parallel query per se, but it's probably not impossible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2016-12-21 14:04:01 Re: pgstattuple documentation clarification
Previous Message Ashutosh Sharma 2016-12-21 13:52:05 Add pgstathashindex() to get hash index table statistics.