Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-09-24 13:03:07
Message-ID: CAM3SWZS6g0fM=W9rXof_BrVRUpyX0tFFcCBApYoxbW1ieYGDeg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most
complicated part of the patch, which overall is one of the most
ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use
more than one BufFile without regressing the ability to eagerly reuse
space, which is almost the entire reason for logtape.c existing. The
whole indirect block thing is an idea borrowed from the FS world, of
course, and so logtape.c needs one block-device-like BufFile, with
blocks that can be reclaimed eagerly, but consumed for recycling in
*contiguous* order (which is why they're sorted using qsort() within
ltsGetFreeBlock()). You're going to increase the amount of random I/O
by using more than one BufFile for an entire tapeset, I think.

This patch you posted
("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just
keeps one BufFile, and only changes the interface to expose the tapes
themselves to tuplesort.c, without actually making tuplesort.c do
anything with that capability. I see what you're getting at, I think,
but I don't see how that accomplishes all that much for parallel
CREATE INDEX. I mean, the special case of having multiple tapesets
from workers (not one "unified" tapeset created from worker temp files
from their tapesets to begin with) now needs special treatment.
Haven't you just moved the complexity around (once your patch is made
to care about parallelism)? Having multiple entire tapesets explicitly
from workers, with their own BufFiles, is not clearly less complicated
than managing ranges from BufFile fd.c files with delineated ranges of
"logical tapeset space". Seems almost equivalent, except that my way
doesn't bother tuplesort.c with any of this.

>> + * As a consequence of only being permitted to write to the leader
>> + * controlled range, parallel sorts that require a final
>> materialized tape
>> + * will use approximately twice the disk space for temp files
>> compared to
>> + * a more or less equivalent serial sort.

> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins.
I think that I could do better here (why should workers really care at
this point?), but just haven't bothered to.

This parallel sort implementation is something written with CREATE
INDEX and CLUSTER in mind only (maybe one or two other things, too). I
believe that for query execution, partitioning is the future [1]. With
merge joins, partitioning is desirable because it lets you push down
*everything* to workers, not just sorting (e.g., by aligning
partitioning boundaries on each side of each merge join sort in the
worker, and having the worker also "synchronize" each side of the
join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for
randomAccess tuplesort.c callers. No real world caller will ever end
up needing to do this. It just seems like a good idea to support
randomAccess when using this new infrastructure, on general principle.
Forcing myself to support that case during initial development
actually resulted in much cleaner, less invasive changes to
tuplesort.c in general.

[1] https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg(at)mail(dot)gmail(dot)com#CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-09-24 13:06:49 Re: Possibly too stringent Assert() in b-tree code
Previous Message Amit Kapila 2016-09-24 12:55:28 Re: raised checkpoint limit & manual checkpoint