Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-01 07:46:43
Message-ID: CAH2-Wz=xA7ZWO7Qp7631w+V4EOQ0r3x5HMVZXbkGsqqHzYuLmQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> 2. All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing. My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer scalability comparable to other
systems' implementations of the first thing. Despite that, those
systems seem to have reused "partitioning to sort in parallel" for
B-Tree builds, at least in some cases. WAL logging is the biggest
serial bottleneck here for other systems, I've heard -- that's still
going to be pretty much serial.
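
The "sort locally, barrier, leader merges" flow of approach 2 can be
sketched in miniature like this (purely illustrative -- the patch
itself works in terms of tuplesort and shared temp files, and
worker_sort/leader_merge are made-up names, not anything in the tree):

```python
# Toy model of approach 2: each participant sorts its slice of the
# scan and spools a sorted run; after a barrier, the leader does a
# single k-way merge over all the runs.
import heapq

def worker_sort(chunk):
    # Each worker sorts the tuples from its part of the parallel scan.
    return sorted(chunk)

def leader_merge(runs):
    # The leader lazily k-way merges the sorted runs, analogous to
    # tuplesort's final on-the-fly merge.
    return list(heapq.merge(*runs))

table = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
runs = [worker_sort(table[i::3]) for i in range(3)]  # 3 "workers"
merged = leader_merge(runs)
assert merged == sorted(table)
```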

I think that the fact that some systems do partitioning for parallel
B-Tree builds might have as much to do with their ability to create
B-Tree indexes in place as anything else. Apparently, some systems
don't use temp files, instead writing out what is for all intents and
purposes part of a finished B-Tree as runs (no use of
temp_tablespaces). That may be a big part of what makes it worthwhile
to try to use partitioning. I understand that only the highest client
counts will see much direct performance benefit relative to the first
approach.

> It's pretty clear that approach 1 is a difficult project. From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes. I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals. I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load. Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together. As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem
to carry particular risks; users of other systems are cautioned against
it in their documentation, because you can end up with an unbalanced
B-Tree.
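
To make the disjoint-vs-overlapping distinction concrete, here is a toy
sketch (my illustration only, with a hypothetical combine() helper --
not PostgreSQL code): range-disjoint runs can simply be concatenated,
preserving existing structure, while overlapping runs force every
element through a full k-way merge, i.e. the equivalent of writing a
complete new tree.

```python
# Stand-in for combining per-worker sorted runs (think: per-worker
# B-Trees). Assumes runs are passed in order of their minimum key.
import heapq

def combine(runs):
    runs = [r for r in runs if r]
    disjoint = all(a[-1] <= b[0] for a, b in zip(runs, runs[1:]))
    if disjoint:
        # Range-partitioned case: plain concatenation ("stitching");
        # leaf-level structure could in principle be reused as-is.
        return [x for r in runs for x in r]
    # Overlapping case: no shortcut -- every element must pass
    # through a k-way merge.
    return list(heapq.merge(*runs))

assert combine([[1, 2], [3, 4]]) == [1, 2, 3, 4]   # disjoint runs
assert combine([[1, 3], [2, 4]]) == [1, 2, 3, 4]   # overlapping runs
```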

> So I'm in favour of this patch, which is relatively simple and gives
> us faster index builds soon. Eventually we might also be able to have
> approach 1. From what I gather, it's entirely possible that we might
> still need approach 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of approach 1,
which in any case seems much more compelling for parallel query, where
a great deal more can be pushed down, and we are not particularly
likely to be I/O bound (usually not much writing to the heap, or WAL
logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan. I need to get set up with a new machine here, having
given back my work laptop to Heroku, but it shouldn't take too long.

Thanks for the review.
--
Peter Geoghegan
