Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-01 07:23:45
Message-ID: CAEepm=1SzJ-wWgPMQuDN2tLA_c74FM5GCy3_OWpq_XjQqN35Tg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing. First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+ bool serializable_okay)
{
MemoryContext oldcontext;
ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
* workers, at least not until somebody enhances that mechanism to be
* parallel-aware.
*/
- if (IsolationIsSerializable())
+ if (IsolationIsSerializable() && !serializable_okay)
nworkers = 0;

That's a bit weird, but I can't think of a problem with it. Workers
run with MySerializableXact == InvalidSerializableXact, even though
they may have the snapshot of a SERIALIZABLE leader. Hopefully the
restriction on SERIALIZABLE in parallel queries can be lifted soon
anyway, and then this flag could be removed.
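
To make the flag's effect concrete, here's a tiny standalone mock (not
backend code; IsolationIsSerializable() is stubbed to always return
true, and plan_workers() just mirrors the guard quoted above) showing
how the same worker request is zeroed for a plain parallel query but
allowed through for a caller that passes serializable_okay = true:

#include <stdbool.h>
#include <stdio.h>

/* Stub: pretend we are always inside a SERIALIZABLE transaction. */
static bool
IsolationIsSerializable(void)
{
    return true;
}

/* Mirrors the guard in CreateParallelContext() after the patch. */
static int
plan_workers(int nworkers, bool serializable_okay)
{
    if (IsolationIsSerializable() && !serializable_okay)
        nworkers = 0;
    return nworkers;
}

int
main(void)
{
    printf("parallel query:        %d workers\n", plan_workers(4, false));
    printf("parallel CREATE INDEX: %d workers\n", plan_workers(4, true));
    return 0;
}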

Here are some thoughts on the overall approach. Disclaimer: I
haven't researched the state of the art in parallel sort or btree
builds. But I gather from general reading that there are a couple of
well-known approaches, and I'm sure you'll correct me if I'm off base
here.

1. All participants: parallel sequential scan, repartition on the fly
so each worker has tuples in a non-overlapping range, sort, build
disjoint btrees; barrier; leader: merge disjoint btrees into one.

2. All participants: parallel sequential scan, sort, spool to disk;
barrier; leader: merge spooled tuples and build btree.

This patch takes approach 2. My understanding is that some systems
choose that approach when they don't have, or don't trust, the table's
statistics, since repartitioning for balanced load requires carefully
chosen ranges and is highly sensitive to skew in the data
distribution.
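
For illustration only, here is a toy, self-contained C sketch (plain
pthreads, qsort and in-memory arrays; nothing from tuplesort.c or the
patch) of the shape of approach 2: each worker sorts its own run,
everyone waits at a barrier (pthread_join here), and a single leader
then merges the sorted runs, which is where the btree build would
happen:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define PER_WORKER 8

typedef struct
{
    int        *run;
    int         len;
} Run;

static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Worker phase: sort this participant's own run. */
static void *
worker_sort(void *arg)
{
    Run        *r = (Run *) arg;

    qsort(r->run, r->len, sizeof(int), cmp_int);
    return NULL;
}

int
main(void)
{
    int         data[NWORKERS * PER_WORKER];
    Run         runs[NWORKERS];
    pthread_t   threads[NWORKERS];
    int         pos[NWORKERS] = {0};

    srand(42);
    for (int i = 0; i < NWORKERS * PER_WORKER; i++)
        data[i] = rand() % 1000;

    /* "Parallel scan + sort": every participant sorts its own chunk. */
    for (int w = 0; w < NWORKERS; w++)
    {
        runs[w].run = data + w * PER_WORKER;
        runs[w].len = PER_WORKER;
        pthread_create(&threads[w], NULL, worker_sort, &runs[w]);
    }

    /* Barrier: wait until every run is ready. */
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(threads[w], NULL);

    /* Leader: naive k-way merge of the sorted runs. */
    for (int produced = 0; produced < NWORKERS * PER_WORKER; produced++)
    {
        int         best = -1;

        for (int w = 0; w < NWORKERS; w++)
            if (pos[w] < runs[w].len &&
                (best < 0 || runs[w].run[pos[w]] < runs[best].run[pos[best]]))
                best = w;
        printf("%d ", runs[best].run[pos[best]++]);
    }
    putchar('\n');
    return 0;
}

(Compile with -pthread. In the real patch the runs live in shared
temporary files rather than arrays, and the leader feeds the merge
output into the btree build instead of printing it.)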

It's pretty clear that approach 1 is a difficult project. From my
research into dynamic repartitioning in the context of hash joins, I
can see that that infrastructure is a significant project in its own
right: subproblems include super efficient tuple exchange, buffering,
statistics/planning and dealing with/adapting to bad outcomes. I also
suspect that repartitioning operators might need to be specialised for
different purposes like sorting vs hash joins, which may have
differing goals. I think it's probably easy to build a slow dynamic
repartitioning mechanism that frequently results in terrible worst-case
scenarios where you pay a fortune in IPC overhead and still finish up
with one worker pulling most of the load. Without
range partitioning, I don't believe you can merge the resulting
non-disjoint btrees efficiently so you'd probably finish up writing a
complete new btree to mash them together. As for merging disjoint
btrees, I assume there are ways to do a structure-preserving merge
that just rebuilds some internal pages and incorporates the existing
leaf pages directly, a bit like tree manipulation in functional
programming languages; that'll take some doing.

So I'm in favour of this patch, which is relatively simple and gives
us faster index builds soon. Eventually we might also be able to have
approach 1. From what I gather, it's entirely possible that we might
still need approach 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

--
Thomas Munro
http://www.enterprisedb.com
