Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-26 06:30:50
Message-ID: CAH2-WznUWoxD71cGjvA1Oo_9xheG4A77HKVKPFd-Q8UOES30UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> At this point, my preferred solution is for someone to go implement
>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>> like the logical person for the job).
>>
>
> I can implement it and share a prototype patch with you which you can
> use to test parallel sort stuff.

That would be great. Thank you.

> I would like to highlight the
> difference which you will see with WaitForParallelWorkersToAttach as
> compare to WaitForParallelWorkersToFinish() is that the former will
> give you how many of nworkers_launched workers are actually launched
> whereas latter gives an error if any of the expected workers is not
> launched. I feel former is good and your proposed way of calling it
> after the leader is done with its work has alleviated the minor
> disadvantage of this API which is that we need for workers to startup.

I'm not sure that it makes much difference, though, since in the end
WaitForParallelWorkersToFinish() is called anyway, much like
nodeGather.c. Have I missed something?

I had imagined that WaitForParallelWorkersToAttach() would give me an
error in the style of WaitForParallelWorkersToFinish(), without
actually waiting for the parallel workers to finish.

> However, now I see that you and Thomas are trying to find a different
> way to overcome this problem differently, so not sure if I should go
> ahead or not. I have seen that you told you wanted to look at
> Thomas's proposed stuff carefully tomorrow, so I will wait for you
> guys to decide which way is appropriate.

I suspect that the overhead of Thomas' experimental approach is going
to causes problems in certain cases. Cases that are hard to foresee.
That patch makes HandleParallelMessages() set ParallelMessagePending
artificially, pending confirmation of having launched all workers.

It was an interesting experiment, but I think that your
WaitForParallelWorkersToAttach() idea has a better chance of working
out.

>> Once that's committed, I can
>> post a new version of the patch that uses that new infrastructure --
>> I'll add a call to the new function, without changing anything else.
>> Failing that, we could actually just use
>> WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
>> mostly because it complicates parallel_leader_participation=off,
>> something that Amit is in agreement with [2][3].
>>
>
> I think if we want we can use barrier API's to solve this problem, but
> I kind of have a feeling that it doesn't seem to be the most
> appropriate API, especially because existing API like
> WaitForParallelWorkersToFinish() can serve the need in a similar way.

I can't see a way in which using a barrier can have less complexity. I
think it will have quite a bit more, and I suspect that you share this
feeling.

> Just to conclude, following are proposed ways to solve this problem:
>
> 1. Implement a new API WaitForParallelWorkersToAttach and use that to
> solve this problem. Peter G. and Amit thinks, this is a good way to
> solve this problem.
> 2. Use existing API WaitForParallelWorkersToFinish to solve this
> problem. Peter G. feels that if API mentioned in #1 is not available,
> we can use this to solve the problem and I agree with that position.
> Thomas is not against it.
> 3. Use Thomas's new way to detect such failures. It is not clear to
> me at this stage if any one of us have accepted it to be the way to
> proceed, but Thomas and Peter G. want to investigate it further.
> 4. Use of Barrier API to solve this problem. Robert appears to be
> strongly in favor of this approach.

That's a good summary.

The next revision of the patch will make the
leader-participates-as-worker spool/Tuplelsortstate start and finish
sorting before the main leader spool/Tuplelsortstate is even started.
I did this with the intention of making it very clear that my approach
does not assume a number of participants up-front -- that is actually
something we only need a final answer on at the point that the leader
merges, which is logically the last possible moment.

Hopefully this will reassure Robert. It is quite a small change, but
leads to a slightly cleaner organization within nbtsort.c, since
_bt_begin_parallel() is the only point that has to deal with leader
participation. Another minor advantage is that this makes the
trace_sort overheads/duration for each of the two tuplesort within the
leader non-overlapping (when the leader participates as a worker).

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Moser 2018-01-26 06:55:11 Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
Previous Message Amit Kapila 2018-01-26 06:28:35 Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key