Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-18 14:14:55
Message-ID: CAA4eK1JawFkqkP8xn1aWHTDzQLkAnsDFxVAXmbKCnOW1u4MhSA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 18, 2018 at 8:52 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>>> (It might make sense to allow this if parallel_leader_participation
>>> was *purely* a testing GUC, only for use by backend hackers, but
>>> AFAICT it isn't.)
>>
>> As applied to parallel CREATE INDEX, it pretty much is just a testing
>> GUC, which is why I was skeptical about leaving support for it in the
>> patch. There's no anticipated advantage to having the leader not
>> participate -- unlike for parallel queries, where it is quite possible
>> that setting parallel_leader_participation=off could be a win, even
>> generally. If you just have a Gather over a parallel sequential scan,
>> it is unlikely that parallel_leader_participation=off will help; it
>> will most likely hurt, at least up to the point where more
>> participants become a bad idea in general due to contention.
>
> It's unlikely to hurt much, since as you yourself said,
> compute_parallel_worker() doesn't consider the leader's participation.
> Actually, if we assume that compute_parallel_worker() is perfect, then
> surely parallel_leader_participation=off would beat
> parallel_leader_participation=on for CREATE INDEX -- it would allow us
> to use the value that compute_parallel_worker() truly intended. Which
> is the opposite of what you say about
> parallel_leader_participation=off above.
>
> I am only trying to understand your perspective here. I don't think
> that parallel_leader_participation support is that important. I think
> that parallel_leader_participation=off might be slightly useful as a
> way of discouraging parallel CREATE INDEX on smaller tables, just like
> it is for parallel sequential scan (though this hinges on specifically
> disallowing "degenerate parallel scan" cases). More often, it will
> make hardly any difference if parallel_leader_participation is on or
> off.
>
>> In other words, right now, parallel_leader_participation is not
>> strictly a testing GUC, but if we make CREATE INDEX respect it, then
>> we're pushing it towards being a GUC that you don't ever want to
>> enable except for testing. I'm still not sure that's a very good
>> idea, but if we're going to do it, then surely we should be
>> consistent.
>

I see your point. OTOH, I think we should have something for testing
purposes, as that helps in catching bugs and makes it easy to write
tests that cover the worker part of the code.

>
> I'm confused. I *don't* want it to be something that you can only use
> for testing. I want to not hurt whatever case there is for the
> parallel_leader_participation GUC being something that a DBA may tune
> in production. I don't see the conflict here.
>
>> It's true that having one worker and no parallel leader
>> participation can never be better than just having the leader do it,
>> but it is also true that having two leaders and no parallel leader
>> participation can never be better than having 1 worker with leader
>> participation. I don't see a reason to treat those cases differently.
>
> You must mean "having two workers and no parallel leader participation...".
>
> The reason to treat those two cases differently is simple: One
> couldn't possibly be desirable in production, and undermines the whole
> idea of parallel_leader_participation being user visible by adding a
> sharp edge. The other is likely to be pretty harmless, especially
> because leader participation is generally pretty fudged, and our cost
> model is fairly rough. The difference here isn't what is important;
> avoiding doing something that we know couldn't possibly help under any
> circumstances is important. I think that we should do that on general
> principle.
>
> As I said in a prior e-mail, even parallel query's use of
> parallel_leader_participation is consistent with what I propose here,
> practically speaking, because a partial path without leader
> participation will always lose to a serial sequential scan path in
> practice. The fact that the optimizer will create a partial path that
> makes a useless "degenerate parallel scan" a *theoretical* possibility
> is irrelevant, because the optimizer has its own way of making sure
> that such a plan doesn't actually get picked. It has its way, and so I
> must have my own.
>

Can you please elaborate on which part of the optimizer you are
talking about, where a partial path without leader participation will
always lose to a serial sequential scan path?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
