Re: Cost model for parallel CREATE INDEX

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cost model for parallel CREATE INDEX
Date: 2017-03-09 01:33:09
Message-ID: CA+TgmoaTwOUtyOM9GLG4LpANY2qxzy9NZXRKbP0bB26jRgyW4w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 5, 2017 at 7:14 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Sat, Mar 4, 2017 at 2:15 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> So, I agree with Robert that we should actually use heap size for the
>> main, initial determination of # of workers to use, but we still need
>> to estimate the size of the final index [1], to let the cost model cap
>> the initial determination when maintenance_work_mem is just too low.
>> (This cap will rarely be applied in practice, as I said.)
>>
>> [1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#bt_estimated_nblocks.28.29_function_in_pageinspect
>
> Having looked at it some more, this no longer seems worthwhile. In the
> next revision, I will add a backstop that limits the use of
> parallelism based on a lack of maintenance_work_mem in a simpler
> manner. Namely, the worker will have to be left with a
> maintenance_work_mem/nworkers share of no less than 32MB in order for
> parallel CREATE INDEX to proceed. There doesn't seem to be any great
> reason to bring the volume of data to be sorted into it.

+1.

> I expect the cost model to be significantly simplified in the next
> revision in other ways, too. There will be no new index storage
> parameter, nor a disable_parallelddl GUC. compute_parallel_worker()
> will be called in a fairly straightforward way within
> plan_create_index_workers(), using heap blocks, as agreed to already.

+1.

> pg_restore will avoid parallelism (that will happen by setting
> "max_parallel_workers_maintenance = 0" when it runs), not because it
> cannot trust the cost model, but because it prefers to parallelize
> things its own way (with multiple restore jobs), and because execution
> speed may not be the top priority for pg_restore, unlike a live
> production system.

This part I'm not sure about. I think people care quite a lot about
pg_restore speed, because their systems are often down while they're
running it. And they may have oodles more CPUs than parallel restore
can use without help from parallel query. I would be inclined to leave
pg_restore alone and let the chips fall where they may.
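Anyone who does want the restore jobs to keep all the CPUs to
themselves could always turn parallel index builds off per-session,
e.g. with the GUC name you proposed upthread (illustrative only, not a
committed setting):

```shell
# Hypothetical: disable parallel CREATE INDEX for this restore only,
# while still using 8 parallel restore jobs.
PGOPTIONS='-c max_parallel_workers_maintenance=0' \
    pg_restore -j 8 -d mydb mydb.dump
```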

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
