| From: | Greg Nancarrow <gregn4422(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Parallel INSERT (INTO ... SELECT ...) |
| Date: | 2020-10-09 10:57:38 |
| Message-ID: | CAJcOf-df_+NxtNJYeGmHSm6nQNt-ecD=PfCzitMrb2WGt4+_bw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
> >
> > Speaking of costing, I'm not sure I really agree with the current
> > costing of a Gather node. Just considering a simple Parallel SeqScan
> > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > Gather cost always completely drowns out any other path costs when a
> > large number of rows are involved (at least with default
> > parallel-related GUC values), such that Parallel SeqScan would never
> > be the cheapest path. This linear relationship between cost and row
> > count, scaled by parallel_tuple_cost, doesn't make sense to me. Surely
> > after a certain number of rows, the overhead of launching workers will
> > be outweighed by the benefit of their parallel work, such that the
> > more rows there are, the more a Parallel SeqScan should benefit.
> >
>
> That will be true for the number of rows/pages we need to scan, not for
> the number of tuples we need to return as a result. The formula here
> considers the number of rows the parallel scan will return: the more
> rows each parallel node needs to pass via shared memory to the Gather
> node, the more costly it will be.
>
> We do consider the total pages we need to scan in
> compute_parallel_worker() where we use a logarithmic formula to
> determine the number of workers.
>
Despite all the best intentions, the current costing seems geared
towards selecting a non-parallel plan over a parallel plan as the
number of rows in the table grows. Yet the performance of a parallel
plan appears to improve, relative to the non-parallel plan, the more
rows there are in the table.
This doesn't seem right to me. Is there a rationale behind this costing model?
I have pointed out the part of the calculation, parallel_tuple_cost *
rows, that seems to drown out all other costs (making the total cost
huge) as the row count grows.
Regards,
Greg Nancarrow
Fujitsu Australia