| From: | Greg Nancarrow <gregn4422(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Parallel INSERT (INTO ... SELECT ...) |
| Date: | 2020-10-09 10:57:38 |
| Message-ID: | CAJcOf-df_+NxtNJYeGmHSm6nQNt-ecD=PfCzitMrb2WGt4+_bw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
> >
> > Speaking of costing, I'm not sure I really agree with the current
> > costing of a Gather node. Just considering a simple Parallel SeqScan
> > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > Gather cost always completely drowns out any other path costs when a
> > large number of rows are involved (at least with default
> > parallel-related GUC values), such that Parallel SeqScan would never
> > be the cheapest path. This linear relationship between cost and row
> > count, scaled by parallel_tuple_cost, doesn't make sense to me. Surely
> > after a certain number of rows, the overhead of launching workers will
> > be outweighed by the benefit of their parallel work, such that the
> > more rows there are, the more a Parallel SeqScan should benefit.
> >
>
> That will be true for the number of rows/pages we need to scan, not for
> the number of tuples we need to return as a result. The formula here
> considers the number of rows the parallel scan will return: the more
> rows each parallel node needs to pass via shared memory to the Gather
> node, the more costly it will be.
>
> We do consider the total pages we need to scan in
> compute_parallel_worker() where we use a logarithmic formula to
> determine the number of workers.
>
Despite all the best intentions, the current costing seems geared
towards selecting a non-parallel plan over a parallel plan as the
number of rows in the table grows. Yet the performance of a parallel
plan appears to improve, relative to the non-parallel plan, the more
rows there are in the table.
This doesn't seem right to me. Is there a rationale behind this costing model?
I have pointed out the part of the calculation, parallel_tuple_cost *
rows, that seems to drown out all other costs (making the total cost
huge) as the row count grows.
Regards,
Greg Nancarrow
Fujitsu Australia