Re: Re: fix cost subqueryscan wrong parallel cost

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "bucoo(at)sohu(dot)com" <bucoo(at)sohu(dot)com>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: fix cost subqueryscan wrong parallel cost
Date: 2022-04-18 13:44:49
Message-ID: CA+TgmobxhwpZJDQgbtQOGe8+DfDnSsmf7JqPmtbitTK1vnRdpQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 15, 2022 at 6:06 AM bucoo(at)sohu(dot)com <bucoo(at)sohu(dot)com> wrote:
> > Generally it should be. But there's no subquery scan visible here.
> I wrote a patch for distinct/union and aggregate support last year (I want to restart it again).
> https://www.postgresql.org/message-id/2021091517250848215321%40sohu.com
> Without this patch, some parallel paths will never be selected.

Sure, but that doesn't make the patch correct. The patch proposes
that, when parallelism is in use, a subquery scan will produce fewer rows
than when parallelism is not in use, and that's 100% false. Compare
this with the case of a parallel sequential scan. If a table contains
1000 rows, and we scan it with a regular Seq Scan, the Seq Scan will
return 1000 rows. But if we scan it with a Parallel Seq Scan using
say 4 workers, the number of rows returned in each worker will be
substantially less than 1000, because 1000 is now the *total* number
of rows to be returned across *all* processes, and what we need is the
number of rows returned in *each* process.
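
To make that division concrete, here is a minimal standalone sketch of
the per-worker arithmetic, modeled loosely on get_parallel_divisor() in
costsize.c; the helper name rows_per_worker and its exact shape are
illustrative rather than the actual planner code:

    /*
     * Illustrative only: modeled loosely on get_parallel_divisor() in
     * src/backend/optimizer/path/costsize.c.  The function name and
     * signature are made up for this example, and the GUC checks of
     * the real code are omitted.
     */
    static double
    rows_per_worker(double total_rows, int parallel_workers)
    {
        double divisor = parallel_workers;
        double leader_contribution = 1.0 - (0.3 * parallel_workers);

        /* The leader chips in too, but less the more workers it tends. */
        if (leader_contribution > 0)
            divisor += leader_contribution;

        return total_rows / divisor;
    }

With 4 workers the leader contribution drops out, the divisor is just 4,
and each process is expected to return roughly 250 of the 1000 rows.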

The same thing isn't true for a subquery scan. Consider:

Gather
  ->  Subquery Scan
        ->  Parallel Seq Scan

One thing is for sure: the number of rows that will be produced by the
subquery scan in each backend is exactly equal to the number of rows
that the subquery scan receives from its subpath. Parallel Seq Scan
can't just return a row count estimate based on the number of rows in
the table, because those rows are going to be divided among the
workers. But the Subquery Scan doesn't do anything like that. If it
receives, let's say, 250 rows as input in each worker, it's going to
produce 250 output rows in each worker. Your patch says it's going to
produce fewer than that, and that's wrong, regardless of whether it
gives you the plan you want in this particular case.
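
To put that accounting in code, here is a purely illustrative sketch
(the helper name is made up; it is not the real cost_subqueryscan()):

    /*
     * Illustrative only: a hypothetical helper, not actual planner code.
     * subpath_rows is already a per-worker estimate, because the
     * Parallel Seq Scan underneath has already divided the table's
     * rows among the workers.
     */
    static double
    subquery_scan_rows_per_worker(double subpath_rows)
    {
        /* One output row per input row: 250 in per worker, 250 out. */
        return subpath_rows;

        /*
         * The objection above, in code form: claiming fewer output
         * rows than input rows per worker, e.g. by dividing by a
         * parallel divisor again, double-counts a split that has
         * already happened below the subquery scan.
         *
         *     return subpath_rows / parallel_divisor;
         */
    }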

--
Robert Haas
EDB: http://www.enterprisedb.com
