Re: fix cost subqueryscan wrong parallel cost

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fix cost subqueryscan wrong parallel cost
Date: 2022-05-02 21:24:51
Message-ID: 1884952.1651526691@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I am not sure whether this is actually correct, but it seems a lot
> more believable than the previous proposals. The problem might be more
> general, though. I think when I developed this parallel query stuff I
> modeled a lot of it on what you did for parameterized paths. Both
> parameterized paths and parallelism can create situations where
> executing a path to completion produces fewer rows than you would
> otherwise get. In the case of parameterized paths, this happens
> because we enforce the parameterization we've chosen on top of the
> user-supplied quals. In the case of parallelism, it happens because
> the rows are split up across the different workers. I think I intended
> that the "rows" field of RelOptInfo should be the row count for the
> relation in total, and that the "rows" field of the Path should be the
> number of rows we expect to get for one execution of the path. But it
> seems like this problem is good evidence that I didn't find all the
> places that need to be adjusted for parallelism, and I wouldn't be
> very surprised if there are a bunch of others that I overlooked.

I did look at the rest of costsize.c for similar instances, and didn't
find any. In any case, I think we have two options:

1. Apply this fix, and in future fix any other places that we identify
later.

2. Invent some entirely new scheme that we hope is less mistake-prone.

Option #2 is unlikely to lead to any near-term fix, and we certainly
wouldn't dare back-patch it.

> It's not actually very nice that we end up having to call
> clauselist_selectivity() here. We've already called
> set_baserel_size_estimates() to figure out how many rows we expect to
> have been filtered out by the quals, and it sucks to have to do it
> again. Brainstorming wildly and maybe stupidly, I wonder if the whole
> model is wrong here. Maybe a path shouldn't have a row count; instead,
> maybe it should have a multiplier that it applies to the relation's
> row count. Then, if X is parameterized in the same way as its subpath
> Y, we can just copy the multiplier up, but now it will be applied to
> the new rel's "rows" value, which will have already been adjusted
> appropriately by set_baserel_size_estimates().

I've wondered about that too, but it seems to depend on the assumption
that clauses are estimated independently by clauselist_selectivity, which
has not been true for a long time (and is getting less true not more so).
So we could possibly apply something like this for parallelism, but not
for parameterized paths, and that makes it less appealing ... IMO anyway.

I have thought it might be good to explicitly mark partial paths with the
estimated number of workers, which would be effectively the same thing
as what you're talking about. But I wonder if we'd not still be better off
keeping the path rowcount as being number-of-rows-in-each-worker, and
just scale it up by the multiplier for EXPLAIN output. (And then also
print the true total number of rows in EXPLAIN ANALYZE.) If we do the
inverse of that, then we risk bugs from failing to correct the rowcount
during cost-estimation calculations.

I kinda feel that the bottom line here is that cost estimation is
hard, and we're not going to find a magic bullet that removes bugs.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2022-05-02 21:48:24 Re: strange slow query - lost lot of time somewhere
Previous Message Robert Haas 2022-05-02 20:54:39 Re: fix cost subqueryscan wrong parallel cost