Re: Assertion failure with LEFT JOINs among >500 relations

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Onder Kalaci <onderk(at)microsoft(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure with LEFT JOINs among >500 relations
Date: 2020-10-13 09:10:15
Message-ID: CAApHDvowKW=Q7PTTNvHEeACfqttCkaY0QQt_cK0jpm+UMLnM+A@mail.gmail.com
Lists: pgsql-hackers

On Sat, 10 Oct 2020 at 02:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm fairly certain that every one of the existing NaN checks was put
> there on the basis of hard experience. Possibly digging in the git
> history would offer more info about exactly where the NaNs came from.

I had a look at this and found there have been quite a number of fixes
which added either isnan checks or <= 0 checks.

Namely:

-----------
commit 72826fb362c4aada6d2431df0b706df448806c02
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Fri Apr 15 17:45:41 2011 -0400

Guard against incoming rowcount estimate of NaN in cost_mergejoin().

Although rowcount estimates really ought not be NaN, a bug elsewhere
could perhaps result in that, and that would cause Assert failure in
cost_mergejoin, which I believe to be the explanation for bug #5977 from
Anton Kuznetsov. Seems like a good idea to expend a couple more cycles
to prevent that, even though the real bug is elsewhere. Not back-patching,
though, because we don't encourage running production systems with
Asserts on.

The discussion for that is in
https://www.postgresql.org/message-id/flat/4602.1302705756%40sss.pgh.pa.us#69dd8c334aa714cfac4e0d9b04c5201c

commit 76281aa9647e6a5dfc646514554d0f519e3b8a58
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Sat Mar 26 12:03:12 2016 -0400

Avoid a couple of zero-divide scenarios in the planner.

commit fd791e7b5a1bf53131ad15e68e4d4f8ca795fcb4
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Mon Mar 24 21:53:04 2008 +0000

When a relation has been proven empty by constraint exclusion, propagate that
knowledge up through any joins it participates in. We were doing that already
in some special cases but not in the general case. Also, defend against zero
row estimates for the input relations in cost_mergejoin --- this fix may have
eliminated the only scenario in which that can happen, but be safe. Per
report from Alex Solovey.

That was reported in
https://www.postgresql.org/message-id/flat/BLU136-DAV79FF310AC13FFC96FA2FDAEFD0%40phx.gbl#4cde17b2369fc7e0da83cc7d4aeeaa48

The problem was that an Append with no subpaths could have a 0 row estimate.
-----------

Because there have been quite a few of these, and this report is yet
another one, I wonder if it's time to try to stamp these out at the
source rather than where the row counts are being used.

I toyed around with the attached patch, but I'm still not that excited
about clamping infinite values to DBL_MAX. The test case I showed above
with generate_series(1,379) still ends up with NaN cost estimates due to
costing a sort with DBL_MAX rows. When I was writing the patch, I had it
in my head that the cost per row would always be lower than 1, and that
because of that, even if the row count got dangerously close to DBL_MAX,
the costs would never be higher than the row count... Turns out, I was
wrong about that: sorting a number of rows anywhere close to DBL_MAX
would clearly be beyond astronomically expensive, and the costs go
infinite.
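
Just to make that concrete, here's a tiny standalone program (not
planner code; the 2 * cpu_operator_cost * N * log2(N) term is only my
rough approximation of the comparison charge in cost_sort()) showing how
a row estimate clamped to DBL_MAX still lets NaN appear as soon as
ordinary cost arithmetic gets hold of the resulting infinity:

#include <stdio.h>
#include <math.h>
#include <float.h>

int
main(void)
{
    double  cpu_operator_cost = 0.0025; /* the GUC's default value */
    double  tuples = DBL_MAX;           /* row estimate clamped to DBL_MAX */

    /* roughly the comparison charge in cost_sort() */
    double  sort_cost = 2.0 * cpu_operator_cost * tuples * log2(tuples);

    printf("sort cost for DBL_MAX rows: %g\n", sort_cost);  /* inf */
    printf("inf - inf = %g\n", sort_cost - sort_cost);       /* nan */
    printf("0 * inf = %g\n", 0.0 * sort_cost);               /* nan */

    return 0;
}

Exactly where the first NaN shows up in the planner differs from case to
case, but inf - inf and 0 * inf are the usual ways to get one.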

The fd791e7b5 fix was for a subpath-less Append node having a 0-row
estimate and causing problems in the costing of merge joins. In the
patch, I thought it would be better to fix this by insisting that an
Append always has at least 1 row. That means even a dummy path will have
1 row, which will become a const-false Result in the plan. I've had to
add a special case to set the plan_rows back to 0 so that EXPLAIN shows
0 rows as it did before. That's not exactly pretty, but I still feel
there is merit in insisting we never have 0-row paths, to get away from
these types of bugs once and for all.

The patch does fix the failing Assert. However, something along these
lines seems more suitable for master only. The back branches should
perhaps just get the more localised isinf() check and clamp to DBL_MAX
that I mentioned earlier in this thread.
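
Roughly the shape I have in mind for that is below. This is only a
sketch, not the actual change, and the function name is made up; the
real fix would just be a local isinf() check and clamp at the relevant
spot:

#include <math.h>
#include <float.h>

/* Sketch only: clamp a row estimate before it poisons cost arithmetic. */
static inline double
clamp_row_est_finite(double nrows)
{
    if (isnan(nrows))
        return 1.0;         /* no usable estimate; assume a single row */
    if (isinf(nrows))
        return DBL_MAX;     /* keep downstream cost arithmetic finite */
    return nrows;
}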

I've searched through the code to see if there are other possible cases
where paths may be generated with a 0-row count. I imagine anything that
has a qual and performs a selectivity estimate will already have a
clamp_row_est(), since we'd see fractional row counts if it didn't. That
leaves me with Append / Merge Append and each join type, plus
aggregates. Currently, it seems we will never generate a Merge Append
without any sub-paths. I wondered if I should just Assert that's the
case in create_merge_append_path(), but I ended up just adding a
clamp_row_est() call instead. calc_joinrel_size_estimate() seems to
handle all join path row estimates, and that uses clamp_row_est().
Aggregate paths can reduce the number of rows, but I think all the row
estimates from those go through estimate_num_groups(), which appears
never to be able to return 0.
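
For reference, clamp_row_est() in costsize.c does roughly the following,
which is why anything whose estimate passes through it can't come out as
zero:

#include <math.h>

/* Roughly clamp_row_est() from costsize.c: force the estimate to be at
 * least one row and round it to a whole number of rows. */
double
clamp_row_est(double nrows)
{
    if (nrows <= 1.0)
        nrows = 1.0;
    else
        nrows = rint(nrows);

    return nrows;
}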

David

Attachment Content-Type Size
fix_infinite_row_estimates.patch text/plain 5.9 KB
