Re: Regression in join selectivity estimations when using foreign keys

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regression in join selectivity estimations when using foreign keys
Date: 2017-05-22 04:10:20
Message-ID: CAKJS1f9QGqqm8BX6WceKB4RA3ADVH1N6rU83GkSRZJcv+9B-yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21 May 2017 at 07:56, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm entirely unconvinced by this patch --- it seems to simply be throwing
> away a lot of logic. Notably it lobotomizes the FK code altogether for
> semi/antijoin cases, but you've not shown any example that even involves
> such joins, so what's the argument for doing that? Also, the reason
> we had the use_smallest_selectivity code in the first place was that we
> didn't believe the FK-based selectivity could be applied safely to
> outer-join cases, so simply deciding that it's OK to apply it after all
> seems insufficiently justified.

Originally I looked at just multiplying the selectivities in
use_smallest_selectivity, but on further thought, I didn't manage to
see why using foreign keys to assist with selectivity estimations when
the ref_is_outer is true. We still have a very high probability that
the outer rel contains a tuple to match each inner rel's tuples
(because inner references outer). The only difference between OUTER
and INNER typed joins is that OUTER will produce a bunch of NULL rows
in place of non-matched inner rows. That part seems to be handled in
calc_joinrel_size_estimate() by code such as:

if (nrows < outer_rows)
nrows = outer_rows;

We could do much better than what we have today by also adding
outer_rows - (n_distinct inner rows on referencing Vars), to also
account for those unmatched rows. However, I'd say that's not for
back-patching, although it may be especially good now that we have
multi-variate ndistinct in PG10.

> Or in short, exhibiting one counterexample to the existing code is not
> a sufficient argument for changing things this much. You need to give
> an argument why this is the right thing to do instead.
>
> Stepping back a bit, it seems like the core thing the FK selectivity code
> was meant to do was to prevent us from underestimating selectivity of
> multiple-clause join conditions through a false assumption of clause
> independence. The problem with the use_smallest_selectivity code is that
> it's overestimating selectivity, but that doesn't mean that we want to go
> all the way back to the old way of doing things. I wonder if we could get
> any traction in these dubious cases by computing the product, instead of
> minimum, of the clause selectivities (thus matching the old estimate) and
> then taking the greater of that and the FK-based selectivity.

My concern with going back to selectivity multiplication was exactly
because I didn't want to go back to the original undesired behaviour
of underestimation when conditions are co-related. I'm unsure why
taking the greater is any better than just using the foreign key
selectivity. It seems senseless to use clause based selectivities at
all when we've proved the foreign key exists -- there will be no
unmatched inner tuples.

The reason I dropped the non-singleton join for SEMI and ANTI is that
I couldn't see how we could make any improvements over the original
join estimation code for this case. Perhaps I'm assuming too much
about how get_foreign_key_join_selectivity() is used by doing this?
I'm assuming that leaving the clause list intact and referring the
whole case to calc_joinrel_size_estimate() is okay to do.

The reason I added the nullfrac code in 0002 was because I was
concerned with regressing OUTER join cases where the nulfrac works due
to using the clause_selectivity(). Although this case regressed
between 9.5 and 9.6 for INNER joins with a non-zero nullfrac on the
join condition.

In short; if we replace the use_smallest_selectivity code with fkselec
*= clauselist_selectivity(root, removedlist, 0, jointype, sjinfo);
then I'm worried about regressing back to the old underestimations.
The extended stats code in PG10's version of clauselist_selectivity()
will ignore these join conditions, so nothing can be done to assist
the underestmations by adding extended stats yet.

I also just noticed that I don't think I've got ANTI join cases
correct in the patch I sent. I'll look at that now.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2017-05-22 04:12:14 Re: asynchronous execution
Previous Message Chapman Flack 2017-05-22 03:01:57 Re: PG10 Crash-safe and replicable Hash Indexes and UNIQUE