Re: Use extended statistics to estimate (Var op Var) clauses

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2022-07-22 13:17:47
Message-ID: CAEZATCWjQcp_xNaaEKjnBJLMKcHzjnHpy83QxyYSak2zcE63yA@mail.gmail.com
Lists: pgsql-hackers

On Thu, 21 Jul 2022 at 12:42, Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> > This needs to account for nullfrac, since x = x is only true if x is not null.
>
> Right, I forgot to account for nullfrac.
>

Ditto for the Var <= Var case.
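
For example (made-up table, just to illustrate the nullfrac effect),
with half the values NULL, "x = x" only matches the non-null rows:

CREATE TABLE t (x int);
INSERT INTO t SELECT CASE WHEN i % 2 = 0 THEN i END FROM generate_series(1,1000) g(i);
ANALYSE t;
EXPLAIN ANALYSE SELECT * FROM t WHERE x = x;

which matches 500 rows, not 1000, so the estimate needs to be scaled
by (1 - nullfrac).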

> > I don't like how matching_restriction_variables() is adding a
> > non-trivial amount of overhead to each of these selectivity functions,
> > by calling examine_variable() for each side, duplicating what
> > get_restriction_variable() does later on.
>
> But matching_restriction_variables() does not use examine_variable()
> anymore. It did, originally, but 0002 removed all of that. Now it does
> just pull_varnos() and that's it. I admit leaving those bits unsquashed
> was a bit confusing, but the first part was really 0001+0002+0003.
>

Ah, right, I only looked at 0001, because I thought that was the main
part that I hadn't previously looked at.

So my previous concern with matching_restriction_variables() was how
many extra cycles it was adding to test for what should be a pretty
uncommon case. Looking at the new version, I think it's a lot better,
but perhaps it would be more efficient to call equal() as soon as
you've extracted "left" and "right", so it can bail out earlier and
faster when they're not equal. I think a call to equal() should be
very fast compared to pull_varnos() and contain_volatile_functions().

> Attached is a rebased and somewhat cleaned up version of the patch
> series, addressing the review comments so far and squashing the bits I
> previously kept separate to showcase the changes.
>
> I've also added a bunch of regression tests - queries with (Var op Var)
> clauses of varying complexity, to demonstrate the effect of each patch.
> I added them as 0001, so it's clear how the individual patches affect
> the results.
>

Cool. That's much clearer, and the results look quite good.

The main thing that jumps out at me now is the whole "issimple"
processing stuff. Those changes turned out to be a lot more invasive
than I thought. I don't think this part is correct:

/*
 * All AND/OR clauses are considered complex, even if all arguments are
 * simple clauses. For NOT clauses we need to check the argument and then
 * we can update the flag.
 *
 * XXX Maybe for AND/OR we should check if all arguments reference the
 * same attnum, and consider them complex only when there are multiple
 * attnum values (i.e. different Vars)?
 */

I think the XXX part of that comment is right, and that's what the
original code did.

I had to go remind myself what "simple" was intended for, so apologies
if this is really pedestrian:

The basic idea was that a "simple" clause is any clause that's likely
to be better estimated by standard statistics than by extended
statistics (and "issimple" only applies when such clauses are combined
using "OR"). So, for example, given a query with

WHERE a = 1 OR (b > 0 AND b < 10)

both "a = 1" and "b > 0 AND b < 10" should be considered "simple",
since the standard statistics code is likely to estimate them quite
well. For the second clause, it might use histograms and the
RangeQueryClause-machinery, which ought to work well, whereas applying
a multivariate MCV list to "b > 0 AND b < 10" in isolation would
probably make its estimate worse.

So then, in this example, what the "OR" handling in
statext_mcv_clauselist_selectivity() would end up doing is:

P(a = 1 OR (b > 0 AND b < 10))
= P(a = 1)
+ P(b > 0 AND b < 10)
- P(a = 1 AND b > 0 AND b < 10) # Overlap

and only use extended statistics for the overlap term, since the other
2 clauses are "simple", and best estimated without extended stats. The
patch changes that, which ends up making things worse in some cases.

Was the only reason for changing the "issimple" handling that the
standard statistics code didn't work well for things like "a < a", and
so we wanted to treat that as not-simple? If so, given the selfuncs
improvements, perhaps that's no longer necessary, and the original
definition of "simple" is OK. IOW, is 0004 necessary, given 0002?

I notice that 0004 didn't change any test results, so if there are
cases where it improves things, they aren't tested.

Here's a simple example using the WHERE clause above. Without the
patch, extended stats improved the estimate, but with the patch they
don't anymore:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);

Estimated: 18
Actual: 9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);

Estimated: 9 without the patch, 18 with the patch
Actual: 9

It's a worry that none of the existing regression tests picked up on
that. Perhaps a similar test could be added using the existing test
data. Otherwise, I think it'd be worth adding a new test with similar
data to the above.
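
Something along these lines, perhaps (table name and data purely
illustrative, using the check_estimated_rows() helper that
stats_ext.sql already defines):

CREATE TABLE foo_or (a int, b int);
INSERT INTO foo_or SELECT x/10+1, x FROM generate_series(1,10000) g(x);
CREATE STATISTICS foo_or_s (mcv) ON a, b FROM foo_or;
ANALYZE foo_or;
SELECT * FROM check_estimated_rows('SELECT * FROM foo_or WHERE a = 1 OR (b > 0 AND b < 10)');
DROP TABLE foo_or;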

Regards,
Dean
