Re: Additional improvements to extended statistics

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-24 01:08:38
Message-ID: 20200324010838.hbv5gzhexcv5cilg@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 23, 2020 at 08:21:42AM +0000, Dean Rasheed wrote:
>On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Ah, right. Yeah, I think that should work. I thought there would be some
>> volatility due to groups randomly not making it into the MCV list, but
>> you're right it's possible to construct the data in a way to make it
>> perfectly deterministic. So that's what I've done in the attached patch.
>>
>
>Looking through those new tests, another issue occurred to me -- under
>some circumstances this patch can lead to extended stats being applied
>twice to the same clause, which is not great, because it involves
>quite a lot of extra work, and also because it can lead to
>overestimates because of the way that MCV stats are applied as a delta
>correction to simple_sel.
>
>The way this comes about is as follows -- if we start with an OR
>clause, that will be passed as a single-item implicitly AND'ed list to
>clauselist_selectivity(), and from there to
>statext_clauselist_selectivity() and then to
>statext_mcv_clauselist_selectivity(). This will call
>clauselist_selectivity_simple() to get simple_sel, before calling
>mcv_clauselist_selectivity(), which will recursively compute all the
>MCV corrections. However, the call to clauselist_selectivity_simple()
>will call clause_selectivity() for each clause (just a single OR
>clause in this example), which will now call
>clauselist_selectivity_or(), which will go back into
>statext_clauselist_selectivity() with "is_or = true", which will apply
>the MCV corrections to the same set of clauses that the outer call was
>about to process.
>

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000);
CREATE STATISTICS s (mcv) ON a,b FROM t;

SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly, but the calls graph looks like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple
clause_selectivity
clauselist_selectivity_or
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <---
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple_or
mcv_clauselist_selectivity
clauselist_selectivity_simple
(already estimated)

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.

>I'm not sure what's the best way to resolve that. Perhaps
>statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>should ignore OR clauses from an AND-list, on the basis that they will
>get processed recursively later. Or perhaps estimatedclauses can
>somehow be used to prevent this, though I'm not sure exactly how that
>would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity
clauselist_selectivity_simple <--- disable extended stats
clause_selectivity
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple
already estimated

I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

BTW do you have an example where this would actually cause an issue?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-03-24 01:41:50 Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru
Previous Message Michael Paquier 2020-03-24 01:04:24 Re: Define variables in the approprieate scope