Re: Additional improvements to extended statistics

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-11-12 14:18:24
Message-ID: 57eb142c-d5a4-cfca-675b-0dfd9ef7c2e2@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Here is an improved WIP version of the patch series, modified to address
the issue with repeatedly applying the extended statistics, as discussed
with Dean in this thread. It's a bit rough and not committable, but I
need some feedback so I'm posting it in this state.

(Note: The WIP patch is expected to fail regression tests. A couple
stats_ext regression tests fail due to changed estimate - I've left it
like that to make the changes more obvious for now.)

Earlier in this thread I used this example:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT mod(i, 10), mod(i, 10)
FROM generate_series(1,100000) s(i);
CREATE STATISTICS s (mcv) ON a,b FROM t;
ANALYZE t;

EXPLAIN SELECT * FROM t WHERE a = 0 OR b = 0;

which had this call graph with two statext_mcv_clauselist_selectivity
calls (which was kinda the issue):

clauselist_selectivity
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <--- (1)
clauselist_selectivity_simple
clause_selectivity
clauselist_selectivity_or
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity <--- (2)
clauselist_selectivity_simple_or
clause_selectivity
clause_selectivity
mcv_clauselist_selectivity
clauselist_selectivity_simple_or
mcv_clauselist_selectivity
clauselist_selectivity_simple
(already estimated)

with the patches applied, the call looks like this:

clauselist_selectivity_internal (use_extended_stats=1)
statext_clauselist_selectivity
statext_mcv_clauselist_selectivity (is_or=0)
clauselist_selectivity_simple
clause_selectivity_internal (use_extended_stats=0)
clauselist_selectivity_or (use_extended_stats=0)
clauselist_selectivity_simple_or
clause_selectivity_internal (use_extended_stats=0)
clause_selectivity_internal (use_extended_stats=0)
mcv_clauselist_selectivity (is_or=0)
clauselist_selectivity_simple

The nested call is removed, which I think addresses the issue. As for
the effect on estimates, there's a couple regression tests where the
estimates change - not much though, an example is:

SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial
WHERE a = 0 OR b = 0 OR c = 10');
estimated | actual
-----------+--------
- 412 | 104
+ 308 | 104
(1 row)

This is on top of 0001, though. Interestingly enough, this ends up with
the same estimate as current master, but I consider that a coincidence.

As for the patches:

0001 is the original patch improving estimates of OR clauses

0002 adds thin wrappers for clause[list]_selectivity, with "internal"
functions allowing to specify whether to keep considering extended stats

0003 does the same for the "simple" functions

I've kept it like this to demonstrate that 0002 is not sufficient. In my
response from March 24 I wrote this:

> Isn't it the case that clauselist_selectivity_simple (and the OR
> variant) should ignore extended stats entirely? That is, we'd need
> to add a flag (or _simple variant) to clause_selectivity, so that it
> calls causelist_selectivity_simple_or.
But that's actually wrong, as 0002 shows (as it breaks a couple of
regression tests), because of the way we handle OR clauses. At the top
level, an OR-clause is actually just a single clause and it may get
passed to clauselist_selectivity_simple. So entirely disabling extended
stats for the "simple" functions would also mean disabling extended
stats for a large number of OR clauses. Which is clearly wrong.

So 0003 addresses that, by adding a flag to the two "simple" functions.
Ultimately, this should probably do the same thing as 0002 and add thin
wrappers, because the existing functions are part of the public API.

Dean, does this address the issue you had in mind? Can you come up with
an example of that issue in the form of a regression test or something?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Improve-estimation-of-OR-clauses-using-exte-20201112.patch text/x-patch 26.2 KB
0002-WIP-don-t-use-extended-statistics-twice-20201112.patch text/x-patch 6.9 KB
0003-WIP-pass-flag-20201112.patch text/x-patch 4.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2020-11-12 14:26:29 Re: Additional improvements to extended statistics
Previous Message Eugen Konkov 2020-11-12 14:14:50 Re: Proposition for autoname columns