Re: Additional improvements to extended statistics

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-11-29 14:57:42
Message-ID: CAEZATCVi6Lxooa8YFq5F6tkeu9kB_o3SUqjo95_kuSs2Sq-rHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Wed, 18 Nov 2020 at 22:37, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >
> > Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
> > AFAICS we use "_internal" for similar functions.
> >

I have been thinking about this some more. The one part of this that I
still wasn't happy with was the way that base frequencies were used to
compute the selectivity correction to apply. As noted in [1], using
base frequencies in this way isn't really appropriate for clauses
combined using "OR". The reason is that an item's base frequency is
computed as the product of the per-column selectivities, so that (freq
- base_freq) is the right correction to apply for a set of clauses
combined with "AND", but it doesn't really work properly for clauses
combined with "OR". This is why a number of the estimates in the
regression tests end up being significant over-estimates.

I speculated in [1] that we might fix that by tracking which columns
of the match bitmap actually matched the clauses being estimated, and
then only use those base frequencies. Unfortunately that would also
mean changing the format of the stats that we store, and so would be a
rather invasive change.

It occurred to me though, that there is another, much more
straightforward way to do it. We can rewrite the "OR" clauses, and
turn them into "AND" clauses using the fact that

P(A OR B) = P(A) + P(B) - P(A AND B)

and then use the multivariate stats to estimate the P(A AND B) part in
the usual way.

Attached is the resulting patch doing it that way. The main change is
in the way that statext_mcv_clauselist_selectivity() works, combined
with a new function mcv_clause_selectivity_or() that does the
necessary MCV bitmap manipulations.

Doing it this way also means that clausesel.c doesn't need to export
clauselist_selectivity_or(), and the new set of exported functions
seem a bit neater now.

A handful of regression test results change, and in all cases except
one the new estimates are much better. One estimate is made worse, but
in that case we only have 2 sets of partial stats:

SELECT * FROM mcv_lists_multi WHERE a = 0 OR b = 0 OR c = 0 OR d = 0

with stats on (a,b) and (c,d) so it's not surprising that combining (a
= 0 OR b = 0) with (c = 0 OR d = 0) mis-estimates a bit. I suspect the
old MV stats estimate was more down to chance in this case.

Regards,
Dean

[1] https://www.postgresql.org/message-id/CAEZATCX8u9bZzcWEzqA_t7f_OQHu2oxeTUGnFHNEOXnJo35AQg%40mail.gmail.com

Attachment Content-Type Size
improve-estimation-of-OR-clauses-20201129.patch text/x-patch 36.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2020-11-29 16:42:22 Re: VACUUM (DISABLE_PAGE_SKIPPING on)
Previous Message James Coleman 2020-11-29 14:44:33 Re: Why does create_gather_merge_plan need make_sort?