Re: Additional improvements to extended statistics

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-19 19:08:07
Message-ID: CAEZATCVuV+UWp9v51BRbrXibUOSWFpQeeZkZTs-=G6d0dayZBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Attached is a rebased patch series, addressing both those issues.
>
> I've been wondering why none of the regression tests failed because of
> the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
> make the tests stable, all the MCV lists we use are "perfect" i.e. it
> represents 100% of the data. But this selectivity is used to compute
> selectivity only for the part not represented by the MCV list, i.e. it's
> not really used. I suppose we could add a test that would use larger
> MCV item, but I'm afraid that'd be inherently unstable :-(
>

I think it ought to be possible to write stable tests for this code
branch -- I think all you need is for the number of rows to remain
small, so that the stats sample every row and are predictable, while
the MCVs don't cover all values, which can be achieved with a mix of
some common values, and some that only occur once.

I haven't tried it, but it seems like it would be possible in principle.

> Another thing I was thinking about is the changes to the API. We need to
> pass information whether the clauses are connected by AND or OR to a
> number of places, and 0001 does that in two ways. For some functions it
> adds a new parameter (called is_or), and for other functiosn it creates
> a new copy of a function. So for example
>
> - statext_mcv_clauselist_selectivity
> - statext_clauselist_selectivity
>
> got the new flag, while e.g. clauselist_selectivity gets a new "copy"
> sibling called clauselist_selectivity_or.
>
> There were two reasons for not using flag. First, clauselist_selectivity
> and similar functions have to do very different stuff for these two
> cases, so it'd be just one huge if/else block. Second, minimizing
> breakage of third-party code - pretty much all the extensions I've seen
> only work with AND clauses, and call clauselist_selectivity. Adding a
> flag would break that code. (Also, there's a bit of laziness, because
> this was the simplest thing to do during development.)
>
> But I wonder if that's sufficient reason - maybe we should just add the
> flag in all cases. It might break some code, but the fix is trivial (add
> a false there).
>
> Opinions?
>

-1

I think of clause_selectivity() and clauselist_selectivity() as the
public API that everyone is using, whilst the functions that support
lists of clauses to be combined using OR are internal (to the planner)
implementation details. I think callers of public API tend to either
have implicitly AND'ed list of clauses, or a single OR clause.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2020-03-19 19:12:47 Re: Collation versioning
Previous Message Mark Dilger 2020-03-19 18:47:46 Re: Adding missing object access hook invocations