Quick Links

Re: Additional improvements to extended statistics

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Additional improvements to extended statistics
Date:	2020-03-21 21:59:46
Message-ID:	20200321215946.epx3dfi74stlwiab@development
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Mar 19, 2020 at 07:08:07PM +0000, Dean Rasheed wrote:
>On Wed, 18 Mar 2020 at 19:31, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Attached is a rebased patch series, addressing both those issues.
>>
>> I've been wondering why none of the regression tests failed because of
>> the 0.0 vs. 1.0 issue, but I think the explanation is pretty simple - to
>> make the tests stable, all the MCV lists we use are "perfect" i.e. it
>> represents 100% of the data. But this selectivity is used to compute
>> selectivity only for the part not represented by the MCV list, i.e. it's
>> not really used. I suppose we could add a test that would use larger
>> MCV item, but I'm afraid that'd be inherently unstable :-(
>>
>
>I think it ought to be possible to write stable tests for this code
>branch -- I think all you need is for the number of rows to remain
>small, so that the stats sample every row and are predictable, while
>the MCVs don't cover all values, which can be achieved with a mix of
>some common values, and some that only occur once.
>
>I haven't tried it, but it seems like it would be possible in principle.
>

Ah, right. Yeah, I think that should work. I thought there would be some
volatility due to groups randomly not making it into the MCV list, but
you're right it's possible to construct the data in a way to make it
perfectly deterministic. So that's what I've done in the attached patch.

>> Another thing I was thinking about is the changes to the API. We need to
>> pass information whether the clauses are connected by AND or OR to a
>> number of places, and 0001 does that in two ways. For some functions it
>> adds a new parameter (called is_or), and for other functiosn it creates
>> a new copy of a function. So for example
>>
>> - statext_mcv_clauselist_selectivity
>> - statext_clauselist_selectivity
>>
>> got the new flag, while e.g. clauselist_selectivity gets a new "copy"
>> sibling called clauselist_selectivity_or.
>>
>> There were two reasons for not using flag. First, clauselist_selectivity
>> and similar functions have to do very different stuff for these two
>> cases, so it'd be just one huge if/else block. Second, minimizing
>> breakage of third-party code - pretty much all the extensions I've seen
>> only work with AND clauses, and call clauselist_selectivity. Adding a
>> flag would break that code. (Also, there's a bit of laziness, because
>> this was the simplest thing to do during development.)
>>
>> But I wonder if that's sufficient reason - maybe we should just add the
>> flag in all cases. It might break some code, but the fix is trivial (add
>> a false there).
>>
>> Opinions?
>>
>
>-1
>
>I think of clause_selectivity() and clauselist_selectivity() as the
>public API that everyone is using, whilst the functions that support
>lists of clauses to be combined using OR are internal (to the planner)
>implementation details. I think callers of public API tend to either
>have implicitly AND'ed list of clauses, or a single OR clause.
>

OK, thanks. That was mostly my reasoning too - not wanting to cause
unnecessary breakage. And yes, I agree most people just call
clauselist_selectivity with a list of clauses combined using AND.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment	Content-Type	Size
0001-Improve-estimation-of-OR-clauses-using-exte-20200321.patch	text/plain	26.3 KB

In response to

Re: Additional improvements to extended statistics at 2020-03-19 19:08:07 from Dean Rasheed

Responses

Re: Additional improvements to extended statistics at 2020-03-23 08:21:42 from Dean Rasheed

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Floris Van Nee	2020-03-21 22:00:01	RE: Index Skip Scan
Previous Message	Pavel Stehule	2020-03-21 20:29:21	Re: plan cache overhead on plpgsql expression