Re: Selectivity estimation for intarray with @@

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Uriy Zhuravlev <u(dot)zhuravlev(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Selectivity estimation for intarray with @@
Date: 2015-07-21 16:28:01
Message-ID: CAPpHfdsxU-V9AB1-PrSUQc42n6WR44vciZFCTiVa2ZOci1vJ0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 07/21/2015 03:44 PM, Alexander Korotkov wrote:
>
>> While Uriy is on vacation, I've revised this patch a bit.
>>
>
> I whacked this around quite a bit, and I think it's in a committable state
> now. But if you could run whatever tests you were using before on this, to
> make sure it still produces the same estimates, that would be great. I
> didn't change the estimates it should produce, only the code structure.
>
> One thing that bothers me slightly with this patch is the way it peeks
> into the Most-Common-Elements arrays, which is produced by the built-in
> type analyze function. If we ever change what statistics are collected for
> arrays, or the way they are stored, this might break. In matchsel, why
> don't we just call the built-in estimator function for each element that we
> need to probe, and not look into the statistics ourselves at all? I
> actually experimented with that, and it did slash much of the code, and it
> would be more future-proof. However, it was also a lot slower for queries
> that contain multiple values. That's understandable: the built-in estimator
> will fetch the statistics tuple, parse the arrays, etc. separately for each
> value in the query_int, while this patch will do it only once for the whole
> query, and perform a simple binary search for each value. So overall, I
> think this is OK as it is. But if we find that we need to use the MCE list
> in this fashion in more places in the future, it might be worthwhile to add
> some support code for this in the backend to allow extracting the stats
> once, and doing multiple "lightweight estimations" using the extracted
> stats.
>

Yeah, I see. We could end up with something like this. But probably we
would need something more general for extensions which wants to play with
statistics.
For instance, pg_trgm could estimate selectivity for "text % text"
operator. But in order to provide that it needs trigram statistics. Now it
could be implemented by defining separate datatype, but it's a kluge.
Probably, we would end up with custom additional statistics for datatypes.

> Some things I fixed/changed:
>
> * I didn't like that transformOperator() function, which looked up the
> function's name. I replaced it with separate wrapper functions for each
> operator, so that the built-in operator's OID can be hardcoded into each.
>
> * I refactored the matchsel function heavily. I think it's more readable
> now.
>
> * I got rid of the Int4Freq array. It didn't seem significantly easier to
> work with than the separate values/numbers arrays, so I just used those
> directly.
>
> * Also use the matchsel estimator for ~~ (the commutator of @@)
>

In this version of patch it's not checked if variable is actually and int[]
not query_int. See following test case.

# create table test2 as (select '1'::query_int val from
generate_series(1,1000000));
# analyze test2;

# explain select * from test2 where '{1}'::int[] @@ val;
ERROR: unrecognized int query item type: 0

I've added this check.

* Also use the estimators for the obsolete @ and ~ operators. Not that I
> care much about those as they are obsolete, but seems strange not to, as
> it's a trivial matter of setting the right estimator function.
>
> * I added an ANALYZE in the regression test. It still won't systematically
> test all the cost estimation code, and there's nothing to check that the
> estimates make sense, but at least more of the code will now run.

You also forgot to include intarray--1.0--1.1.sql into patch. I've also
added it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
intarray-sel-4.patch application/octet-stream 38.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2015-07-21 16:29:55 Re: pgbench stats per script & other stuff
Previous Message Robert Haas 2015-07-21 16:23:47 ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard