Re: Selectivity estimation for intarray with @@

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Uriy Zhuravlev <u(dot)zhuravlev(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Selectivity estimation for intarray with @@
Date: 2015-07-21 12:44:50
Message-ID: CAPpHfdurgec2DNdL1OzO00ie32mP8Tdi=b9RQGkwfWa0Bqi90Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

While Uriy is on vacation, I've revised this patch a bit.

On Tue, Jul 14, 2015 at 8:55 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> Hi Uriy,
>
> This patch looks pretty good.
>
> The first line of intarray--1.1.sql mis-identifies itself as "/*
> contrib/intarray/intarray--1.0.sql */"
>

Fixed.

> The real file intarray--1.0.sql file probably should not be included in
> the final patch, but I like having it for testing.
>

I've removed intarray--1.0.sql since it shouldn't be in final commit.

> It applies and builds cleanly over the alter operator patch (and now the
> commit as well), passes make check of the contrib module both with and
> without cassert.
>
> I could succesfully upgrade from version 1.0 to 1.1 without having to drop
> the gin__int_ops indexes in the process.
>
> I could do pg_upgrade from 9.2 and 9.4 to 9.6devel with large indexes in
> place, and then upgrade the extension to 1.1, and it worked without having
> to rebuild the index.
>
> It does what it says, and I think we want this.
>

Good.

> There were some cases where the estimates were not very good, but that
> seems to be limitation of what pg_stats makes available, not of this
> patch. Specifically if the elements listed in the query text are not part
> of most_common_elems (or worse yet, most_common_elems is null) then it is
> left to guess with no real data, and those guesses can be pretty bad. It
> is not this patches job to fix that, however.
>

> It assumes all the stats are independent and so doesn't account for
> correlation between members. This is also how the core selectivity
> estimates work between columns, so I can't really complain about that. It
> is easy to trick it with things like @@ '(!300 & 300)'::query_int, but I
> don't think that is necessary to fix that.
>

Analysis of all the dependencies inside query is NP-complete task. We could
try to workout simple cases, but postgres optimizer currently doesn't care
about it.

# explain select * from test where val = 'val' and val != 'val';
QUERY PLAN
-------------------------------------------------------------
Seq Scan on test (cost=0.00..39831.00 rows=499995 width=8)
Filter: ((val <> 'val'::text) AND (val = 'val'::text))
(2 rows)

I think we could do the same inside intquery until we figure out some
better solution for postgres optimizer in general.

> I have not been creative enough to come up with queries for which this
> improvement in selectivity estimate causes the execution plan to change in
> important ways. I'm sure the serious users of this module would have such
> cases, however.
>
> I haven't tested gist__int_ops as I can't get those indexes to build in a
> feasible amount of time. But the selectivity estimates are independent of
> any actual index, so testing those doesn't seem to be critical.
>
> There is no documentation change, which makes sense as this internal stuff
> which isn't documented to start with.
>

Yes. For instance, tsquery make very similar selectivity estimation as
intquery, but it's assumed to be internal and isn't mentioned in
documentation.

There are no regression test changes. Not sure about that, it does seem
> like regression tests would be desirable.
>

It would be nice to cover selectivity estimation with regression tests, but
AFAICS we didn't do it for any selectivity estimation functions yet.
Problem is that selectivity estimation is quite complex process and its
result depending on random sampling during analyze, floating points
operations and so on. We could make a test like "with very high level of
confidence, estimate number of rows here should be in [10;100]". But it's
hard to fit such assumptions into our current regression tests
infrastructure.

I haven't gone through the C code.
>

I also did some coding style and comments modifications.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
intarray_sel-2.patch application/octet-stream 36.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-07-21 12:52:00 Re: Fillfactor for GIN indexes
Previous Message Michael Paquier 2015-07-21 11:58:15 Re: "make check" changes have caused buildfarm deterioration.