statistics for array types

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: statistics for array types
Date: 2015-08-11 14:38:54
Message-ID: CAMkU=1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw@mail.gmail.com
Lists: pgsql-hackers

When reviewing some recent patches, I concluded that the statistics gathered
for arrays have some pre-existing shortcomings.

The main one is that when the arrays contain only rare elements, there is no
histogram to fall back on when the most-common-elements (MCE) array is empty,
the way there is for scalar stats. So the planner has to punt completely and
assume a selectivity of 0.5%, without recourse to any data at all.

The rationale for applying the threshold before elements are eligible for
inclusion in the MCE array seems to be that this puts a theoretical bound on
the amount of error we are likely to have in each element's frequency
estimate. But I think it is better to exceed that theoretical bound than to
have no data at all.

The attached patch forces there to be at least one element in the MCE array,
keeping the element with the highest predicted frequency if the array would
otherwise be empty. Any other element queried for is then assumed to be no
more common than this most common element.
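The idea can be sketched in Python (a toy model of the mechanism described
above, not PostgreSQL's actual code; the threshold, frequencies, and function
names here are made up for illustration):

```python
# Toy sketch: building an MCE (most-common-elements) list from sampled
# per-element frequencies.  Without the patch, nothing below the threshold
# is kept, so the list can come back empty and the planner falls back to a
# fixed default selectivity (0.5%).  With the patch, the single most
# frequent element is kept even when it misses the threshold, and any
# other queried element is assumed to be no more common than it.

DEFAULT_SEL = 0.005  # the hard-coded 0.5% fallback mentioned above


def build_mce(freqs, threshold, force_one=True):
    """freqs: dict mapping element -> estimated fraction of rows containing it."""
    mce = {e: f for e, f in freqs.items() if f >= threshold}
    if not mce and force_one and freqs:
        # Patch behavior: keep the single highest-frequency element anyway.
        best = max(freqs, key=freqs.get)
        mce = {best: freqs[best]}
    return mce


def estimate_selectivity(mce, element):
    if element in mce:
        return mce[element]
    if mce:
        # An element absent from the MCE list can be no more common than
        # the least common element that was kept.
        return min(mce.values())
    return DEFAULT_SEL  # no data at all: punt


# All elements rare, so the threshold is never met:
freqs = {1: 0.0003, 2: 0.0002}
print(estimate_selectivity(build_mce(freqs, 0.001, force_one=False), 99))  # 0.005
print(estimate_selectivity(build_mce(freqs, 0.001, force_one=True), 99))   # 0.0003
```

With `force_one=False` (the unpatched behavior) a query for any element falls
back to the flat 0.5% default; with `force_one=True` it is capped by the one
retained element's frequency, which is the effect shown in the EXPLAIN output
below.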

I'd also briefly considered just having the part of the code that pulls the
stats out of pg_stats interpret an empty MCE array as meaning that nothing is
more frequent than the threshold, but then that part of the code would need
to know how the threshold is chosen, which just seems wrong. It would also
need to distinguish a NULL MCE meaning no stats were gathered from one
meaning stats were gathered but nothing met the threshold.

I'm only marginally interested in this area, but since I did the
investigation into it already I wanted to present it here. I'd be quite
happy if someone else wanted to take over the patch (or the concept behind
it).

Here is an example, where the estimate drops from 500 rows to 3 rows, while
the correct number is zero:

create table foobar as
  select (
    select array_agg(floor((random()*1000000000 + g.y/1000000 + f.x/10000000))::int)
    from generate_series(1,10) g(y)
  ) foo
  from generate_series(1,100000) f(x);

Unpatched:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=500 width=61) (actual time=76.448..76.448 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.211 ms
 Execution time: 76.492 ms

With patch:

analyze;

explain (analyze) select * from foobar where foo @> '{567}';
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2387.00 rows=3 width=61) (actual time=78.604..78.604 rows=0 loops=1)
   Filter: (foo @> '{567}'::integer[])
   Rows Removed by Filter: 100000
 Planning time: 0.199 ms
 Execution time: 78.665 ms

Cheers,

Jeff

Attachment Content-Type Size
array_type_analyze_MCE_V001.patch application/octet-stream 1.8 KB
