Re: Collect frequency statistics for arrays

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-04 14:24:36
Message-ID: CAPpHfdvbM0d4gRb0UrR3ud2C6StKW7bZ8bio88FCiFYvxZX8GQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> 2. The tests in the above-mentioned message show that in most cases
> where mcelem_array_contained_selec falls through to the "rough
> estimate", the resulting rowcount estimate is just 1, ie we are coming
> out with very small selectivities. Although that path will now only be
> taken when there are no stats, it seems like we'd be better off to
> return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there
> must be something wrong with the "rough estimate" logic. Could you
> recheck that?
>

I think the wrong think with "rough estimate" is that assumption about
independent occurrences of items is very unsuitable even for "rough
estimate". The following example shows that "rough estimate" really works
in the case of independent occurrences of items.

Generate test table where item occurrences are really independent.

test=# create table test as select ('{'||(select string_agg(s,',') from
(select case when (t*0 + random()) < 0.1 then i::text else null end from
generate_series(1,100) i) as x(s))||'}')::int[] AS val from
generate_series(1,10000) t;

SELECT 10000

test=# analyze test;
ANALYZE

Do some test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..239.00 rows=151 width=61) (actual
time=0.325..32.556 rows=163 loops=1
)
Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
Rows Removed by Filter: 9837
Total runtime: 32.806 ms
(4 rows)

Delete DECHIST statistics.

test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 =
null, stavalues1 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind1 = 5;
UPDATE 0
test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 =
null, stavalues2 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind2 = 5;
UPDATE 0
test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 =
null, stavalues3 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind3 = 5;
UPDATE 0
test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 =
null, stavalues4 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind4 = 5;
UPDATE 1
test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 =
null, stavalues5 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind5 = 5;
UPDATE 0

Do another test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..239.00 rows=148 width=61) (actual
time=0.332..32.952 rows=163 loops=1)
Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
Rows Removed by Filter: 9837
Total runtime: 33.225 ms
(4 rows)

It this particular case "rough estimate" is quite accurate. But in most
part of cases it behaves really bad. It is why I started to invent
calc_distr and etc. So, I think return DEFAULT_CONTAIN_SEL is OK unless
we've some better ideas.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniele Varrazzo 2012-03-04 15:05:08 Re: Patch: improve selectivity estimation for IN/NOT IN
Previous Message Euler Taveira de Oliveira 2012-03-04 13:44:40 Re: Patch: improve selectivity estimation for IN/NOT IN