Quick Links

Re: Statistics and selectivity estimation for ranges

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-08 20:44:52
Message-ID:	CAPpHfdsaGJ_u+1Bw0SsQsF4O=b90sMqQw0eDpbcQi7vTqwvw1g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

For testing statistics accuracy I've used same datasets as for testing
opclasses performance:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00414.php
Script for testing and database schema is attached.
Dump with tests results can be downloaded here:
http://test.i-gene.ru/uploads/range_stat_tests.sql.gz

Following table shows statistics of accuracy when actual count of rows is
somewhat large (>=10). Second column shows average ratio of estimate count
of rows to actual count of rows. Third column shows average relative error
of estimation.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.27166784340153 | 0.498570654434906
@> | 1.35965412121763 | 0.384991198200582
&& | 1.08236985243139 | 0.105298599354035
(3 rows)

When result set is small (1-9 rows) then errors are more significant.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.51371646596783 | 2.85624536756285
@> | 3.85482923324034 | 2.91433432363562
&& | 3.14281204906205 | 2.28899260461761
(3 rows)

Following table presents average estimate count of rows when actual count
of rows is 0. This value is quite high for && operator, but it comes from
only one tests, so it's not really representative.

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+---------------------+-------------
<@ | 1.1259887005649718 | 1770
@> | 1.0598670878194025 | 88329
&& | 28.0000000000000000 | 1
(3 rows)

Same tables for statistics target = 1000.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.17132962269887 | 0.394427785424827
@> | 1.35677772347908 | 0.376171286348914
&& | 1.06762781136499 | 0.0874012522386387
(3 rows)

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.30836881177966 | 2.64459517657192
@> | 3.47535917820028 | 2.55199556747496
&& | 2.49181718664477 | 1.49181718664477
(3 rows)

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1650879566982409 | 739
@> | 1.0511811463771843 | 89447
(2 rows)

My conclusion is so, that current errors are probably ok for selectivity
estimation. But taking into attention that generated datasets ideally fits
assumptions of estimation, there could be room for improvement. Especially,
it's unclear why estimate for "<@" and "@>" have much greater error than
estimate for "&&". Possibly, it's caused by some bugs.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat.php	application/x-httpd-php	6.7 KB
range_stat_schema.sql	application/octet-stream	105.2 KB

In response to

Re: Statistics and selectivity estimation for ranges at 2012-08-07 04:25:45 from Alexander Korotkov

Responses

Re: Statistics and selectivity estimation for ranges at 2012-08-09 15:35:50 from Alexander Korotkov
Re: Statistics and selectivity estimation for ranges at 2012-08-12 21:11:51 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2012-08-08 20:55:43	Re: [PATCH] Make "psql -1 < file.sql" work as with "-f"
Previous Message	Robert Haas	2012-08-08 20:23:04	Re: -Wformat-zero-length