Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-08-08 20:44:52
Message-ID: CAPpHfdsaGJ_u+1Bw0SsQsF4O=b90sMqQw0eDpbcQi7vTqwvw1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For testing statistics accuracy I've used same datasets as for testing
opclasses performance:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00414.php
Script for testing and database schema is attached.
Dump with tests results can be downloaded here:
http://test.i-gene.ru/uploads/range_stat_tests.sql.gz

Following table shows statistics of accuracy when actual count of rows is
somewhat large (>=10). Second column shows average ratio of estimate count
of rows to actual count of rows. Third column shows average relative error
of estimation.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.27166784340153 | 0.498570654434906
@> | 1.35965412121763 | 0.384991198200582
&& | 1.08236985243139 | 0.105298599354035
(3 rows)

When result set is small (1-9 rows) then errors are more significant.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.51371646596783 | 2.85624536756285
@> | 3.85482923324034 | 2.91433432363562
&& | 3.14281204906205 | 2.28899260461761
(3 rows)

Following table presents average estimate count of rows when actual count
of rows is 0. This value is quite high for && operator, but it comes from
only one tests, so it's not really representative.

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+---------------------+-------------
<@ | 1.1259887005649718 | 1770
@> | 1.0598670878194025 | 88329
&& | 28.0000000000000000 | 1
(3 rows)

Same tables for statistics target = 1000.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.17132962269887 | 0.394427785424827
@> | 1.35677772347908 | 0.376171286348914
&& | 1.06762781136499 | 0.0874012522386387
(3 rows)

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.30836881177966 | 2.64459517657192
@> | 3.47535917820028 | 2.55199556747496
&& | 2.49181718664477 | 1.49181718664477
(3 rows)

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1650879566982409 | 739
@> | 1.0511811463771843 | 89447
(2 rows)

My conclusion is so, that current errors are probably ok for selectivity
estimation. But taking into attention that generated datasets ideally fits
assumptions of estimation, there could be room for improvement. Especially,
it's unclear why estimate for "<@" and "@>" have much greater error than
estimate for "&&". Possibly, it's caused by some bugs.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
range_stat.php application/x-httpd-php 6.7 KB
range_stat_schema.sql application/octet-stream 105.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-08-08 20:55:43 Re: [PATCH] Make "psql -1 < file.sql" work as with "-f"
Previous Message Robert Haas 2012-08-08 20:23:04 Re: -Wformat-zero-length