Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2012-08-14 06:45:58
Message-ID: CAPpHfdudkYhJkGbhQtbjt6GG5Owp_v7_kpQKA=1Dz5Fb8iGteA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 13, 2012 at 1:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> My conclusion is so, that current errors are probably ok for selectivity
>> estimation. But taking into attention that generated datasets ideally fits
>> assumptions of estimation, there could be room for improvement. Especially,
>> it's unclear why estimate for "<@" and "@>" have much greater error than
>> estimate for "&&". Possibly, it's caused by some bugs.
>>
>
> ITSM, I found reason of inaccuracy. Implementation of linear interpolation
> was wrong. Fixed version is attached. Now, need to rerun tests, possible
> refactoring and comments rework.
>

After fixing few more bugs, I've a version with much more reasonable
accuracy.

Statistics target = 100.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00404179116863 | 0.0504415454560903
@> | 1.06364108531688 | 0.105646077989812
&& | 1.00757984721409 | 0.0420984234933233
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.31530838062865 | 0.654886592410495
@> | 2.78708078320147 | 1.94124123003433
&& | 1.93268112525538 | 1.09904919063335
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1437670609645132 | 1099
@> | 1.0479430126460701 | 87458
(2 rows)

Statistics target = 1000.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00073999445381 | 0.045099762607524
@> | 1.05296320350853 | 0.0907489633452971
&& | 1.00217602359039 | 0.0353421159150165
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.26946358795998 | 0.577803898836364
@> | 2.69000633430211 | 1.83165424646645
&& | 1.48715184186882 | 0.577998652291105
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.0887096774193548 | 1364
@> | 1.0423876983771183 | 89224
&& | 5.0000000000000000 | 1
(3 rows)

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
range_stat-0.6.patch.gz application/x-gzip 10.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2012-08-14 07:46:04 patch: shared session variables
Previous Message Kevin Grittner 2012-08-14 05:27:20 Re: default_isolation_level='serializable' crashes on Windows