From: | "Yuzuko Hosoya" <hosoya(dot)yuzuko(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Improve selectivity estimate for range queries |
Date: | 2018-12-20 08:21:29 |
Message-ID: | 008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
I found the problem about selectivity estimate for range queries
on master as follows. This is my test case:
postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
QUERY PLAN
----------------------------------------------------------------------------------------------------
---
Seq Scan on test (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999
loops=1)
Filter: ((id > 0) AND (id < 10000))
Rows Removed by Filter: 20001
Planning Time: 0.099 ms
Execution Time: 37.142 ms
(5 rows)
Although the actual number of rows was 9999, the estimated number
of rows was 150.
So I dug to see what happened and thought that the following part
in clauselist_selectivity caused this problem.
------------
/*
* Now scan the rangequery pair list.
*/
while (rqlist != NULL)
{
RangeQueryClause *rqnext;
if (rqlist->have_lobound && rqlist->have_hibound)
{
/* Successfully matched a pair of range clauses */
Selectivity s2;
/*
* Exact equality to the default value probably means the
* selectivity function punted. This is not airtight but should
* be good enough.
*/
if (rqlist->hibound == DEFAULT_INEQ_SEL ||
rqlist->lobound == DEFAULT_INEQ_SEL)
{
s2 = DEFAULT_RANGE_INEQ_SEL;
}
else
{
s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------
In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331. Then, the
value of rqlist->hibound and rqlist->lobound are set respectively.
On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result,
the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL,
since the condition of the second if statement was satisfied. However,
these selectivities were computed according to the statistics, so the
final selectivity had to be calculated from rqlist->hibound + rqlist->lobound - 1.0.
My test case might be uncommon, but I think it would cause some problems.
To handle such cases I've thought up of an idea based on a previous commit[1]
which solved a similar problem related to DEFAULT_NUM_DISTINCT. Just like
this modification, I added a flag which shows whether the selectivity
was calculated according to the statistics or not to clauselist_selectivity,
and used it as a condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid that changes were more than I had expected, but we can estimate
selectivity correctly.
Patch attached. Do you have any thoughts?
[1]
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
improve_selectivity_estimate_for_range_queries.patch | application/octet-stream | 15.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Kyotaro HORIGUCHI | 2018-12-20 08:32:25 | Re: [HACKERS] WAL logging problem in 9.4.3? |
Previous Message | Ideriha, Takeshi | 2018-12-20 08:17:27 | RE: Localization tools for Postgres |