WIP patch: distinguish selectivity of < from <= and > from >=

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP patch: distinguish selectivity of < from <= and > from >=
Date: 2017-07-04 03:53:30
Message-ID: 12232.1499140410@sss.pgh.pa.us
Lists: pgsql-hackers

I was reminded by bug #14729 that we are not very good on corner cases
involving the inequality operators, i.e. <, <=, >, and >=. Historically, the
selectivity functions have simply not distinguished < from <= or >
from >=, arguing that the fraction of the population that satisfies
the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV
will take care of things properly.) But that isn't really true unless
we're dealing with a continuum of variable values, and in practice
we seldom are. If "x = const" would estimate a nonzero number of
rows for a given const value, then it follows that we ought to estimate
different numbers of rows for "x < const" and "x <= const", whether or
not the const is one of the MCVs.
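
To make that concrete with the same tenk1 table used in the examples
below: each value of "thousand" occurs ten times out of 10000 rows, so
the "=" fraction is 0.001 rather than negligible, and the estimates for
< and <= ought to differ by about ten rows:

select count(*) from tenk1 where thousand = 10;   -- 10
select count(*) from tenk1 where thousand < 10;   -- 100
select count(*) from tenk1 where thousand <= 10;  -- 110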

Hence, the attached patch, which splits the scalar inequality selectivity
functions into four sets instead of two. This demonstrably produces
better estimates, for example using the "tenk1" table in the regression
database:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.121..0.623 rows=100 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.06..227.42 rows=100 width=244) (actual time=0.054..0.300 rows=100 loops=1)

regression=# explain analyze select * from tenk1 where thousand <= 10;
before:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.120..0.383 rows=110 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=5.14..241.38 rows=110 width=244) (actual time=0.062..0.288 rows=110 loops=1)

regression=# explain analyze select * from tenk1 where thousand > 10;
before:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.030..6.276 rows=9890 loops=1)
with patch:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.019..4.881 rows=9890 loops=1)

regression=# explain analyze select * from tenk1 where thousand >= 10;
before:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9890 width=244) (actual time=0.022..5.371 rows=9900 loops=1)
with patch:
Seq Scan on tenk1 (cost=0.00..483.00 rows=9900 width=244) (actual time=0.014..3.783 rows=9900 loops=1)

regression=# explain analyze select * from tenk1 where thousand between 10 and 11;
before:
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.082..0.215 rows=20 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=4.49..70.61 rows=20 width=244) (actual time=0.080..0.207 rows=20 loops=1)
Recheck Cond: ((thousand >= 10) AND (thousand <= 11))

regression=# explain analyze select * from tenk1 where thousand between 10 and 10;
before:
Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.29..8.30 rows=1 width=244) (actual time=0.041..0.112 rows=10 loops=1)
with patch:
Bitmap Heap Scan on tenk1 (cost=4.39..39.52 rows=10 width=244) (actual time=0.074..0.142 rows=10 loops=1)

As these examples show, it's cases with very tight range constraints where
this really makes enough difference to be worth doing. I believe the
complaint in bug #14729 basically corresponds to the last example, where
the erroneous rowcount estimate is driving a bad choice of join plan.

Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
applies a correction for equal values in the cases where we were getting
it wrong before. While this logic seems experimentally correct (see
above), I have to admit that I've failed to wrap my brain around exactly
why it's correct. The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong. Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?
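
(For reference, the built-in operators whose pg_operator entries are
touched can be eyeballed from SQL; for instance, for int4 comparisons,
with the patch one would expect the <= and >= rows here to show
scalarlesel and scalargesel instead of scalarltsel and scalargtsel:

select oprname, oprrest
from pg_operator
where oprleft = 'int4'::regtype
  and oprright = 'int4'::regtype
  and oprname in ('<', '<=', '>', '>=');
)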

Aside from the missing/inadequate comment about why to apply this
correction, there remains work to update several contrib modules that
reference scalarltsel/scalargtsel. That could be done separately though.
An extension that doesn't change its <= or >= operators to reference
scalarlesel/scalargesel isn't worse off than before; it's just failing
to benefit from this improvement.
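
For illustration, once an extension is ready to depend on a release
containing this, picking up the benefit could be as simple as the
following (using a made-up "mytype" for the operand type), or the
equivalent RESTRICT clauses in its CREATE OPERATOR commands:

-- "mytype" is a hypothetical extension-provided type
ALTER OPERATOR <= (mytype, mytype) SET (RESTRICT = scalarlesel);
ALTER OPERATOR >= (mytype, mytype) SET (RESTRICT = scalargesel);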

Obviously this is too late for v10; I'll stick it into the next
commitfest.

regards, tom lane

Attachment: improve-scalar-inequality-estimates-1.patch (text/x-diff, 74.9 KB)
