Re: Index Skip Scan

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, florisvannee(at)optiver(dot)com, pg(at)bowt(dot)ie, jesper(dot)pedersen(at)redhat(dot)com, david(dot)rowley(at)2ndquadrant(dot)com, alvherre(at)2ndquadrant(dot)com, thomas(dot)munro(at)gmail(dot)com, jtc331(at)gmail(dot)com, rafia(dot)pghackers(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, bhushan(dot)uparkar(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru
Subject: Re: Index Skip Scan
Date: 2020-02-08 14:22:17
Message-ID: 20200208142217.fhsiymg3szfnniih@development
Lists: pgsql-hackers

Hi,

I've done some testing and benchmarking of the v31 patch, looking for
regressions, costing issues etc. Essentially, I've run a bunch of SELECT
DISTINCT queries on data sets of various sizes, numbers of distinct
values etc. The results are fairly large, so I've uploaded them to github:

https://github.com/tvondra/skip-scan-test

There are four benchmark groups, depending on how the data are generated
(independent vs. correlated columns) and whether extended stats are
available:

1) skipscan - just indexes, columns are independent

2) skipscan-with-stats - indexes and extended stats, independent columns

3) skipscan-correlated - just indexes, correlated columns

4) skipscan-correlated-with-stats - indexes and extended stats,
correlated columns

The github repository contains *.ods spreadsheets comparing the duration
of the regular query plan (no skip scan) with the skip scan plan. In
general, there are pretty massive speedups, often by about two orders of
magnitude.

There are a couple of regressions, where the plan with skip scan enabled
is ~10x slower. But this seems to happen only in high-cardinality cases
where we misestimate the number of groups. Consider a table with two
independent columns:

CREATE TABLE t (a text, b text);
INSERT INTO t SELECT
    md5((10000*random())::int::text),
    md5((10000*random())::int::text)
FROM generate_series(1,1000000) s(i);

CREATE INDEX ON t(a,b);

ANALYZE;

which then behaves like this (the OFFSET only throws away the result
set, so we measure just the query execution):

test=# select * from (select distinct a,b from t) foo offset 10000000;
Time: 3138.222 ms (00:03.138)
test=# set enable_indexskipscan = off;
Time: 0.312 ms
test=# select * from (select distinct a,b from t) foo offset 10000000;
Time: 199.749 ms

So in this case the skip scan is ~15x slower than the usual plan (index
only scan + unique). The reason is pretty simple: to estimate the number
of groups we multiply the ndistinct estimates for the two columns (both
of which have n_distinct = 10000), but then we cap the estimate at 10%
of the table. When the columns are independent and high-cardinality,
that cap under-estimates the actual number of groups, making the cost
of the skip scan much lower than it should be.
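
To put some numbers on that: 10000 * 10000 = 100,000,000, which gets
capped at 100,000 (10% of the 1M rows), while the actual number of
groups is close to 1M, because almost every (a,b) combination occurs
just once. A quick way to see the gap (plain SQL, nothing specific to
the patch):

-- planner's estimate of the number of groups (the capped value)
EXPLAIN SELECT DISTINCT a, b FROM t;

-- actual number of groups (close to 1M here)
SELECT count(*) FROM (SELECT DISTINCT a, b FROM t) s;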

I don't think this is an issue the skip scan patch needs to fix, though.
Firstly, the regressed cases are a tiny minority. Secondly, we already
have a way to address the root cause - creating extended statistics with
ndistinct coefficients generally makes the problem go away.
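
For the record, this is the kind of extended statistics I mean - just a
sketch for this particular table, using the stock CREATE STATISTICS
syntax (the statistics name is arbitrary):

-- ndistinct coefficients let the planner estimate the number of groups
-- for (a,b) directly, instead of multiplying the per-column estimates
CREATE STATISTICS t_a_b_ndistinct (ndistinct) ON a, b FROM t;
ANALYZE t;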

One interesting observation, however, is that this regression only
happened with text columns, not with int or bigint ones. My assumption
is that this is because text comparisons are much more expensive. Not
sure if there is something we could do to deal with this - reduce the
number of comparisons, or something along those lines?
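
If anyone wants to double-check the comparison-cost hypothesis, the data
set can be rebuilt with bigint columns along these lines (an untested
sketch mirroring the text case above, table name is arbitrary):

CREATE TABLE t_int (a bigint, b bigint);

INSERT INTO t_int SELECT
    (10000*random())::bigint,
    (10000*random())::bigint
FROM generate_series(1,1000000) s(i);

CREATE INDEX ON t_int (a, b);

ANALYZE t_int;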

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
