Quick Links

Re: Optimizer misses big in 10.4 with BRIN index

From:	Arcadiy Ivanov <arcadiy(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, emre(at)hasegeli(dot)com, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Optimizer misses big in 10.4 with BRIN index
Date:	2018-08-07 07:14:56
Message-ID:	af24721c-8ad4-8edd-fd25-24818684b669@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 07/26/2018 07:27 AM, Tomas Vondra wrote:
> Arcadiy, can you provide plans with parallel query disabled? Or even
> better, produce a test case that reproduces this (using synthetic
> data, anonymized data or something like that, if needed).
So I basically spent most of the time trying to create a reproducible
case and I can say I failed. I can however reproduce this with specific
large data set with specific data distribution, but not an artificial
one. Unfortunately data is restricted so I have to obfuscate the data
structures. That said, I am able to reproduce this sporadically on local
machine after exporting from RDS 10.4. On some days reproduction happens
always, on others it's hard to reproduce at all. So here it goes in
sequential order:

** Data distribution in the table
======================
db0# SELECT (data->>'date')::numeric as date, count(1) FROM
schema0.table0 GROUP BY date ORDER BY date;
    date    | count
------------+---------
1527580800 |       5
1527753600 |      72
1527840000 |      36
1528012800 |      18
1528099200 |      72
1528185600 |    7266
1528272000 |     336
1528358400 |     230
1528444800 |      90
1528704000 |      76
1528790400 |       4
1528876800 |    9060
1528934400 |       6
1528963200 |      33
1529193600 |       6
1529280000 |      18
1529294400 | 541344
1529380800 | 1113121
1529467200 | 794082
1529553600 | 604752
1529812800 | 135252
1529899200 |   63222
1529985600 |   31134
1530072000 |   25392
1530158400 |      90
1530417600 |      48
1530504000 |     144
1530518400 |       1
1530590400 |    1958
1530604800 |       1
1530763200 |   48614
1530849600 |       9
1531022400 |       3
1531108800 |       3
1531195200 |       1
*** This is the range we're interested in
1531281600 |    3713
*******
1531627200 |     300
1531713600 |    1113
1531800000 |      30
1531886400 |   36000
1532232000 |   24900
1532577600 | 409364
1532664000 |    2050
1532836800 | 334782
1532923200 |    6771
(45 rows)

** No analysis was run after import. Let's see how we do.
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1102962.86..1103079.54 rows=1000 width=38) (actual
time=7234.091..7234.697 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=9 read=350293
   -> Gather Merge (cost=1102962.86..1105002.57 rows=17482 width=38)
(actual time=7234.089..7234.589 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=9 read=350293
         -> Sort (cost=1101962.84..1101984.69 rows=8741 width=38)
(actual time=7227.938..7227.985 rows=964 loops=3)
               Output: ctid, (((data ->> 'date'::text))::numeric)
               Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
               Sort Method: quicksort Memory: 102kB
               Buffers: shared hit=144 read=1048851
               Worker 0: actual time=7223.770..7223.820 rows=1272 loops=1
                 Buffers: shared hit=68 read=348293
               Worker 1: actual time=7226.520..7226.589 rows=1299 loops=1
                 Buffers: shared hit=67 read=350265
               -> Parallel Seq Scan on schema0.table0
(cost=0.00..1101390.58 rows=8741 width=38) (actual time=68.862..7227.366
rows=1238 loops=3)
                     Output: ctid, ((data ->> 'date'::text))::numeric
                     Filter: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Rows Removed by Filter: 1397270
                     Buffers: shared hit=32 read=1048849
                     Worker 0: actual time=50.494..7223.107 rows=1272
loops=1
                       Buffers: shared hit=11 read=348293
                     Worker 1: actual time=53.408..7225.849 rows=1299
loops=1
                       Buffers: shared hit=12 read=350263
Planning time: 2.096 ms
Execution time: 7235.867 ms
(28 rows)

** Ok, no statistics, Par Seq Scan is selected, it's expected.
=======================
db0=# VACUUM ANALYZE VERBOSE schema0.table0;
INFO: vacuuming "schema0.table0"
INFO: index "date_idx" now contains 8194 row versions in 45 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 out of 1048881 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.36 s, system: 4.84 s, elapsed: 7.48 s.
INFO: vacuuming "pg_toast.pg_toast_7888880"
INFO: index "pg_toast_7888880_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_7888880": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: analyzing "schema0.table0"
INFO: "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
VACUUM

** Statistics are collected. Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.674..7498.791 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=74 read=411764
   -> Sort (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.672..7498.731 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort Memory: 95kB
         Buffers: shared hit=74 read=411764
         -> Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=58.322..7497.656 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1643390
               Heap Blocks: lossy=411776
               Buffers: shared hit=74 read=411764
               -> Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=40.243..40.243 rows=4117760 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=62
Planning time: 1.388 ms
Execution time: 7498.907 ms
(19 rows)

** Wait... What??? "Rows Removed by Index Recheck: 1643390" but data is
almost sequential! Let's take a look at it.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
   ctid    |    date
-----------+------------
(4456,2) | 1531281600
(4468,2) | 1531281600
(9301,2) | 1531281600
(9310,3) | 1531281600
...
(9317,4) | 1531281600
(9325,3) | 1531281600
(9325,4) | 1531281600
(9326,1) | 1531281600
(9328,1) | 1531281600
(9328,2) | 1531281600
(9345,1) | 1531281600
...
(9365,2) | 1531281600
(9386,2) | 1531281600
(9386,3) | 1531281600
(9386,4) | 1531281600
(9418,4) | 1531281600
...
(9419,4) | 1531281600
(12921,2) | 1531281600
...
(12962,2) | 1531281600
(12983,1) | 1531281600
(12983,2) | 1531281600
(12983,3) | 1531281600
(12983,4) | 1531281600
(13028,3) | 1531281600
(13028,4) | 1531281600
(14113,1) | 1531281600
(14113,2) | 1531281600
(15297,1) | 1531281600
...
(15317,2) | 1531281600
(15350,1) | 1531281600
(15350,2) | 1531281600
(15350,3) | 1531281600
(15350,4) | 1531281600
(15386,3) | 1531281600
(15386,4) | 1531281600
(17620,3) | 1531281600
...
(17648,4) | 1531281600
(17652,1) | 1531281600
...
(17653,2) | 1531281600
(17658,1) | 1531281600
...
(17659,2) | 1531281600
(21229,2) | 1531281600
(21243,1) | 1531281600
...
(21246,4) | 1531281600
(35794,2) | 1531281600
...
(35798,1) | 1531281600
(35827,4) | 1531281600
...
(35887,2) | 1531281600
(37006,3) | 1531281600
...
(37080,1) | 1531281600
(37157,2) | 1531281600
(37157,3) | 1531281600
(37157,4) | 1531281600
(39475,1) | 1531281600
...
(39552,4) | 1531281600
(39623,4) | 1531281600
(39653,1) | 1531281600
(39653,2) | 1531281600
(39653,3) | 1531281600
(39653,4) | 1531281600
(39674,1) | 1531281600
(39674,2) | 1531281600
(39674,3) | 1531281600
(39674,4) | 1531281600
(40829,3) | 1531281600
(40829,4) | 1531281600
(41946,4) | 1531281600
...
(41962,2) | 1531281600
(1000 rows)

** Ok, fine, there are large gaps. Let's make **absolutely sure** the
data is sequential.
=======================
db0=# CREATE INDEX date_ord_idx ON schema0.table0 (((table0.data ->>
'date')::numeric));
CREATE INDEX

** Not running any analysis on a new index, shouldn't matter if stats
are wrong, expecting scan and sort.
=======================
db0=# CLUSTER VERBOSE schema0.table0 USING date_ord_idx;
INFO: clustering "schema0.table0" using sequential scan and sort
INFO: "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 873.12 s, system: 19.18 s, elapsed: 1019.36 s.
CLUSTER

** Ok, now we're clustered. Dropping order b-tree index.
=======================
db0=# DROP INDEX schema0.date_ord_idx ;
DROP INDEX

** Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.584..7339.703 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=56 read=410752
   -> Sort (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.583..7339.634 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort Memory: 95kB
         Buffers: shared hit=56 read=410752
         -> Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=59.252..7338.635 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=56 read=410752
               -> Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=24.545..24.545 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
Planning time: 0.515 ms
Execution time: 7339.772 ms
(19 rows)

** "Rows Removed by Index Recheck: 1639294" - OK, statistics are
pooched post-clustering. Let's reanalyze.
=======================
db0=# ANALYZE VERBOSE schema0.table0;
INFO: analyzing "schema0.table0"
INFO: "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
ANALYZE

** Let's try this again...
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.824..6533.927 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=410808
   -> Sort (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.822..6533.862 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort Memory: 95kB
         Buffers: shared hit=410808
         -> Bitmap Heap Scan on schema0.table0 (cost=162.22..5419.03
rows=555 width=38) (actual time=59.316..6532.913 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=410808
               -> Bitmap Index Scan on date_idx (cost=0.00..162.08
rows=1339 width=0) (actual time=30.859..30.859 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
Planning time: 1.100 ms
Execution time: 6534.026 ms
(19 rows)

** This cannot be possible! The table must be clustered on the same
expression we cluster BRIN! Let's look at the data.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
   ctid    |    date
-----------+------------
(4155,2) | 1531281600
...
(4159,1) | 1531281600
(4188,4) | 1531281600
(4189,1) | 1531281600
(4189,2) | 1531281600
(4189,3) | 1531281600
(4189,4) | 1531281600
(4231,4) | 1531281600
(4247,3) | 1531281600
(4247,4) | 1531281600
(4248,1) | 1531281600
(4248,2) | 1531281600
(5367,3) | 1531281600
...
(5368,4) | 1531281600
(5376,3) | 1531281600
...
(5441,1) | 1531281600
(5518,2) | 1531281600
(5518,3) | 1531281600
(5518,4) | 1531281600
(7836,1) | 1531281600
(7836,2) | 1531281600
(7851,3) | 1531281600
...
(7913,4) | 1531281600
(7984,4) | 1531281600
(8014,1) | 1531281600
(8014,2) | 1531281600
(8014,3) | 1531281600
(8014,4) | 1531281600
(8035,1) | 1531281600
(8035,2) | 1531281600
(8035,3) | 1531281600
(8035,4) | 1531281600
(9190,3) | 1531281600
(9190,4) | 1531281600
(10307,4) | 1531281600
...
(10349,3) | 1531281600
(10371,3) | 1531281600
...
(10392,2) | 1531281600
(10445,4) | 1531281600
(10446,1) | 1531281600
(10446,2) | 1531281600
(11580,3) | 1531281600
...
(11586,2) | 1531281600
(11595,3) | 1531281600
...
(11598,2) | 1531281600
(11600,1) | 1531281600
...
(11601,2) | 1531281600
(11609,1) | 1531281600
...
(11621,2) | 1531281600
(11621,3) | 1531281600
(11664,3) | 1531281600
...
(11684,3) | 1531281600
(11726,2) | 1531281600
...
(11770,4) | 1531281600
(12891,2) | 1531281600
(14057,1) | 1531281600
...
(14058,2) | 1531281600
(14064,3) | 1531281600
...
(14071,3) | 1531281600
(1000 rows)

****

The issue seems to be with both clustering and BRIN stats.

Fillfactor on the table is not a factor, can be 100 can be 75, I'm able
to reproduce and fail to reproduce with both.
Clustering on `date_ord_idx` can be done with ANALYZE and without, in
some cases PG picks scan and sort, sometimes it uses index regardless of
stats.
Sometimes the table is properly clustered, sometimes it isn't
afterwards. Sometimes if you cluster first on the unanalyzed
`date_ord_idx`, clustering doesn't fix the problem, but then analyzing
and clustering on `date_ord_idx` again does fix the issue. Sometimes
neither helps.
Full vacuum analyze may help, or may not help, depending on the phase of
the moon.
When BRIN is actually fixed, the explain has a few hundred rows
eliminated by the recheck, not >1M, and queries are 50 - 70 ms long for
date range.

If anybody can give a suggestion on where in the code to start looking,
I'll be extra-grateful.

--
Arcadiy Ivanov
arcadiy(at)gmail(dot)com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov

In response to

Re: Optimizer misses big in 10.4 with BRIN index at 2018-07-26 11:27:46 from Tomas Vondra

Responses

Re: Optimizer misses big in 10.4 with BRIN index at 2018-08-09 14:44:05 from Emre Hasegeli

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrey Lepikhov	2018-08-07 07:19:35	Re: [WIP] [B-Tree] Retail IndexTuple deletion
Previous Message	Dean Rasheed	2018-08-07 06:52:42	Re: BUG #15307: Low numerical precision of (Co-) Variance