BRIN index which is much faster never chosen by planner

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Michael Lewis <mlewis(at)entrata(dot)com>
Subject: BRIN index which is much faster never chosen by planner
Date: 2019-10-10 21:58:11
Message-ID: CAMa1XUj8WLszW8L2piuPZ88hh+CWitFu-qtf-6TR5rPw5xvhMA@mail.gmail.com
Lists: pgsql-hackers

Good Afternoon,

I posted about this on another thread here
<https://www.postgresql.org/message-id/CAMa1XUiH3hL3KGwdKGjnJdJeo2A5H1o1uhtXWBkmMqixrDCWMA@mail.gmail.com>,
but the topic was not precisely planner issues, so I wanted to post it here.

I am running Postgres 11.5. I have a table that is insert-only and has 312
million rows; it is also pruned continuously so that it retains only the past
year of data. The size of the table is 223 GB with indexes, 140 GB without.
One of the fields is rec_insert_time timestamptz. Here are all the potentially
relevant table stats:

schemaname | foo
relname | log_table
n_tup_ins | 86850506
n_tup_upd | 0
n_tup_del | 68916115
n_tup_hot_upd | 0
n_live_tup | 312810691
n_dead_tup | 9405132
n_mod_since_analyze | 11452
last_vacuum | 2019-09-20 09:41:43.78383-05
last_autovacuum | 2019-10-04 13:56:16.810355-05
last_analyze | 2019-10-10 09:34:26.807695-05
last_autoanalyze |
vacuum_count | 2
autovacuum_count | 1
analyze_count | 13
autoanalyze_count | 0
total_relation_size | 223 GB
relation_size | 139 GB
table_size | 140 GB
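
(For reference, stats of this shape can be pulled from pg_stat_user_tables
together with the relation-size functions, e.g. something like:)

SELECT s.schemaname, s.relname,
       s.n_tup_ins, s.n_tup_upd, s.n_tup_del, s.n_tup_hot_upd,
       s.n_live_tup, s.n_dead_tup, s.n_mod_since_analyze,
       s.last_vacuum, s.last_autovacuum, s.last_analyze, s.last_autoanalyze,
       s.vacuum_count, s.autovacuum_count, s.analyze_count, s.autoanalyze_count,
       pg_size_pretty(pg_total_relation_size(s.relid)) AS total_relation_size,
       pg_size_pretty(pg_relation_size(s.relid))       AS relation_size,
       pg_size_pretty(pg_table_size(s.relid))          AS table_size
FROM pg_stat_user_tables s
WHERE s.relname = 'log_table';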

I have a simple query looking at the past 10 days based on rec_insert_time,
and the planner will not choose the BRIN index under any of several
configurations. Here are all the relevant indexes (I intentionally do not have
a btree on rec_insert_time because I believe BRIN *should* fit better here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product,
rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH
(pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH
(pages_per_range='1000')
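
(For anyone reproducing this: the BRIN indexes take pages_per_range as a
storage parameter. For example, the 1000-pages-per-range index above is
created like this:)

CREATE INDEX rec_insert_time_brin_1000 ON foo.log_table
    USING brin (rec_insert_time) WITH (pages_per_range = 1000);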

And here is the SQL:
SELECT
    category,
    source,
    MIN(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS first_source_time,
    MAX(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
          category, source(field1) AS source, rec_insert_time
      FROM log_table l
      INNER JOIN public.small_join_table filter ON filter.category = l.category
      WHERE field1 IS NOT NULL
        AND l.category = 'music'
        AND l.rec_insert_time >= now() - interval '10 days'
      ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

The planner chooses a seq scan on log_table every time, in spite of these
BRIN indexes on rec_insert_time.
@Michael Lewis <mlewis(at)entrata(dot)com> had suggested I check
default_statistics_target for this column. I raised it to 5000 for this
column, and the planner still chooses a seq scan.
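
(Concretely, raising a per-column target looks like this, assuming the column
in question is rec_insert_time, and is followed by a fresh ANALYZE:)

ALTER TABLE foo.log_table ALTER COLUMN rec_insert_time SET STATISTICS 5000;
ANALYZE foo.log_table;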

Here is the default chosen plan (it takes 2 minutes 12 seconds):
WindowAgg (cost=24437881.80..24437897.70 rows=707 width=120) (actual time=132173.173..132173.222 rows=53 loops=1)
  Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
  Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
  -> Sort (cost=24437881.80..24437883.56 rows=707 width=104) (actual time=132173.146..132173.149 rows=53 loops=1)
        Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        Sort Method: quicksort Memory: 32kB
        Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
        -> Subquery Scan on unique_cases (cost=24437834.20..24437848.34 rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
              Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
              Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
              -> Unique (cost=24437834.20..24437841.27 rows=707 width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
                    Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                    Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                    -> Sort (cost=24437834.20..24437835.96 rows=707 width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
                          Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          Sort Method: quicksort Memory: 90kB
                          Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                          -> Nested Loop (cost=0.00..24437800.73 rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
                                Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                Inner Unique: true
                                Join Filter: ((l.category)::text = filter.category)
                                Rows Removed by Join Filter: 346704
                                Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
                                -> Seq Scan on foo.log_table l (cost=0.00..24420483.80 rows=707 width=99) (actual time=4095.763..132112.686 rows=466 loops=1)
                                      Output: <hidden>
                                      Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name) AND (l.rec_insert_time >= (now() - '10 days'::interval)))
                                      Rows Removed by Filter: 312830265
                                      Buffers: shared hit=391675 read=17772636 dirtied=4679 written=7
                                -> Materialize (cost=0.00..33.98 rows=1399 width=8) (actual time=0.001..0.036 rows=745 loops=466)
                                      Output: filter.category
                                      Buffers: shared hit=1 read=6
                                      -> Seq Scan on public.small_join_table filter (cost=0.00..26.99 rows=1399 width=8) (actual time=0.054..0.189 rows=745 loops=1)
                                            Output: filter.category
                                            Buffers: shared hit=1 read=6
Planning Time: 0.552 ms
Execution Time: 132173.657 ms
(38 rows)

Here is the plan I get when I turn off seqscan and indexscan!!! (21
seconds):
SET enable_seqscan TO false;
SET enable_indexscan TO false;

WindowAgg (cost=24363224.32..24363240.85 rows=735 width=120) (actual time=21337.992..21338.040 rows=53 loops=1)
  Output: unique_cases.category, unique_cases.source, min(unique_cases.rec_insert_time) OVER (?), max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
  Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
  -> Sort (cost=24363224.32..24363226.15 rows=735 width=104) (actual time=21337.965..21337.968 rows=53 loops=1)
        Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        Sort Method: quicksort Memory: 32kB
        Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
        -> Subquery Scan on unique_cases (cost=24363174.62..24363189.32 rows=735 width=104) (actual time=21337.777..21337.889 rows=53 loops=1)
              Output: unique_cases.source, unique_cases.rec_insert_time, unique_cases.category
              Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
              -> Unique (cost=24363174.62..24363181.97 rows=735 width=124) (actual time=21337.772..21337.874 rows=53 loops=1)
                    Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                    Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                    -> Sort (cost=24363174.62..24363176.46 rows=735 width=124) (actual time=21337.767..21337.791 rows=466 loops=1)
                          Output: l.category, (source(l.field1)), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          Sort Method: quicksort Memory: 90kB
                          Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                          -> Nested Loop (cost=2393.31..24363139.63 rows=735 width=124) (actual time=824.212..21336.263 rows=466 loops=1)
                                Output: l.category, source(l.field1), l.rec_insert_time, l.brand_id, l.last_change, l.log_id
                                Inner Unique: true
                                Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
                                -> Bitmap Heap Scan on foo.log_table l (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
                                      Output: <hidden>
                                      Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                      Rows Removed by Index Recheck: 8187584
                                      Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
                                      Rows Removed by Filter: 19857107
                                      Heap Blocks: lossy=1509000
                                      Buffers: shared hit=73 read=1509030 dirtied=121 written=1631
                                      -> Bitmap Index Scan on rec_insert_time_brin_1000 (cost=0.00..2391.52 rows=157328135 width=0) (actual time=72.391..72.391 rows=15090000 loops=1)
                                            Index Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
                                            Buffers: shared hit=29 read=74
                                -> Bitmap Heap Scan on public.small_join_table filter (cost=1.60..3.12 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=466)
                                      Output: filter.category, filter.type, filter.location
                                      Recheck Cond: (filter.category = (l.category)::text)
                                      Heap Blocks: exact=466
                                      Buffers: shared hit=1398
                                      -> Bitmap Index Scan on small_join_table_pkey (cost=0.00..1.60 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=466)
                                            Index Cond: (filter.category = (l.category)::text)
                                            Buffers: shared hit=932
Planning Time: 1.869 ms
Execution Time: 21338.244 ms
(44 rows)

Notice that it chooses the BRIN index with 1000 pages per range (the smallest
of the three), and this is far faster than the seq scan.
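
(For context, BRIN indexes like these are typically tiny next to the composite
btree; the per-index sizes can be compared with something like:)

SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'log_table'
ORDER BY pg_relation_size(indexrelid) DESC;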

I do believe the row estimate is way off. A plain EXPLAIN of the latter plan
estimates 10x more rows than the query actually returns:
WindowAgg (cost=24354689.19..24354705.07 rows=706 width=120)
  -> Sort (cost=24354689.19..24354690.95 rows=706 width=104)
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        -> Subquery Scan on unique_cases (cost=24354641.66..24354655.78 rows=706 width=104)
              -> Unique (cost=24354641.66..24354648.72 rows=706 width=124)
                    -> Sort (cost=24354641.66..24354643.42 rows=706 width=124)
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          -> Nested Loop (cost=2385.42..24354608.25 rows=706 width=124)
                                -> Bitmap Heap Scan on log_table l (cost=2383.82..24352408.26 rows=706 width=99)
                                      Recheck Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                      Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
                                      -> Bitmap Index Scan on rec_insert_time_brin_1000 (cost=0.00..2383.64 rows=156577455 width=0)
                                            Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
                                -> Bitmap Heap Scan on small_join_table filter (cost=1.60..3.12 rows=1 width=8)
                                      Recheck Cond: (category = (l.category)::text)
                                      -> Bitmap Index Scan on small_join_table_pkey (cost=0.00..1.60 rows=1 width=0)
                                            Index Cond: (category = (l.category)::text)
(17 rows)

Here is a plain EXPLAIN of the default chosen plan:
WindowAgg (cost=24437857.18..24437873.07 rows=706 width=120)
  -> Sort (cost=24437857.18..24437858.95 rows=706 width=104)
        Sort Key: unique_cases.source, unique_cases.rec_insert_time
        -> Subquery Scan on unique_cases (cost=24437809.66..24437823.78 rows=706 width=104)
              -> Unique (cost=24437809.66..24437816.72 rows=706 width=124)
                    -> Sort (cost=24437809.66..24437811.42 rows=706 width=124)
                          Sort Key: l.brand_id, l.last_change, l.log_id, l.rec_insert_time DESC
                          -> Nested Loop (cost=0.00..24437776.25 rows=706 width=124)
                                Join Filter: ((l.category)::text = filter.category)
                                -> Seq Scan on log_table l (cost=0.00..24420483.80 rows=706 width=99)
                                      Filter: ((field1 IS NOT NULL) AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10 days'::interval)))
                                -> Materialize (cost=0.00..33.98 rows=1399 width=8)
                                      -> Seq Scan on small_join_table filter (cost=0.00..26.99 rows=1399 width=8)
(13 rows)
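
(For what it's worth, the per-column statistics that feed these row estimates
can be inspected directly, e.g.:)

SELECT attname, n_distinct, correlation, histogram_bounds
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename = 'log_table'
  AND attname = 'rec_insert_time';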

Any insight into this is much appreciated. This is just one example of
many similar issues I have been finding in getting BRIN indexes to scale
predictably with insert-only workloads.

Thanks!
Jeremy
