Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

From: Andreas Joseph Krogh <andreas(at)visena(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bogus ANALYZE results for an otherwise-unique column with many nulls
Date: 2016-08-05 09:43:04
Message-ID: VisenaEmail.5b.208c70dd7d62e340.15659a64f41@tc7-visena
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

På fredag 05. august 2016 kl. 01:01:06, skrev Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us
<mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us>>:
I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane
 
(Sorry for HTML-email, but we live in 2016, and I think it improves
readability)
 
Great!
 
I have tested it (with PG-9.6 from HEAD,
e7caacf733f3ee77a555aa715ab6fbf4434e6b52) and it sure looks like it fixes the
problem for my query.
 
The query:
 
explain analyze SELECT log.relation_id as company_id , sum(log.duration) AS
durationFROM onp_crm_activity_log log JOIN onp_crm_person logfor ON
logfor.onp_user_id =log.logged_for AND logfor.is_resource = FALSE WHERE 1 = 1
-- Filter out already invoiced AND NOT EXISTS( SELECT * FROM
onp_crm_calendarentry_invoice_membership cemJOIN onp_crm_invoice_line il ON
cem.invoice_line_id = il.idJOIN onp_crm_invoice inv ON il.invoice_id =
inv.entity_idWHERE cem.calendar_entry_id = log.id AND NOT EXISTS( SELECT * FROM
onp_crm_invoice creditnoteWHERE il.invoice_id = creditnote.credit_against ) )
GROUP BY log.relation_id ;
 
Gave the following explain-plan (before the patch), with enable_nestloop=off
(needed to prevent nest-loop-anti-join which caused this query to execute in
~26 min.):
 
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=25201.62..25220.06 rows=1475 width=36) (actual time
=381.191..381.281rows=404 loops=1) Group Key: log.relation_id -> Hash Anti Join
(cost=4782.34..25148.28 rows=10667 width=12) (actual time=103.683..361.409 rows=
148330loops=1) Hash Cond: (log.id = cem.calendar_entry_id) -> Hash Join (cost
=81.46..20312.73rows=10667 width=20) (actual time=0.100..156.432 rows=380617
loops=1) Hash Cond: (log.logged_for = logfor.onp_user_id) -> Seq Scan on
onp_crm_activity_loglog (cost=0.00..18696.71 rows=380771 width=24) (actual time
=0.005..49.195rows=380774 loops=1) -> Hash (cost=25.02..25.02 rows=4515 width=8
) (actualtime=0.078..0.078 rows=119 loops=1) Buckets: 8192 Batches: 1 Memory
Usage: 69kB -> Index Scan using onp_crm_person_onp_id_idx on onp_crm_person
logfor (cost=0.14..25.02 rows=4515 width=8) (actual time=0.005..0.063 rows=119
loops=1) Filter: (NOT is_resource) Rows Removed by Filter: 8 -> Hash (cost
=4700.87..4700.87rows=1 width=4) (actual time=103.575..103.575 rows=232412
loops=1) Buckets: 262144 (originally 1024) Batches: 1 (originally 1) Memory
Usage: 10219kB -> Hash Join (cost=474.41..4700.87 rows=1 width=4) (actual time
=9.724..76.015rows=232412 loops=1) Hash Cond: (cem.invoice_line_id = il.id) ->
Seq Scanon onp_crm_calendarentry_invoice_membership cem (cost=0.00..3354.28 rows
=232528 width=8) (actual time=0.004..17.626 rows=232528 loops=1) -> Hash (cost
=474.40..474.40rows=1 width=4) (actual time=9.710..9.710 rows=11535 loops=1)
Buckets:16384 (originally 1024) Batches: 1 (originally 1) Memory Usage: 534kB
-> MergeJoin (cost=415.33..474.40 rows=1 width=4) (actual time=4.149..8.467 rows
=11535 loops=1) Merge Cond: (il.invoice_id = inv.entity_id) -> Sort (cost
=415.05..415.06rows=1 width=8) (actual time=4.138..4.701 rows=11535 loops=1)
SortKey: il.invoice_id Sort Method: quicksort Memory: 925kB -> Hash Anti Join (
cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
HashCond: (il.invoice_id = creditnote.credit_against) -> Seq Scan on
onp_crm_invoice_line il (cost=0.00..294.30 rows=11630 width=8) (actual time
=0.003..0.995rows=11630 loops=1) -> Hash (cost=50.38..50.38 rows=2140 width=4)
(actualtime=0.247..0.247 rows=37 loops=1) Buckets: 4096 Batches: 1 Memory Usage:
34kB -> Index Only Scan using origo_invoice_credit_against_idx on
onp_crm_invoice creditnote (cost=0.28..50.38 rows=2140 width=4) (actual time
=0.013..0.182rows=2140 loops=1) Heap Fetches: 0 -> Index Only Scan using
origo_invoice_id_status_sent_idxon onp_crm_invoice inv (cost=0.28..53.98 rows=
2140width=8) (actual time=0.008..1.274 rows=11576 loops=1) Heap Fetches: 0
Planningtime: 0.955 ms Execution time: 381.884 ms (35 rows)
 
AFAIU, this is the problematic estimate: -> Hash Anti Join (cost=77.13..415.04
rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
Right?
 
now (after the patch, and without needing to disable nest_loop) gives the
following explain-plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=44409.89..44428.47 rows=1486 width=36) (actual time
=366.502..366.594rows=404 loops=1) Group Key: log.relation_id -> Hash Anti Join
(cost=10157.30..43650.11 rows=151956 width=12) (actual time=99.420..347.309 rows
=148327 loops=1) Hash Cond: (log.id = cem.calendar_entry_id) -> Merge Join (cost
=0.56..26729.34rows=379984 width=20) (actual time=0.011..150.415 rows=380614
loops=1) Merge Cond: (logfor.onp_user_id = log.logged_for) -> Index Scan using
onp_crm_person_onp_id_idxon onp_crm_person logfor (cost=0.14..25.02 rows=4515
width=8) (actual time=0.006..0.030 rows=35 loops=1) Filter: (NOT is_resource)
RowsRemoved by Filter: 2 -> Index Scan using origo_activity_log_logged_for_idx
ononp_crm_activity_log log (cost=0.42..21977.16 rows=380771 width=24) (actual
time=0.003..69.406 rows=380623 loops=1) -> Hash (cost=7300.48..7300.48 rows=
228500width=4) (actual time=99.226..99.226 rows=232412 loops=1) Buckets: 262144
Batches:1 Memory Usage: 10219kB -> Hash Join (cost=789.13..7300.48 rows=228500
width=4) (actual time=8.043..67.079 rows=232412 loops=1) Hash Cond:
(cem.invoice_line_id = il.id) -> Seq Scanon
onp_crm_calendarentry_invoice_membership cem (cost=0.00..3354.28 rows=232528
width=8) (actual time=0.007..16.230 rows=232528 loops=1) -> Hash (cost
=646.27..646.27rows=11429 width=4) (actual time=8.003..8.003 rows=11535 loops=1
) Buckets:16384 Batches: 1 Memory Usage: 534kB -> Merge Anti Join (cost
=0.85..646.27rows=11429 width=4) (actual time=0.021..6.393 rows=11535 loops=1)
Merge Cond: (il.invoice_id = creditnote.credit_against) -> MergeJoin (cost
=0.57..613.99rows=11630 width=8) (actual time=0.010..4.771 rows=11630 loops=1)
Merge Cond: (inv.entity_id = il.invoice_id) ->Index Only Scan using
origo_invoice_id_status_sent_idxon onp_crm_invoice inv (cost=0.28..53.98 rows=
2140width=8) (actual time=0.006..0.318 rows=2140 loops=1) Heap Fetches: 0 ->
IndexScan using onp_crm_invoice_line_invoice_id_lineno_key on
onp_crm_invoice_line il (cost=0.29..409.28 rows=11630 width=8) (actual time
=0.003..2.323rows=11630 loops=1) -> Index Only Scan using
onp_crm_invoice_credit_against_keyon onp_crm_invoice creditnote (cost
=0.28..47.98rows=2140 width=4) (actual time=0.011..0.016 rows=38 loops=1) Heap
Fetches:0 Planning time: 0.931 ms Execution time: 366.710 ms (28 rows)
 
And, for fun with max_parallel_workers_per_gather = 6
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=39788.73..39851.89 rows=1486 width=36) (actual
time=234.289..234.747 rows=404 loops=1) Group Key: log.relation_id -> Sort (cost
=39788.73..39799.88rows=4458 width=36) (actual time=234.276..234.307 rows=1000
loops=1) Sort Key: log.relation_id Sort Method: quicksort Memory: 103kB ->
Gather (cost=39054.15..39518.53 rows=4458 width=36) (actual time
=233.012..234.110rows=1000 loops=1) Workers Planned: 3 Workers Launched: 3 ->
PartialHashAggregate (cost=38054.15..38072.73 rows=1486 width=36) (actual time
=225.607..225.701rows=250 loops=4) Group Key: log.relation_id -> Hash Join (cost
=10238.19..37294.37rows=151956 width=12) (actual time=163.060..219.652 rows=
37082loops=4) Hash Cond: (log.logged_for = logfor.onp_user_id) -> Hash Anti Join
(cost=10156.73..28460.48 rows=152271 width=16) (actual time=162.618..212.351
rows=37121 loops=4) Hash Cond: (log.id = cem.calendar_entry_id) -> Parallel Seq
Scanon onp_crm_activity_log log (cost=0.00..16117.29 rows=122829 width=24)
(actualtime=0.010..15.532 rows=95193 loops=4) -> Hash (cost=7300.48..7300.48
rows=228500 width=4) (actual time=161.270..161.270 rows=232412 loops=4) Buckets:
262144Batches: 1 Memory Usage: 10219kB -> Hash Join (cost=789.13..7300.48 rows=
228500width=4) (actual time=18.276..112.034 rows=232412 loops=4) Hash Cond:
(cem.invoice_line_id = il.id) -> Seq Scanon
onp_crm_calendarentry_invoice_membership cem (cost=0.00..3354.28 rows=232528
width=8) (actual time=0.045..26.061 rows=232528 loops=4) -> Hash (cost
=646.27..646.27rows=11429 width=4) (actual time=18.116..18.116 rows=11535 loops=
4) Buckets: 16384 Batches: 1 Memory Usage: 534kB -> Merge Anti Join (cost
=0.85..646.27rows=11429 width=4) (actual time=0.179..14.505 rows=11535 loops=4)
Merge Cond: (il.invoice_id = creditnote.credit_against) -> MergeJoin (cost
=0.57..613.99rows=11630 width=8) (actual time=0.113..10.871 rows=11630 loops=4)
Merge Cond: (inv.entity_id = il.invoice_id) ->Index Only Scan using
origo_invoice_id_status_sent_idxon onp_crm_invoice inv (cost=0.28..53.98 rows=
2140width=8) (actual time=0.063..0.811 rows=2140 loops=4) Heap Fetches: 0 ->
IndexScan using onp_crm_invoice_line_invoice_id_lineno_key on
onp_crm_invoice_line il (cost=0.29..409.28 rows=11630 width=8) (actual time
=0.046..5.213rows=11630 loops=4) -> Index Only Scan using
onp_crm_invoice_credit_against_keyon onp_crm_invoice creditnote (cost
=0.28..47.98rows=2140 width=4) (actual time=0.063..0.071 rows=38 loops=4) Heap
Fetches:0 -> Hash (cost=25.02..25.02 rows=4515 width=8) (actual time
=0.325..0.325rows=119 loops=4) Buckets: 8192 Batches: 1 Memory Usage: 69kB ->
IndexScan using onp_crm_person_onp_id_idx on onp_crm_person logfor (cost
=0.14..25.02rows=4515 width=8) (actual time=0.032..0.270 rows=119 loops=4)
Filter: (NOT is_resource) Rows Removed by Filter: 8 Planning time: 2.443 ms
Executiontime: 239.979 ms (38 rows)
 
All in all, thanks for looking into this, and +1 for backpatching to 9.5.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas(at)visena(dot)com <mailto:andreas(at)visena(dot)com>
www.visena.com <https://www.visena.com>
<https://www.visena.com>

 

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2016-08-05 10:25:45 Re: [RFC] Change the default of update_process_title to off
Previous Message Murat Tuncer 2016-08-05 09:21:40 truncate trigger for foreign data wrappers