Re: Out of Memory errors are frustrating as heck!

From: Gunther <raj(at)gusw(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-23 20:37:50
Message-ID: 2d422e7e-3105-3320-f3d5-801bd6a8b27d@gusw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 4/21/2019 23:09, Tomas Vondra wrote:
> What I think might work better is the attached v2 of the patch, with a
> single top-level condition, comparing the combined memory usage
> (spaceUsed + BufFile) against spaceAllowed. But it also tweaks
> spaceAllowed once the size needed for BufFile gets over work_mem/3.

Thanks for this, and I am trying this now.

So far it is promising.

I see the memory footprint contained under 1 GB. I see it go up, but
also down again. CPU, IO, all being live.

foo=# set enable_nestloop to off; SET foo=# explain analyze select *
from reports.v_BusinessOperation; WARNING: ExecHashIncreaseNumBatches:
nbatch=32 spaceAllowed=4194304 WARNING: ExecHashIncreaseNumBatches:
nbatch=64 spaceAllowed=4194304 WARNING: ExecHashIncreaseNumBatches:
nbatch=128 spaceAllowed=4194304 WARNING: ExecHashIncreaseNumBatches:
nbatch=256 spaceAllowed=6291456 WARNING: ExecHashIncreaseNumBatches:
nbatch=512 spaceAllowed=12582912 WARNING: ExecHashIncreaseNumBatches:
nbatch=1024 spaceAllowed=25165824 WARNING: ExecHashIncreaseNumBatches:
nbatch=2048 spaceAllowed=50331648 WARNING: ExecHashIncreaseNumBatches:
nbatch=4096 spaceAllowed=100663296 WARNING: ExecHashIncreaseNumBatches:
nbatch=8192 spaceAllowed=201326592 WARNING: ExecHashIncreaseNumBatches:
nbatch=16384 spaceAllowed=402653184 WARNING: ExecHashIncreaseNumBatches:
nbatch=32768 spaceAllowed=805306368 WARNING: ExecHashIncreaseNumBatches:
nbatch=65536 spaceAllowed=1610612736

Aaaaaand, it's a winner!

Unique (cost=5551524.36..5554207.33 rows=34619 width=1197) (actual
time=6150303.060..6895451.210 rows=435274 loops=1) -> Sort
(cost=5551524.36..5551610.91 rows=34619 width=1197) (actual
time=6150303.058..6801372.192 rows=113478386 loops=1) Sort Key:
documentinformationsubject.documentinternalid,
documentinformationsubject.is_current,
documentinformationsubject.documentid,
documentinformationsubject.documenttypecode,
documentinformationsubject.subjectroleinternalid,
documentinformationsubject.subjectentityinternalid,
documentinformationsubject.subjectentityid,
documentinformationsubject.subjectentityidroot,
documentinformationsubject.subjectentityname,
documentinformationsubject.subjectentitytel,
documentinformationsubject.subjectentityemail,
documentinformationsubject.otherentityinternalid,
documentinformationsubject.confidentialitycode,
documentinformationsubject.actinternalid,
documentinformationsubject.code_code,
documentinformationsubject.code_displayname, q.code_code,
q.code_displayname, an.extension, an.root,
documentinformationsubject_2.subjectentitycode,
documentinformationsubject_2.subjectentitycodesystem,
documentinformationsubject_2.effectivetime_low,
documentinformationsubject_2.effectivetime_high,
documentinformationsubject_2.statuscode,
documentinformationsubject_2.code_code, agencyid.extension,
agencyname.trivialname, documentinformationsubject_1.subjectentitycode,
documentinformationsubject_1.subjectentityinternalid Sort Method:
external merge Disk: 40726720kB -> Hash Right Join
(cost=4255031.53..5530808.71 rows=34619 width=1197) (actual
time=325240.679..1044194.775 rows=113478386 loops=1) Hash Cond:
(((q.documentinternalid)::text =
(documentinformationsubject.documentinternalid)::text) AND
((r.targetinternalid)::text =
(documentinformationsubject.actinternalid)::text)) -> Hash Right Join
(cost=1341541.37..2612134.36 rows=13 width=341) (actual
time=81093.327..81093.446 rows=236 loops=1) Hash Cond:
(((documentinformationsubject_2.documentinternalid)::text =
(q.documentinternalid)::text) AND
((documentinformationsubject_2.actinternalid)::text =
(q.actinternalid)::text)) -> Gather (cost=31291.54..1301884.52 rows=1
width=219) (actual time=41920.563..41929.780 rows=0 loops=1) Workers
Planned: 2 Workers Launched: 2 -> Parallel Hash Left Join
(cost=30291.54..1300884.42 rows=1 width=219) (actual
time=41915.960..41915.960 rows=0 loops=3) Hash Cond:
((documentinformationsubject_2.otherentityinternalid)::text =
(agencyid.entityinternalid)::text) -> Parallel Hash Left Join
(cost=28606.13..1299199.00 rows=1 width=204) (actual
time=41862.767..41862.769 rows=0 loops=3) Hash Cond:
((documentinformationsubject_2.otherentityinternalid)::text =
(agencyname.entityinternalid)::text) -> Parallel Seq Scan on
documentinformationsubject documentinformationsubject_2
(cost=0.00..1268800.85 rows=1 width=177) (actual
time=40805.337..40805.337 rows=0 loops=3) Filter:
((participationtypecode)::text = 'AUT'::text) Rows Removed by Filter:
2815562 -> Parallel Hash (cost=24733.28..24733.28 rows=166628 width=64)
(actual time=981.000..981.001 rows=133303 loops=3) Buckets: 65536
Batches: 16 Memory Usage: 3136kB -> Parallel Seq Scan on bestname
agencyname (cost=0.00..24733.28 rows=166628 width=64) (actual
time=0.506..916.816 rows=133303 loops=3) -> Parallel Hash
(cost=1434.07..1434.07 rows=20107 width=89) (actual time=52.350..52.350
rows=11393 loops=3) Buckets: 65536 Batches: 1 Memory Usage: 4680kB ->
Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107
width=89) (actual time=0.376..46.875 rows=11393 loops=3) -> Hash
(cost=1310249.63..1310249.63 rows=13 width=233) (actual
time=39172.682..39172.682 rows=236 loops=1) Buckets: 1024 Batches: 1
Memory Usage: 70kB -> Hash Right Join (cost=829388.20..1310249.63
rows=13 width=233) (actual time=35084.850..39172.545 rows=236 loops=1)
Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq
Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual
time=0.908..7583.123 rows=14676871 loops=1) -> Hash
(cost=829388.19..829388.19 rows=1 width=136) (actual
time=29347.614..29347.614 rows=236 loops=1) Buckets: 1024 Batches: 1
Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1
width=136) (actual time=23902.428..29347.481 rows=236 loops=1) Workers
Planned: 2 Workers Launched: 2 -> Parallel Hash Join
(cost=380928.46..828388.09 rows=1 width=136) (actual
time=23915.790..29336.452 rows=79 loops=3) Hash Cond:
((q.actinternalid)::text = (r.sourceinternalid)::text) -> Parallel Seq
Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=99)
(actual time=10055.238..15484.478 rows=87921 loops=3) Filter:
(((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text =
'EVN'::text) AND ((code_codesystem)::text =
'2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 1540625 ->
Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual
time=13825.726..13825.726 rows=79 loops=3) Buckets: 1024 Batches: 1
Memory Usage: 112kB -> Parallel Seq Scan on actrelationship r
(cost=0.00..380928.44 rows=1 width=74) (actual time=5289.948..13825.576
rows=79 loops=3) Filter: ((typecode)::text = 'SUBJ'::text) Rows Removed
by Filter: 3433326 -> Hash (cost=2908913.87..2908913.87 rows=34619
width=930) (actual time=244145.322..244145.322 rows=113478127 loops=1)
Buckets: 8192 (originally 8192) Batches: 65536 (originally 16) Memory
Usage: 1204250kB -> Gather Merge (cost=2892141.40..2908913.87 rows=34619
width=930) (actual time=75215.333..145622.427 rows=113478127 loops=1)
Workers Planned: 2 Workers Launched: 2 -> Merge Left Join
(cost=2891141.37..2903917.96 rows=14425 width=930) (actual
time=75132.988..99411.448 rows=37826042 loops=3) Merge Cond:
(((documentinformationsubject.documentinternalid)::text =
(documentinformationsubject_1.documentinternalid)::text) AND
((documentinformationsubject.documentid)::text =
(documentinformationsubject_1.documentid)::text) AND
((documentinformationsubject.actinternalid)::text =
(documentinformationsubject_1.actinternalid)::text)) -> Sort
(cost=1301590.26..1301626.32 rows=14425 width=882) (actual
time=39801.337..40975.780 rows=231207 loops=3) Sort Key:
documentinformationsubject.documentinternalid,
documentinformationsubject.documentid,
documentinformationsubject.actinternalidct_1.documentid,
documentinformationsubject_1.actinternalid Sort Method: external merge
Disk: 169768kB Worker 0: Sort Method: external merge Disk: 169768kB
Worker 1: Sort Method: external merge Disk: 169768kB -> Seq Scan on
documentinformationsubject documentinformationsubject_1
(cost=0.00..1329868.64 rows=1010585 width=159) (actual
time=23401.537..31758.042 rows=1031106 loops=3) Filter:
((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter:
7415579 Planning Time: 40.559 ms Execution Time: 6896581.566 ms (70 rows)

For the first time this query has succeeded now. Memory was bounded. The
time of nearly hours is crazy, but things sometimes take that long. The
important thing was not to get an out of memory error.

Thank you. Anything else you want to try, I can do it.

regards,
-Gunther

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2019-04-23 20:43:48 Re: Out of Memory errors are frustrating as heck!
Previous Message Tom Lane 2019-04-22 17:57:34 Re: Out of Memory errors are frustrating as heck!