Re: Out of Memory errors are frustrating as heck!

From: Gunther <raj(at)gusw(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-18 03:52:44
Message-ID: 6083f313-2610-53ec-66f5-30d4b4425341@gusw.net
Lists: pgsql-performance

Hi guys. I don't want to be pushy, but I found it strange that after so
much lively back and forth getting to the bottom of this, my follow-up
from last night suddenly remained completely without a reply. I wonder if
it was even received. For those who read their email with modern readers
(I know, I too am from a time when I wrote everything in plain text), I
have marked some important questions in bold.

On 4/16/2019 11:30, Tom Lane wrote:
>> Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
>> 718 {
>> (gdb) bt 8
>> #0 AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
>> #1 0x000000000084e8ad in palloc0 (size=size(at)entry=8) at mcxt.c:969
>> #2 0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles(at)entry=1) at buffile.c:119
>> #3 0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
>> #4 BufFileCreateTemp (interXact=interXact(at)entry=false) at buffile.c:201
>> #5 0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
>> #6 0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable(at)entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>)
>> at nodeHash.c:1663
>> #7 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the
> "slow growth" behavior you're seeing by periodic decisions to
> increase the number of batches.
>
> You might try watching calls to ExecHashIncreaseNumBatches
> and see if that theory holds water.

OK, checking that ... well yes, it quickly breaks into that. Here is
one backtrace:

#0 ExecHashIncreaseNumBatches (hashtable=hashtable(at)entry=0x2ae8ca8) at nodeHash.c:893
#1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable(at)entry=0x2ae8ca8, slot=slot(at)entry=0x2ae0238,
hashvalue=<optimized out>) at nodeHash.c:1655
#2 0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186
#3 MultiExecHash (node=node(at)entry=0x2ac6dc8) at nodeHash.c:114
#4 0x00000000005fe42f in MultiExecProcNode (node=node(at)entry=0x2ac6dc8) at execProcnode.c:501
#5 0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290
#6 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#8 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#9 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)
at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt(at)entry=0x2a787f8, into=into(at)entry=0x0, es=es(at)entry=0x28f1048,
queryString=<optimized out>, params=0x0, queryEnv=queryEnv(at)entry=0x0, planduration=0x7ffcbf930080) at explain.c:535

But this is still in the warm-up phase; we don't know yet whether this is
the place where memory grows too much.
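
(For reference, the gdb workflow I use for the counting below is just
this; it is only a sketch, and the breakpoint number, ignore counts, and
backend pid are of course specific to my session:)

gdb -p <backend pid>
(gdb) break ExecHashIncreaseNumBatches
(gdb) cont 100              # continue, stopping again only on the 100th hit
(gdb) bt 8                  # when it stops, check where in the plan we are
(gdb) ignore <n> 10000      # or: skip the next 10000 hits of breakpoint <n>
(gdb) cont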

> This could only happen with a very unfriendly distribution of the
> hash keys, I think. There's a heuristic in there to shut off
> growth of nbatch if we observe that we're making no progress at
> all, but perhaps this is a skewed distribution that's not quite
> skewed enough to trigger that.

Your hunch is pretty much right on. There is something very weirdly
distributed in this particular join situation.

Let's see if I can count the occurrences ... I do cont 100. Now resident
memory slowly grows, but not by much, to just 122 MB, and CPU is at 88%. I
think we haven't hit the problematic part of the plan yet. There is a
sort/merge at some leaf, which I believe is innocent. My gut feeling, given
the high CPU%, is that we are in one of those, since nested loops are
disabled.

The next stage is that memory shoots up to 264 MB and CPU% drops to 8.6.
Heavy IO (write and read).

Yes! And now we are entering the 3rd stage, where memory shoots up to
600 MB. This is where it starts "breaking out". And only now are the 100
breakpoint conts used up. And within a second another 100. And even 1000
go by in a second. cont 10000 goes by in 4 seconds. And during that time
resident memory increased to over 700 MB. Let's measure:

736096 kB + cont 10000 --> 740056 kB, that is 3960 kB for 10000 conts, or
0.396 kB (about 400 bytes) per cont. Prediction: the next cont 10000 will
arrive at 744016 kB? Aaaand ... BINGO! 744016 exactly! cont 50000 should
take about 20 seconds and boost memory to 763816 kB. Bets are on ...
drumroll ... 35, 36, ... nope. This time it didn't pan out. Breakpoint
already hit 75727 times, ignore next 5585 hits ... memory now 984052 kB.
So it took longer this time and the memory increment was larger. We are
getting toward the edge of the cliff now. Before we go over it, here is
the current backtrace:

#0 ExecHashIncreaseNumBatches (hashtable=hashtable(at)entry=0x2ae8ca8) at nodeHash.c:893
#1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable(at)entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)
at nodeHash.c:1655
#2 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3 ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
(More stack frames follow...)
(gdb) bt 18
#0 ExecHashIncreaseNumBatches (hashtable=hashtable(at)entry=0x2ae8ca8) at nodeHash.c:893
#1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable(at)entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)
at nodeHash.c:1655
#2 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3 ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#8 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#9 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)
at execMain.c:1723
#14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt(at)entry=0x2a787f8, into=into(at)entry=0x0, es=es(at)entry=0x28f1048,
queryString=<optimized out>, params=0x0, queryEnv=queryEnv(at)entry=0x0, planduration=0x7ffcbf930080) at explain.c:535
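
To put numbers on the batch-explosion theory, I figure I can also inspect
the hash table state while sitting at this breakpoint. This is only a
sketch from my reading of HashJoinTableData in
src/include/executor/hashjoin.h, so correct me if these are not the right
fields:

(gdb) frame 0                          # ExecHashIncreaseNumBatches, "hashtable" is in scope
(gdb) print hashtable->nbatch          # current number of batches
(gdb) print hashtable->nbatch_original
(gdb) print hashtable->nbuckets
(gdb) print hashtable->spaceUsed       # estimated bytes used vs. the allowed limit
(gdb) print hashtable->spaceAllowed
(gdb) print hashtable->growEnabled     # presumably the flag behind the heuristic Tom mentioned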

I ran the EXPLAIN ANALYZE of the plan while removing all the final result
columns from the outer-most SELECT, replacing them with a simple SELECT 1
FROM .... Here is that plan. I am presenting it to you because you might
glean something about whatever skewed distribution is at work here.

Hash Right Join (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1)
  Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
  -> Hash Right Join (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1)
        Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
        -> Gather (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Parallel Hash Left Join (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3)
                    Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                    -> Parallel Hash Left Join (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3)
                          Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                          -> Parallel Seq Scan on documentinformationsubject documentinformationsubject_2 (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3)
                                Filter: ((participationtypecode)::text = 'AUT'::text)
                                Rows Removed by Filter: 2815562
                          -> Parallel Hash (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3)
                                Buckets: 65536 Batches: 16 Memory Usage: 2336kB
                                -> Parallel Seq Scan on bestname agencyname (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3)
                    -> Parallel Hash (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3)
                          Buckets: 65536 Batches: 1 Memory Usage: 2976kB
                          -> Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3)
        -> Hash (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1)
              Buckets: 1024 Batches: 1 Memory Usage: 41kB
              -> Hash Right Join (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1)
                    Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                    -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1)
                    -> Hash (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1)
                          Buckets: 1024 Batches: 1 Memory Usage: 41kB
                          -> Gather (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1)
                                Workers Planned: 2
                                Workers Launched: 2
                                -> Parallel Hash Join (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3)
                                      Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                      -> Parallel Seq Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3)
                                            Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                            Rows Removed by Filter: 1540625
                                      -> Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3)
                                            Buckets: 1024 Batches: 1 Memory Usage: 104kB
                                            -> Parallel Seq Scan on actrelationship r (cost=0.00..380928.44 rows=1 width=74) (actual time=7489.704..17953.959 rows=79 loops=3)
                                                  Filter: ((typecode)::text = 'SUBJ'::text)
                                                  Rows Removed by Filter: 3433326
  -> Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
        Buckets: 65536 (originally 65536) Batches: 131072 (originally 2) Memory Usage: 189207kB
        -> Gather Merge (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Merge Left Join (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)
                    Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                    -> Sort (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)
                          Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                          Sort Method: external merge Disk: 26936kB
                          Worker 0: Sort Method: external merge Disk: 27152kB
                          Worker 1: Sort Method: external merge Disk: 28248kB
                          -> Parallel Seq Scan on documentinformationsubject (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3)
                                Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                Rows Removed by Filter: 2584355
                    -> Materialize (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)
                          -> Sort (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
                                Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                Sort Method: external merge Disk: 122192kB
                                Worker 0: Sort Method: external merge Disk: 122192kB
                                Worker 1: Sort Method: external merge Disk: 122192kB
                                -> Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)
                                      Filter: ((participationtypecode)::text = 'PRD'::text)
                                      Rows Removed by Filter: 7415579
Planning Time: 2.523 ms
Execution Time: 464825.391 ms
(66 rows)
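
One thing that jumps out at me in this plan is the big Hash node feeding
the outer Hash Right Join: "Buckets: 65536 (originally 65536) Batches:
131072 (originally 2) Memory Usage: 189207kB". If I take Tom's earlier
explanation literally, that each batch keeps a couple of temp files with
an 8 kB I/O buffer each, then my back-of-the-envelope estimate is roughly
131072 batches * 2 files * 8 kB ~ 2 GB of buffer space alone, before any
per-file bookkeeping. That is just my arithmetic, so please correct me if
the per-batch overhead is smaller than that, but it would fit the kind of
growth I am watching.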

*By the way, let me ask: do you have pretty-print functions I can call
with, e.g., the node in ExecProcNode, or the pstate in ExecHashJoin?
Because if there were, we could determine where exactly in the current
plan we are. And can I call the plan printer for the entire plan we are
currently executing? Might it even give us preliminary counts of where in
the process it is? (I ask the latter not only because it would be really
useful for our present debugging, but also because it would be an awesome
tool for monitoring long-running queries! Something I am sure tons of
people would just love to have!)*
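
(What I have been poking at myself, without knowing whether it is the
intended interface, is the debug printer in src/backend/nodes/print.c.
From the attached gdb session something along these lines looks plausible,
with the caveat that pprint() writes to the backend's stdout, so the
output probably lands in the server log rather than in gdb, and these are
only my guesses:)

(gdb) frame 7                              # the ExecSort frame in the backtrace above
(gdb) call pprint(pstate->plan)            # dump this node's Plan subtree
(gdb) frame 14                             # standard_ExecutorRun
(gdb) call pprint(queryDesc->plannedstmt)  # dump the whole PlannedStmt

Of course that would only print the static plan, not the runtime counts,
which is why I am asking whether something better already exists.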

I also read the other responses. I agree that having some swap space
available, just in case, is better than these annoying out-of-memory
errors. And yes, I can add that memory profiler thing if you think it
would actually work. I have done this kind of thing with Java heap dumps,
even upgrading the VM to 32 GB just to crunch the heap dump. But can you
tell me just a little more about how I need to configure this thing to
get the data you want without blowing up memory and disk during this huge
query?
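
(For instance, the one thing I do know how to drive from gdb myself is the
memory context dump; assuming MemoryContextStats() is the entry point you
had in mind, I can capture this at any of the breakpoints above, since it
writes to the backend's stderr and hence the server log:)

(gdb) call MemoryContextStats(TopMemoryContext)

Would snapshots of that, taken as the resident size ratchets up, already
be useful, or is that not the profiler you meant?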

regards,
-Gunther
