Re: Out of Memory errors are frustrating as heck!

From: Gunther <raj(at)gusw(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-22 00:55:03
Message-ID: 2917100e-2849-2ce0-2774-8b75384cc2f9@gusw.net
Lists: pgsql-performance

Now to Justin's patch. First undo Tomas' patch and apply:

$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).
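
(Note to self: if I need to back this patch out again later, it should
reverse cleanly with: patch -R -p1 <../limit-hash-nbatches-v2.patch)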

$ make
$ make install
$ pg_ctl -c restart
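
(The -c on pg_ctl is there so the server processes are allowed to dump
core if they crash.) And just to be sure the restart picked up the
freshly installed binary, a quick timestamp check does not hurt:

$ ls -l $(pg_config --bindir)/postgres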

and go ...

Lots of CPU% again and very limited memory use so far.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11054 postgres 20 0 1302772 90316 58004 R 94.4 1.1 4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres 20 0 1280532 68076 57168 R 97.7 0.9 2:03.54 postgres: parallel worker for PID 11054
11056 postgres 20 0 1280532 67964 57124 S 0.0 0.9 2:08.28 postgres: parallel worker for PID 11054

That's a pretty decent sign so far. A slight increase ... but still
relatively steady ...

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11054 postgres 20 0 1379704 167140 58004 R 95.0 2.1 5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres 20 0 1280532 68076 57168 S 25.6 0.9 2:36.59 postgres: parallel worker for PID 11054
11056 postgres 20 0 1280532 67964 57124 R 61.8 0.9 2:29.65 postgres: parallel worker for PID 11054

aaand memory breaks out ...

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11119 postgres 20 0 1271660 1.0g 1.0g D 0.0 13.4 0:03.10 postgres: parallel worker for PID 11054
11054 postgres 20 0 1380940 1.0g 950508 D 0.0 13.4 6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres 20 0 1271660 884540 882724 D 0.0 11.2 0:02.84 postgres: parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

What happened? Ouch, no space left on the root device. Too much logging?
Maybe the core dump ... The log file content is simple:

2019-04-22 00:07:56.104 UTC [11048] LOG: database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG: database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG: background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL: Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG: terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING: terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG: could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG: could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG: could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG: all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG: database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL: the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG: database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG: redo starts at 3D2/C44FDCF8
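
As an aside, the "increasing nbatch ... because space = ..." lines above
are the patch's instrumentation around the hash table growth logic. In
the stock code the trigger for that growth is the memory check in
ExecHashTableInsert() in src/backend/executor/nodeHash.c, roughly this
(abridged, quoting PostgreSQL 11 from memory):

    /* account for the tuple we just inserted */
    hashtable->spaceUsed += hashTupleSize;
    if (hashtable->spaceUsed > hashtable->spacePeak)
        hashtable->spacePeak = hashtable->spaceUsed;
    /* if we blew past spaceAllowed (based on work_mem), double nbatch */
    if (hashtable->spaceUsed +
        hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
        > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

Each doubling is supposed to push the tuples that no longer belong to the
current batch out to temp files, but the "freed 148 of 10584 tuples" and
"freed 398 of 10379 tuples" lines above show that some of these doublings
free almost nothing.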

OK, it is all because it dumped 3 core dumps; glad that I captured the
top lines of the backend and its 2 workers:

-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054 -- backend
-rw------- 1 postgres postgres  894640128 Apr 22 00:16 core.11118 -- worker 1
-rw------- 1 postgres postgres 1079103488 Apr 22 00:16 core.11119 -- worker 2

And the meltdown starts with "parallel worker" (PID 11119) receiving
SIGSEGV.

So let's get gdb to the task to see what's up:

$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006bd792 in ExecParallelHashJoinNewBatch (
hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127 {
(gdb) bt 8
#0 0x00000000006bd792 in ExecParallelHashJoinNewBatch (
hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:
rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>
Outermost frame: Cannot access memory at address 0x7ffd45fa9c88
source language c.
Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>
Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122 SharedTuplestoreAccessor *inner_tuples;
1123 Barrier *batch_barrier =
1124 &hashtable->batches[batchno].shared->batch_barrier;
1125
1126 switch (BarrierAttach(batch_barrier))
1127 {
1128 case PHJ_BATCH_ELECTING:
1129
1130 /* One backend allocates the hash table. */
1131 if (BarrierArriveAndWait(batch_barrier,

Unfortunately this core file is truncated because the file system ran
out of space. Let's see the others.

$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006bd792 in ExecParallelHashJoinNewBatch (
hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
1127 {
(gdb) bt 5
#0 0x00000000006bd792 in ExecParallelHashJoinNewBatch (
hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>) at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122 SharedTuplestoreAccessor *inner_tuples;
1123 Barrier *batch_barrier =
1124 &hashtable->batches[batchno].shared->batch_barrier;
1125
1126 switch (BarrierAttach(batch_barrier))
1127 {
1128 case PHJ_BATCH_ELECTING:
1129
1130 /* One backend allocates the hash table. */
1131 if (BarrierArriveAndWait(batch_barrier,

Strange, that one must have died very similarly, in the same place; it is also truncated.

$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006bd792 in ExecParallelHashJoinNewBatch (
hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>) at nodeHashjoin.c:1127
1127 {
(gdb)

I don't understand why all of them are at the same location. Doesn't
make any sense to me.
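
One lesson for next time, so the cores do not get truncated again: point
core dumps at a filesystem that actually has room, something along these
lines (run as root; the directory is just a placeholder for wherever I
have space):

$ mkdir -p /somewhere/roomy/cores && chmod 1777 /somewhere/roomy/cores
$ sysctl -w kernel.core_pattern=/somewhere/roomy/cores/core.%e.%p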

But I'll leave it at that right now.

-Gunther
