Re: stress test for parallel workers

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: stress test for parallel workers
Date: 2019-07-23 17:42:22
Message-ID: 20190723174222.GO22387@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though. Did you try syslog?)

Nothing in /var/log/messages (nor dmesg ring).

I enabled abrtd while trying to reproduce it last week. Since you asked I
looked again in messages, and found it'd logged 10 hours ago about this:

(gdb) bt
#0 0x000000395be32495 in raise () from /lib64/libc.so.6
#1 0x000000395be33c75 in abort () from /lib64/libc.so.6
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7 0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8 0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9 0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228

#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = <value optimized out>
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = <value optimized out>
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
shutdown = false
checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0,
oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 2141308}
recptr = <value optimized out>
_logSegNo = <value optimized out>
Insert = <value optimized out>
freespace = <value optimized out>
PriorRedoPtr = <value optimized out>
curInsert = <value optimized out>
last_important_lsn = <value optimized out>
vxids = 0x280afb8
nvxids = 0
__func__ = "CreateCheckPoint"
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
ckpt_performed = false
do_restartpoint = <value optimized out>
flags = 128
do_checkpoint = <value optimized out>
now = 1563781930
elapsed_secs = <value optimized out>
cur_timeout = <value optimized out>
rc = <value optimized out>
local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0,
7864320}}}}
checkpointer_context = 0x27e15d0
__func__ = "CheckpointerMain"

Supposedly it's trying to do this:

| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective). Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.

Justin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Ladhe 2019-07-23 17:48:50 Re: block-level incremental backup
Previous Message Alvaro Herrera 2019-07-23 17:31:53 Re: [bug fix] Produce a crash dump before main() on Windows