Re: stress test for parallel workers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-07-23 22:03:25
Message-ID: CA+hUKGLch1bNWdG-G8YaeJbyVsper6hG86Ugx9tSWG3=a1R89Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> edata = <value optimized out>
> elevel = 22
> oldcontext = 0x27e15d0
> econtext = 0x0
> __func__ = "errfinish"
> #3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
> save_errno = <value optimized out>
> tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
> path = 0x9c4300 "pg_logical/replorigin_checkpoint"
> tmpfd = 64
> i = <value optimized out>
> magic = 307747550
> crc = 4294967295
> __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> | ereport(PANIC,
> | (errcode_for_file_access(),
> | errmsg("could not write to file \"%s\": %m",
> | tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective). Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno. If the errno is EIO, it would be
strange for that not to show up in some form in the kernel logs or
dmesg; if it's ENOSPC, I guess it'd be normal for nothing to appear
there, and there'd also be nothing in the PostgreSQL logs if they're
on the same full filesystem, but then you would probably already have
mentioned that your filesystem was out of space. Could it have been
fleetingly full because of something else on the system that rapidly
expands and contracts?
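
Something like this (untested, off the top of my head; your frame 2
shows edata optimized out, so the static errordata array in elog.c
may be the more reliable thing to poke at, assuming the debugger can
see the file-scope static):

    (gdb) frame 2
    (gdb) print *edata
    (gdb) print edata->saved_errno
    (gdb) print 'elog.c'::errordata[0].saved_errno

saved_errno is the errno captured when the ereport was raised, so
that number should tell us whether we're looking at EIO, ENOSPC or
something else entirely.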

I'm confused by the evidence, though. If this PANIC is the origin of
the problem, how do we get to a postmaster-death-based exit in a
parallel leader*, rather than quickdie() (the kind of exit that
happens when the postmaster sends SIGQUIT to every process and they
all say "terminating connection because of crash of another server
process", because some backend crashed or panicked)? Perhaps it would
be clearer what's going on if you could put the PostgreSQL log onto a
different filesystem, so we get a better chance of collecting
evidence. But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened the same way more than once. I'm wondering how you could
have hit some kind of I/O failure that prevented the PANIC message
from being logged AND had your postmaster killed, yet still have been
able to log a message about the latter. Perhaps we're looking at
evidence from two unrelated failures.

*I suspect that the only thing implicating parallelism in this
failure is that parallel leaders happen to print that message when
the postmaster dies while they are waiting for workers; most other
code paths (in probably every other backend in your cluster) just
exit quietly. That tells us something about what's happening, but on
its own it doesn't tell us that parallelism plays an important role
in the failure mode.
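
For the record, the leader-side check is roughly shaped like this (an
illustrative paraphrase from memory, not the exact code in
parallel.c), whereas most other postmaster-death checks just
proc_exit() without logging anything:

    /*
     * Roughly what the parallel leader does while waiting for its
     * workers to shut down (illustrative sketch, not the real source).
     */
    for (i = 0; i < pcxt->nworkers_launched; ++i)
    {
        BgwHandleStatus status;

        status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);

        /*
         * If the postmaster died we can't clean up safely, so bail out
         * with a FATAL message; this is how a parallel leader ends up
         * leaving a trace in the log where other backends don't.
         */
        if (status == BGWH_POSTMASTER_DIED)
            ereport(FATAL,
                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
                     errmsg("postmaster exited during a parallel transaction")));
    }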

--
Thomas Munro
https://enterprisedb.com
