Re: stress test for parallel workers

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-07-23 22:42:55
Message-ID: 20190723224255.GT22387@telsasoft.com
Lists: pgsql-hackers

On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > edata = <value optimized out>
>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno.

As you saw... unless you know a trick, it's "optimized out".
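
For next time, I might rebuild without optimization so edata isn't thrown
away; a rough sketch, assuming a from-source build (prefix and core file
paths hypothetical):

$ ./configure CFLAGS="-O0 -ggdb" --prefix=/usr/local/pgsql
$ make && make install

and then, after the next crash:

$ gdb /usr/local/pgsql/bin/postgres /path/to/core
(gdb) frame 2
(gdb) print *edata
(gdb) print edata->saved_errno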

> Could it have been fleetingly full due to some other thing happening on the
> system that rapidly expands and contracts?

It's not impossible, especially while loading data, since data_dir is only
64GB; it may have happened that way sometimes, but it's hard to believe
that's been the case 5-10 times now. Provided I don't forget to drop the
previously loaded database before loading old/historic data, there should be
~40GB free on data_dir, and no clients connected other than pg_restore.

$ df -h /var/lib/pgsql
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data-postgres
64G 26G 38G 41% /var/lib/pgsql
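
If it really is a transient full condition, a crude loop like this running
alongside the restore might catch it in the act (log path arbitrary):

$ while sleep 1; do date; df /var/lib/pgsql; done >>/tmp/df.log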

> | ereport(PANIC,
> | (errcode_for_file_access(),
> | errmsg("could not write to file \"%s\": %m",
> | tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective). Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

I realized this probably is *not* an issue with zfs, since it's failing to log
(for one reason or another) to /var/lib/pgsql (ext4).

> Perhaps it would be clearer what's going on if you could put the PostgreSQL
> log onto a different filesystem, so we get a better chance of collecting
> evidence?

I didn't mention it, but last weekend I'd left a loop around the restore
process running overnight, and had convinced myself the issue didn't recur
once their faulty blade was taken out of service... My plan was to leave the
server running in the foreground with logging_collector=no, which I hope is
enough, unless logging is itself somehow implicated. I'm trying to stress
test that way now.
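
Concretely, something along these lines, with stderr pointed at a filesystem
other than the data directory's (data dir and log paths hypothetical):

$ postgres -D /var/lib/pgsql/11/data -c logging_collector=no 2>/tmp/postgres-stderr.log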

> But then... the parallel leader process was apparently able
> to log something -- maybe it was just lucky, but you said this
> happened this way more than once. I'm wondering how it could be that
> you got some kind of IO failure and weren't able to log the PANIC
> message AND your postmaster was killed, and you were able to log a
> message about that. Perhaps we're looking at evidence from two
> unrelated failures.

The messages from the parallel leader (building indices) were visible to the
client, not in the server log. I was loading their data, and the errors were
visible when pg_restore failed.

On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
> Just by the way, parallelism in CREATE INDEX is controlled by
> max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

Thank you.
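
For the restore workflow here, I suppose that translates to something like
this, since pg_restore connects via libpq (db and dump names hypothetical):

$ PGOPTIONS='-c max_parallel_maintenance_workers=4' pg_restore -d mydb mydump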

Justin
