Re: stress test for parallel workers

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: stress test for parallel workers
Date:	2019-07-23 23:04:40
Message-ID:	20190723230440.GU22387@telsasoft.com
Lists:	pgsql-hackers

On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > > edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw... unless you know a trick, it's "optimized out".
>
> How about something like this:
>
> print errordata[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x27b3790 "< %m %u >", lineno = 41745456,
funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0, context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe8800000001 <Address 0xe8800000001 out of bounds>,
detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of bounds>, message_id = 0x0, schema_name = 0x0,
table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030 "origin.c", lineno = 591,
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I should have remembered that it *was* in fact out of space this morning when
this core was dumped (I hadn't touched it since scheduling the transition to
this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being
held open by worker processes, including parallel workers? Probably WAL
segments, even after they're rotated? If worker processes were holding open
lots of rotated WAL files, that could cause ENOSPC, but it wouldn't be obvious
after they die, since the space would then be freed.
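
To illustrate the mechanism I'm speculating about, here is a minimal sketch (not PostgreSQL code; the function name is made up for the example, and it assumes a POSIX filesystem) showing that a file which has been unlinked but is still held open keeps its data, and therefore its disk space, until the last descriptor is closed:

```python
import os
import tempfile


def unlinked_but_open_stats(nbytes=4096):
    """Write nbytes to a temp file, unlink it while it is still open,
    and report what the open descriptor still sees.

    Returns (link_count, size_in_bytes, path_still_exists).
    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * nbytes)
        f.flush()
        # Remove the directory entry; the inode survives while f is open.
        os.unlink(path)
        st = os.fstat(f.fileno())
        return st.st_nlink, st.st_size, os.path.exists(path)
    # Leaving the with-block closes the descriptor, and only then
    # can the filesystem reclaim the space.


if __name__ == "__main__":
    print(unlinked_but_open_stats())
```

On Linux, files in this state typically show up as "(deleted)" symlink targets under /proc/<pid>/fd, so checking the workers' fd directories while they are still alive would be one way to confirm or rule this out.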

Justin
