| From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: stress test for parallel workers |
| Date: | 2019-07-23 23:04:40 |
| Message-ID: | 20190723230440.GU22387@telsasoft.com |
| Lists: | pgsql-hackers |
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
> > > > edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw, unless someone knows a trick, it's "optimized out".
>
> How about something like this:
>
> print errorData[errordata_stack_depth]
Clever.
(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x27b3790 "< %m %u >", lineno = 41745456,
funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0, context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe8800000001 <Address 0xe8800000001 out of bounds>,
detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint = 0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of bounds>, message_id = 0x0, schema_name = 0x0,
table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030 "origin.c", lineno = 591,
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810 "postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
message = 0x27b0e40 "could not write to file \"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...
I ought to have remembered that it *was* in fact out of space this morning when
this core was dumped (I hadn't touched it since scheduling the transition to
this VM last week).
I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.
But that gives me an idea: is it possible there's an issue with files being
held open by worker processes, including parallel workers? Probably WALs,
even after they're rotated? If worker processes were holding open lots of
rotated WALs, that could cause ENOSPC, but it wouldn't be obvious after they
die, since the space would then be freed.
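One way to test that idea on Linux would be to look for file descriptors that still point at unlinked files while the workers are alive. A minimal sketch, assuming a Linux /proc filesystem, where the symlink target of such a descriptor ends in " (deleted)"; the function name find_deleted_open_files is made up for illustration, and reading other users' processes needs sufficient privileges:

```python
import os


def find_deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, target) for each open descriptor whose file was unlinked.

    Assumes the Linux /proc layout: <proc_root>/<pid>/fd/<n> is a symlink
    whose target ends in " (deleted)" once the underlying file is removed.
    """
    results = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in sorted(fds, key=int):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.endswith(" (deleted)"):
                results.append((int(entry), int(fd), target))
    return results


if __name__ == "__main__":
    for pid, fd, target in find_deleted_open_files():
        print(pid, fd, target)
```

This would have to run while the suspect processes still exist, since the space is released as soon as they exit. It only catches files that were actually removed; a WAL segment that was recycled by renaming would not show up. Where lsof is installed, "lsof +L1" reports similar information.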
Justin