Re: stress test for parallel workers

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-07-23 23:32:30
Message-ID: CA+hUKG+mZ=FjC3jyGK94vjJjL+SgO7ocFcF2Pm7MBWo6nuSKzg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> I ought to have remembered that it *was* in fact out of space this AM when this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different. Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log? You mentioned that you couldn't see any signs of the OOM
killer. Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system? You might try typing
this into Python:

x = [42]
for i in range(1000):
    x = x + x

On my non-Linux system, it ran for a while and then was killed, and
dmesg showed:

pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard to distinguish between a web browser
and a program designed to eat memory as fast as possible... Anyway, on
Linux you should see stuff about killed processes and/or OOM in one of
dmesg, syslog, messages.
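
If it helps when checking those logs, here is a minimal sketch of a
filter that picks out lines that look like OOM-killer activity. The
patterns are assumptions based on typical kernel messages (the wording
differs between operating systems and kernel versions), and the
function name find_oom_lines is made up for illustration:

```python
import re

# Assumed patterns: "out of swap space" matches the dmesg output quoted
# above; "oom-killer" and "out of memory" are typical of Linux kernels.
# Adjust these for whatever your system actually logs.
OOM_PATTERN = re.compile(
    r"oom-killer|out of memory|out of swap space", re.IGNORECASE
)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

You could feed it the output of dmesg or the lines of whichever log
file your distribution uses; an empty result only means that none of
the assumed patterns matched, not that no process was killed.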

> But that gives me an idea: is it possible there's an issue with files being
> held opened by worker processes ? Including by parallel workers? Probably
> WALs, even after they're rotated ? If there were worker processes holding
> opened lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create
temporary files. If you're building humongous indexes with parallel
workers, you'll get some of those, but I don't think it'd be more than
you'd get without parallelism. If you were using up all of your disk
space with temporary files, wouldn't this be reproducible? I think
you said you were testing this repeatedly, so if that were the problem
I'd expect to see some non-panicky out-of-space errors when the temp
files blow out your disk space, and only rarely a panic if a
checkpoint happens to run exactly at a moment where the CREATE INDEX
hasn't yet written the byte that breaks the camel's back, but the
checkpoint pushes it over the edge in one of these places where it
panics on failure.
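
One way to test the temporary-file theory while the index build is
running would be to sample how much space the temp files take. This is
only a rough sketch: it assumes temp files live in directories named
pgsql_tmp somewhere below the data directory you pass in, it does not
follow symlinks (so other tablespaces would need to be passed in
separately), and the function name temp_file_bytes is hypothetical:

```python
import os


def temp_file_bytes(data_dir):
    """Sum the sizes of files at or below any pgsql_tmp directory."""
    total = 0
    for root, dirs, files in os.walk(data_dir):
        # Skip directories that are not inside a pgsql_tmp directory.
        if "pgsql_tmp" not in root.split(os.sep):
            continue
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Temp files can disappear between listing and stat.
                pass
    return total
```

Calling it in a loop alongside df during the build would show whether
temp file growth lines up with the disk filling.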

--
Thomas Munro
https://enterprisedb.com
