Re: "could not reattach to shared memory" on buildfarm member dory

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heath Lord <heath(dot)lord(at)crunchydata(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" on buildfarm member dory
Date: 2018-08-19 02:00:07
Message-ID: 20180819020007.GD3795674@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, May 01, 2018 at 11:31:50AM -0400, Tom Lane wrote:
> Well, at this point the only thing that's entirely clear is that none
> of the ideas I had work. I think we are going to be forced to pursue
> Noah's idea of doing an end-to-end retry. Somebody else will need to
> take point on that; I lack a Windows environment and have already done
> a lot more blind patch-pushing than I like in this effort.

Having tried this, I find a choice between performance and complexity. Both
of my designs use proc_exit(4) to indicate failure to reattach. The simpler,
slower design has WIN32 internal_forkexec() block until the child reports (via
SetEvent()) that it reattached to shared memory. This caused a fivefold
reduction in process creation performance[1]. The less-simple, faster design
stashes the Port structure and retry count in the BackendList entry, which
reaper() uses to retry the fork upon seeing status 4. Notably, this requires
new code for regular backends, for bgworkers, and for others. It's currently
showing a 30% performance _increase_ on the same benchmark; I can't explain
that increase and doubt it will last, but I think it's plausible for the
less-simple design to be performance-neutral.

I see these options:

1. Use the simpler design with a GUC, disabled by default, to control whether
the new code is active. Mention the GUC in a new errhint() for the "could
not reattach to shared memory" error.

2. Like (1), but enable the GUC by default.

3. Like (1), but follow up with a patch to enable the GUC by default in v12
only.

4. In addition to (1), enable retries if the GUC is set _or_ this postmaster
has seen at least one child fail to reattach.

5. Use the less-simple design, with retries enabled unconditionally.

I think I prefer (3), with (1) being a close second. My hesitation on (3) is
that parallel query has made startup time count even if you use a connection
pool, and all the Windows users not needing these retries will see parallel
query become that much slower. I dislike (5) for its impact on
platform-independent postmaster code. Other opinions?

I'm attaching a mostly-finished patch for the slower design. I tested
correctness with -DREATTACH_FORCE_FAIL_PERCENT=99. I'm also attaching a
proof-of-concept patch for the faster design. In this proof of concept, the
postmaster does not close its copy of a backend socket until the backend
exits. Also, bgworkers can change from BGWH_STARTED back to
BGWH_NOT_YET_STARTED; core code tolerates this, but external code may not.
Those would justify paying some performance to fix. The proof of concept
handles bgworkers and regular backends, but it does not handle the startup
process, checkpointer, etc. That doesn't affect benchmarking, of course.

nm

[1] This (2 forks per transaction) dropped from 139tps to 27tps:
echo 'select 1' >script
env PGOPTIONS="--default_transaction_isolation=repeatable\\ read --force_parallel_mode=on" pgbench -T15 -j30 -c30 --connect -n -fscript

Attachment Content-Type Size
w32attach-retry-block-v1.patch text/plain 6.5 KB
w32attach-retry-nonblock-v0.patch text/plain 23.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2018-08-19 02:02:08 Re: Fix for REFRESH MATERIALIZED VIEW ownership error message
Previous Message Jonathan S. Katz 2018-08-19 00:52:45 Re: Fix for REFRESH MATERIALIZED VIEW ownership error message