From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Dubious assertion in RegisterDynamicBackgroundWorker
I noticed this recent crash on lorikeet:
The relevant bits of the log seem to be:
2021-05-05 05:36:22.011 EDT [60926716.bb24:1] ERROR: could not map dynamic shared memory segment
2021-05-05 05:36:22.013 EDT [609266c5.b793:4] LOG: background worker "parallel worker" (PID 47908) exited with exit code 1
TRAP: FailedAssertion("BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT", File: "/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 1016)
*** starting debugger for pid 47743, tid 1264
2021-05-05 05:36:26.629 EDT [609266c5.b793:5] LOG: server process (PID 47743) exited with exit code 127
So we had a parallel worker fail to start, whereupon its leader went down
with an assertion failure. I know that the parallel-worker code is held
together with chewing gum and baling wire, but that's a bit much.
Looking at the indicated code, we find
	/*
	 * If this is a parallel worker, check whether there are already too many
	 * parallel workers; if so, don't register another one.  Our view of
	 * parallel_terminate_count may be slightly stale, but that doesn't really
	 * matter: we would have gotten the same result if we'd arrived here
	 * slightly earlier anyway.  There's no help for it, either, since the
	 * postmaster must not take locks; a memory barrier wouldn't guarantee
	 * anything useful.
	 */
	if (parallel && (BackgroundWorkerData->parallel_register_count -
					 BackgroundWorkerData->parallel_terminate_count) >=
		max_parallel_workers)
	{
		Assert(BackgroundWorkerData->parallel_register_count -
			   BackgroundWorkerData->parallel_terminate_count <=
			   MAX_PARALLEL_WORKER_LIMIT);
		...
	}
I would like to know on what grounds that Assert isn't insanity?
We just finished pointing out that we might see an old
parallel_terminate_count, which ISTM says specifically that
parallel_register_count minus parallel_terminate_count might
be larger than expected.
Admittedly, it seems unlikely that the difference could exceed
MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
the limit on number of parallel workers is only 8. What I think is
more likely, given that these counters are unsigned, is that the
difference was actually negative. Which could be a bug, or it could
be an expectable race condition, or it could just be some flakiness
on lorikeet's part (that machine has had a lot of issues lately).
I trawled the buildfarm logs going back 180 days, and found no
other instances of this assertion, which seems to be evidence
in favor of the "lorikeet got flaky" theory. But it's not proof.
In any case, I see zero value in this assertion, so I propose
we remove it. If we don't remove it, it needs serious revision,
because it seems absolutely obvious to me that it could trigger
when there is nothing wrong. A system pushing the limit on the
number of parallel workers would be at considerable risk.
regards, tom lane