Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Date: 2026-05-14 21:47:26
Message-ID: CAD21AoBTfEaw8qfu2YExtQEZ8TNEJ9y2b=v5wan=h5bZ+EDVfQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > Dear Sawada-san,
> >
> > 01.05.2026 01:08, Masahiko Sawada wrote:
> >
> > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > I was wondering why that failure is the only one of its kind on the
> > buildfarm (in the last two years, at least), so I tried to reproduce
> > it on REL_18_STABLE... and failed.
> >
> > Then I bisected it on the master branch and found (your) commit that
> > introduced this behavior: 67c20979c from 2025-12-23.
> >
> > I've confirmed that this race condition issue is present from v15 to
> > master. In v14, we have the procsignal barrier code but don't use it
> > anywhere. In v18 and older, it can happen when executing DROP
> > DATABASE, DROP TABLESPACE, etc., whereas on master it can happen in
> > more cases since we use procsignal barriers in more places. In any
> > case, if a process emits a signal barrier while another process is
> > between initializing slot->pss_barrierGeneration and initializing
> > slot->pss_pid, the subsequent WaitForProcSignalBarrier() ends up
> > waiting for that process forever. So I think the patch should be
> > backpatched to v15. Please review these patches.
> >
> >
> > Yes, you're right -- it's not reproduced on REL_18_STABLE with
> > test_oat_hooks, which simply starts a postgres node (as many other
> > tests do), but when I tried the full test suite with the sleep
> > inserted before setting pss_pid, I discovered the following
> > vulnerable tests:
> >
> > 030_stats_cleanup_replica_standby.log
> > 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
> > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
> >
> > 033_replay_tsp_drops_standby2_FILE_COPY.log
> > 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
> > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
> >
> > 040_standby_failover_slots_sync_publisher.log
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
> >
> > 002_compare_backups_pitr1.log
> > 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
> > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
> >
> > I've tried my repro with 033_replay_tsp_drops, and it indeed fails on
> > REL_15_STABLE..master but doesn't fail on REL_14_STABLE.
> >
> > FYI, I found that we had a similar report [1] last year; I'm not sure
> > it hit the exact same issue, though.
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
> >
> >
> > Yeah, and probably this one:
> > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
> >
> > By the way, mamba produced the same failure just yesterday:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
> >
> > # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
> > waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
> > pg_ctl: server did not start in time
> > 004_restart_primary.log
> > 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> > ...
> > 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> >
> > The proposed patches make the test pass reliably for me in all affected
> > branches. Thank you for working on this!
> >
>
> Thank you for checking this issue on stable branches too!
>
> Considering that this issue is not very visible in practice and we're
> going to release new minor versions next week, I'm planning to push
> these fixes to master and the back branches after the minor releases.
> That way, we can fix the issue on master relatively soon and have
> enough time to verify that the fix works well on the back branches.
>
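
To recap the failure mode for readers skimming the thread, the
problematic interleaving looks roughly like this (a simplified sketch
against procsignal.c, field names as on master):

    New process (ProcSignalInit)      Other process
    ----------------------------      -------------
    read psh_barrierGeneration == G
    pss_barrierGeneration = G
                                      EmitProcSignalBarrier():
                                        psh_barrierGeneration = G + 1
                                        scan slots: pss_pid is still 0
                                        here, so no SIGUSR1 is sent
    pss_pid = MyProcPid
                                      WaitForProcSignalBarrier():
                                        sees pss_pid != 0 but
                                        pss_barrierGeneration == G < G + 1
                                        -> waits forever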

While reviewing the patches, I realized that it would be better to use
pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
pg_memory_barrier() where available. I've updated the patches for
master and v18 accordingly, and slightly adjusted the commit messages.
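
For the archives, the ordering the fix aims for in ProcSignalInit()
looks roughly like this (a simplified sketch, not the actual patch --
please see the attachments for that):

    /*
     * Publish pss_pid with full-barrier semantics *before* reading the
     * shared generation.  Paired with the atomic generation bump in
     * EmitProcSignalBarrier(), either the emitter sees our PID and
     * signals us, or we read its new generation and absorb it here,
     * closing the lost-signal window sketched above.
     */
    pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);

    pg_atomic_write_u64(&slot->pss_barrierGeneration,
                        pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));

On the back branches, where pg_atomic_write_membarrier_u32() is not
applicable, the write is spelled as a plain write of pss_pid followed
by pg_memory_barrier().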

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
REL17_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch text/x-patch 2.9 KB
REL15_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch text/x-patch 2.9 KB
master_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch text/x-patch 3.0 KB
REL16_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch text/x-patch 2.9 KB
REL18_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch text/x-patch 3.0 KB
