| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
| Subject: | Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process |
| Date: | 2026-05-14 21:47:26 |
| Message-ID: | CAD21AoBTfEaw8qfu2YExtQEZ8TNEJ9y2b=v5wan=h5bZ+EDVfQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > Dear Sawada-san,
> >
> > 01.05.2026 01:08, Masahiko Sawada wrote:
> >
> > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > I was wondering why that failure is the only one of this kind on the
> > buildfarm (in the last two years, at least), so I've tried to reproduce
> > it on REL_18_STABLE... and failed.
> >
> > Then I've bisected it on the master branch and found (your) commit that
> > introduced this behavior: 67c20979c from 2025-12-23.
> >
> > I've confirmed that this race condition is present from v15 through
> > master. In v14, we have the procsignal barrier code but don't use it
> > anywhere. In v18 and older, it could happen when executing DROP
> > DATABASE, DROP TABLESPACE, etc., whereas on master it can happen in
> > more cases because we use the procsignal barrier in more places. In
> > any case, if a process emits a signal barrier while another process is
> > between initializing slot->pss_barrierGeneration and initializing
> > slot->pss_pid, the subsequent WaitForProcSignalBarrier() ends up
> > waiting for that process forever. So I think the patch should be
> > backpatched to v15. Please review these patches.
> >
> >
> > Yes, you're right -- it's not reproduced on REL_18_STABLE with
> > test_oat_hooks, which simply starts a postgres node (as many other
> > tests do), but when I tried the full test suite with a sleep inserted
> > before setting pss_pid, I discovered the following vulnerable tests:
> >
> > 030_stats_cleanup_replica_standby.log
> > 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
> > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
> >
> > 033_replay_tsp_drops_standby2_FILE_COPY.log
> > 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
> > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
> >
> > 040_standby_failover_slots_sync_publisher.log
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
> >
> > 002_compare_backups_pitr1.log
> > 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
> > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
> >
> > I've tried my repro with 033_replay_tsp_drops and it really fails on
> > REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
> >
> > FYI, I found that we had a similar report[1] last year; I'm not sure
> > it hit the exact same issue, though.
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
> >
> >
> > Yeah, and probably this one:
> > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
> >
> > By the way, mamba produced the same failure just yesterday:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
> >
> > # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
> > waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
> > pg_ctl: server did not start in time
> > 004_restart_primary.log
> > 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> > ...
> > 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> >
> > The proposed patches make the test pass reliably for me in all affected
> > branches. Thank you for working on this!
> >
>
> Thank you for checking this issue on stable branches too!
>
> Considering that this issue is not very visible in practice and we're
> going to release new minor versions next week, I'm planning to push
> these fixes to master and the back branches after the minor releases.
> That way, we can fix the issue on master relatively soon and have
> enough time to verify that the fix works well on the back branches.
>
While reviewing the patches, I realized that it would be better to use
pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
pg_memory_barrier() where available. I've updated the patches for
master and v18, and slightly adjusted the commit messages.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| REL17_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch | text/x-patch | 2.9 KB |
| REL15_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch | text/x-patch | 2.9 KB |
| master_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch | text/x-patch | 3.0 KB |
| REL16_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch | text/x-patch | 2.9 KB |
| REL18_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch | text/x-patch | 3.0 KB |