
Re: Race condition in pcp_node_info can cause it to hang

From:	Emond Papegaaij <emond(dot)papegaaij(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	pgpool-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Race condition in pcp_node_info can cause it to hang
Date:	2026-06-05 11:49:32
Message-ID:	CAGXsc+akuig0oA7dJX5BNFVRn+5miTALRZMnPrrt3kY7ypB+Ew@mail.gmail.com
Lists:	pgpool-hackers

Hi,

Thanks for the quick followup!

Best regards,
Emond

On Fri, 5 Jun 2026 at 01:09, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
> Hi Emond,
>
> > Hi,
> >
> > We've hit another very rare flake in our tests, which can cause
> > pcp_node_info to hang indefinitely. I've analyzed the problem with
> > Claude Code, and it came to the conclusion and (quite small) fix
> > below. Attached is a patch against 4.7.
> >
> > The problem:
> > In inform_node_info() (src/pcp_con/pcp_worker.c), the PCP reply packet
> > reads bi->replication_state and bi->replication_sync_state directly
> > from shared memory twice: once via strlen() to compute the packet
> > length, and once via pcp_write() to write the payload.
> >
> > The streaming-replication check worker rewrites those same
> > shared-memory strings without a lock (it clears them to "" then
> > repopulates them every check cycle and on state transitions,
> > src/streaming_replication/pool_worker_child.c). If the string's length
> > changes between the two reads, the declared wsize no longer matches
> > the bytes actually written, so the PCP byte stream desynchronises. The
> > client then blocks forever in pcp_read() waiting for bytes the server
> > never sends.
> >
> > The fix:
> > Snapshot the two strings into local buffers once, right after bi =
> > pool_get_node_info(i), and use the locals for both the length and the
> > payload, so a single packet is always internally consistent. This
> > matches how every other field in the packet is already handled.
>
> Thank you for the report and fix. Yes, I agree there's a race
> condition between the sr checker process and pcp_node_info. I think
> introducing a lock to protect bi->replication_state and
> bi->replication_sync_state is overkill. The suggested fix seems to be
> the right direction. I will push it after the current release freeze is
> over (it is supposed to finish by the end of today).
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
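
For readers without the attachment, the snapshot idea described in the quoted report can be sketched roughly as follows. This is not the attached patch and not pgpool's actual code: node_info_sketch, STATE_LEN, snapshot() and build_payload() are hypothetical names made up for illustration, and the payload layout (two NUL-terminated strings back to back) is an assumption. The point is only that the length and the bytes are both derived from the same local copy, so they cannot disagree even if the shared string changes in between.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_LEN 32  /* hypothetical size, not pgpool's real constant */

/* Hypothetical stand-in for the shared-memory node info entry. */
typedef struct {
    char replication_state[STATE_LEN];
    char replication_sync_state[STATE_LEN];
} node_info_sketch;

/* Copy a string that another process may be rewriting into a local
 * buffer. The copy is bounded and always NUL-terminated, whatever the
 * writer does while we read. */
static void snapshot(char *dst, size_t dstlen, const volatile char *src)
{
    size_t i;

    assert(dstlen > 0);
    for (i = 0; i + 1 < dstlen && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
}

/* Build the payload from local snapshots only. Returns the payload size
 * (the value a length field would declare), or 0 if out is too small. */
static size_t build_payload(const node_info_sketch *bi, char *out, size_t outlen)
{
    char state[STATE_LEN];
    char sync_state[STATE_LEN];
    size_t n1, n2;

    /* Read shared memory exactly once per field. */
    snapshot(state, sizeof state, bi->replication_state);
    snapshot(sync_state, sizeof sync_state, bi->replication_sync_state);

    /* Length and bytes both come from the same local copy. */
    n1 = strlen(state) + 1;
    n2 = strlen(sync_state) + 1;
    if (n1 + n2 > outlen)
        return 0;

    memcpy(out, state, n1);
    memcpy(out + n1, sync_state, n2);
    return n1 + n2;
}
```

Note that the snapshot itself is still taken without a lock, so it may capture a string mid-update (for example, an empty or partial value). What the sketch guarantees is only that the declared size matches the bytes written, which is what keeps the byte stream in sync.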

In response to

Re: Race condition in pcp_node_info can cause it to hang at 2026-06-04 23:09:32 from Tatsuo Ishii

Responses

Re: Race condition in pcp_node_info can cause it to hang at 2026-06-07 03:14:51 from Tatsuo Ishii
