Re: 035_standby_logical_decoding unbounded hang

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 035_standby_logical_decoding unbounded hang
Date: 2024-02-14 15:31:16
Message-ID: ZczcxIOes4s6B7QA@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> Coincidentally, one of my buildfarm animals hanged several weeks in a
> different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS
> reduction was part of making it reproducible:
>
> On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > > If you look at the buildfarm's failures page and filter down to
> > > > just subscriptionCheck failures, what you find is that all of the
> > > > last 6 such failures are in 031_column_list.pl:
>
> > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > > reported a replacement BgWriterDelay value reproducing it.
> >
> > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> > BgWriterDelay.
>
> I'm reusing this thread just in case there's overlap with the
> 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is
> a race condition arising from an event sequence like this:
>
> - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> - Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS.
> - checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS.
> - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS.
> - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
>
> Other test code already has a solution for this, so the attached patches add a
> timeout and copy the existing solution. I'm also attaching the hack that
> makes it 100% reproducible.

Thanks!

I did a few tests and confirm that the proposed solution fixes the corner case.

standby-slot-test-1-timeout-v1.patch LGTM.

Regarding standby-slot-test-2-race-v1.patch:

> +# See corresponding create_logical_slot_on_standby() code.
> +$node_standby->poll_query_until(
> + 'postgres', qq[
> + SELECT restart_lsn IS NOT NULL
> + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
> + ]) or die "timed out waiting for logical slot to calculate its restart_lsn";
> +

What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
and then make use of it in create_logical_slot_on_standby() and above? (something
like wait_for_restart_lsn_calculation-v1.patch attached).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
wait_for_restart_lsn_calculation-v1.patch text/x-diff 2.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2024-02-14 16:35:13 Add trim_trailing_whitespace to editorconfig file
Previous Message Tom Lane 2024-02-14 15:04:49 Re: Fix a typo in pg_rotate_logfile