RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-04-02 13:55:43
Message-ID: OS0PR01MB571654A90984729DC0A2FEC5943E2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, April 2, 2024 8:49 PM Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > CFbot[1] complained about one query result's order in the tap-test, so I am
> > attaching a V7 patch set which fixed this. There are no changes in 0001.
> >
> > [1] https://cirrus-ci.com/task/6375962162495488
>
> Thanks. Here are some comments:

Thanks for the comments.

>
> 1. Can we just remove pg_logical_replication_slot_advance and use
> LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> pg_logical_replication_slot_advance?
>
> + * Advance our logical replication slot forward. See
> + * LogicalSlotAdvanceAndCheckSnapState for details.
> */
> static XLogRecPtr
> pg_logical_replication_slot_advance(XLogRecPtr moveto)
> {

It was commented[1] that it's not appropriate for the
pg_logical_replication_slot_advance to have an out parameter
'ready_for_decoding' which looks bit awkward as the functionality doesn't match
the name, and is also not consistent with the style of
pg_physical_replication_slot_advance(). So, we added a new function.

>
> 2.
> + if (!ready_for_decoding)
> + {
> + elog(DEBUG1, "could not find consistent point for synced
> slot; restart_lsn = %X/%X",
> + LSN_FORMAT_ARGS(slot->data.restart_lsn));
>
> Can we specify the slot name in the message?

Added.

>
> 3. Also, can the "could not find consistent point for synced slot;
> restart_lsn = %X/%X" be emitted at LOG level just like other messages
> in update_and_persist_local_synced_slot. Although, I see "XXX should
> this be changed to elog(DEBUG1) perhaps?", these messages need to be
> at LOG level as they help debug issues if at all they are hit.

Changed to LOG and reworded the message.

>
> 4. How about using found_consistent_snapshot instead of
> ready_for_decoding? A general understanding is that the synced slots
> are not allowed for decoding (although with this fix, we do that for
> internal purposes), ready_for_decoding looks a bit misleading.

Agreed and renamed.

>
> 5. As far as the test case for this issue is concerned, I'm fine with
> adding one using an INJECTION point because we seem to be having no
> consistent way to control postgres writing current snapshot to WAL.

Since me and my colleagues can reproduce the issue consistently after applying
0002 and it could be rare for concurrent xl_running_xacts to happen, we discussed[2] to
consider adding the INJECTION point after pushing the main fix.

>
> 6. A nit: can we use "fast_forward mode" instead of "fast-forward
> mode" just to be consistent?
> + * logical changes unless we are in fast-forward mode where no changes
> are
>
> 7.
> + /*
> + * We need to access the system tables during decoding to build the
> + * logical changes unless we are in fast-forward mode where no changes
> are
> + * generated.
> + */
> + if (slot->data.database != MyDatabaseId && !fast_forward)
>
> May I know if we need this change for this fix?

The slotsync worker needs to advance the slots from different databases in
fast_forward. So, we need to skip this check in fast_forward mode. The analysis can
be found in [3].

Attach the V8 patch which addressed above comments.

[1] https://www.postgresql.org/message-id/CAA4eK1%2BwkaRi2BrLLC_0gKbHN68Awc9dRp811G3An6A6fuqdOg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/ZgvI9iAUWCZ17z5V%40ip-10-97-1-34.eu-west-3.compute.internal
[3] https://www.postgresql.org/message-id/CAJpy0uCQ2PDCAqcnbdOz6q_ZqmBfMyBpVqKDqL_XZBP%3DeK-1yw%40mail.gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v8-0002-test-the-data-loss-case.patch application/octet-stream 5.0 KB
v8-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch application/octet-stream 23.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-04-02 13:56:13 Re: Confusing #if nesting in hmac_openssl.c
Previous Message Tom Lane 2024-04-02 13:50:18 Re: Confusing #if nesting in hmac_openssl.c