Re: Minimal logical decoding on standbys

From: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2023-04-07 15:13:13
Message-ID: 1e923731-ed09-3621-c8b9-61f938b4c355@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 4/7/23 9:50 AM, Andres Freund wrote:
> Hi,
> Here's my current working state - I'll go to bed soon.

Thanks a lot for this Andres!

>
> Changes:
>
> - shared catalog relations weren't handled correctly, because the dboid is
> InvalidOid for them. I wrote a test for that as well.
>
> - ReplicationSlotsComputeRequiredXmin() took invalidated logical slots into
> account (ReplicationSlotsComputeLogicalRestartLSN() too, but it never looks
> at logical slots)
>
> - I don't think the subset of slot xids that were checked when invalidating
> was right. We need to check effective_xmin and effective_catalog_xmin - the
> latter was using catalog_xmin.
>
> - similarly, it wasn't right that specifically those two fields were
> overwritten when invalidated - as that was done, I suspect the changes might
> get lost on a restart...
>
> - As mentioned previously, I did not like all the functions in slot.h, nor
> their naming. Not yet quite finished with that, but a good bit further
>
> - There were a lot of unrelated changes, e.g. removing comments like
> * NB - this runs as part of checkpoint, so avoid raising errors if possible.
>
> - I still don't like the order of the patches, fixing the walsender patches
> after introducing support for logical decoding on standby. Reordered.
>
> - I don't think logical slots being invalidated as checked e.g. in
> pg_logical_replication_slot_advance()
>
> - I didn't like much that InvalidatePossiblyObsoleteSlot() switched between
> kill() and SendProcSignal() based on the "conflict". There very well could
> be reasons to use InvalidatePossiblyObsoleteSlot() with an xid from outside
> of the startup process in the future. Instead I made it differentiate based
> on MyBackendType == B_STARTUP.
>

Thanks for all of this and the above explanations.

>
> I also:
>
> Added new patch that replaces invalidated_at with a new enum, 'invalidated',
> listing the reason for the invalidation.

Yeah, that's a great idea.

> I added a check for !invalidated to
> ReplicationSlotsComputeRequiredLSN() etc.
>

looked at 65-0001 and it looks good to me.

> Added new patch moving checks for invalid logical slots into
> CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
> no sense. As far as I can tell the old message in
> pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
> "never previously reserved WAL"
>

looked at 65-0002 and it looks good to me.

> Split "Handle logical slot conflicts on standby." into two. I'm not sure that
> should stay that way, but it made it easier to hack on
> InvalidateObsoleteReplicationSlots.
>

looked at 65-0003 and the others.

It's easier to understand/read the code now that the ReplicationSlotInvalidationCause
enum has been created and that data.invalidated also make use of the enum. It does "simplify"
the review and that looks good to me.

>
> Todo:
> - write a test that invalidated logical slots stay invalidated across a restart

Done in 65-66-0008 attached.

> - write a test that invalidated logical slots do not lead to retaining WAL

I'm not sure how to do that since pg_switch_wal() and friends can't be executed on
a standby.

> - Further evolve the API of InvalidateObsoleteReplicationSlots()
> - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
> - rename xid to snapshotConflictHorizon, that'd be more in line with the
> ResolveRecoveryConflictWithSnapshot and easier to understand, I think
>

Done. The new API can be found in v65-66-InvalidateObsoleteReplicationSlots_API.patch
attached. It propagates the cause to InvalidatePossiblyObsoleteSlot() where a switch/case
can now be used. The "default" case does not emit an error since this code runs as part
of checkpoint.

> - The test could stand a bit of cleanup and consolidation
> - No need to start 4 psql processes to do 4 updates, just do it in one
> safe_psql()

Right, done in v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch attached.

> - the sequence of drop_logical_slots(), create_logical_slots(),
> change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
> repeated quite a few times

grouped in reactive_slots_change_hfs_and_wait_for_xmins() in 65-66-0008 attached.

> - the stats queries checking for specific conflict counts, including
> preceding tests, is pretty painful. I suggest to reset the stats at the
> end of the test instead (likely also do the drop_logical_slot() there).

Good idea, done in 65-66-0008 attached.

> - it's hard to correlate postgres log and the tap test, because the slots
> are named the same across all tests. Perhaps they could have a per-test
> prefix?

Good point. Done in 65-66-0008 attached. Thanks to that and the stats reset the
check for invalidation is now done in a single function "check_for_invalidation" that looks
for invalidation messages in the logfile and in pg_stat_database_conflicts.

Thanks for the suggestions: the TAP test is now easier to read/understand.

> - numbering tests is a PITA, I had to renumber the later ones, when adding a
> test for shared catalog tables

Yeah, sorry about that, it has been fixed in V63.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v65-66-InvalidateObsoleteReplicationSlots_API.patch text/plain 5.7 KB
v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch text/plain 27.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2023-04-07 15:29:44 Re: [PoC] pg_upgrade: allow to upgrade publisher node
Previous Message Andres Freund 2023-04-07 15:04:13 Re: Making background psql nicer to use in tap tests