Re: Minimal logical decoding on standbys

From: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2023-01-10 08:33:21
Message-ID: 6f8503b7-93d6-e78d-1d61-09cfec4d59c9@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> 0002:
>
>
>> +/*
>> + * Helper for InvalidateConflictingLogicalReplicationSlot -- acquires the given slot
>> + * and mark it invalid, if necessary and possible.
>> + *
>> + * Returns whether ReplicationSlotControlLock was released in the interim (and
>> + * in that case we're not holding the lock at return, otherwise we are).
>> + *
>> + * This is inherently racy, because we release the LWLock
>> + * for syscalls, so caller must restart if we return true.
>> + */
>> +static bool
>> +InvalidatePossiblyConflictingLogicalReplicationSlot(ReplicationSlot *s, TransactionId xid)
>
> This appears to be a near complete copy of InvalidatePossiblyObsoleteSlot(). I
> don't think we should have two versions of that non-trivial code. Seems we
> could just have an additional argument for InvalidatePossiblyObsoleteSlot()?

Good point, done in V37 attached.
The new logical slot invalidation handling has been "merged" with the obsolete LSN case into 2 new
functions (InvalidateObsoleteOrConflictingLogicalReplicationSlots() and InvalidatePossiblyObsoleteOrConflictingLogicalSlot()),
removing InvalidateObsoleteReplicationSlots() and InvalidatePossiblyObsoleteSlot().

>
>
>> + ereport(LOG,
>> + (errmsg("invalidating slot \"%s\" because it conflicts with recovery", NameStr(slotname))));
>> +
>
> I think this should report more details, similar to what
> InvalidateObsoleteReplicationSlots() does.
>

Agree, done in V37 (adding more details about the xid horizon and wal_level < logical on master cases).

>
>> --- a/src/backend/replication/logical/logicalfuncs.c
>> +++ b/src/backend/replication/logical/logicalfuncs.c
>> @@ -216,11 +216,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>>
>> /*
>> * After the sanity checks in CreateDecodingContext, make sure the
>> - * restart_lsn is valid. Avoid "cannot get changes" wording in this
>> + * restart_lsn is valid or both xmin and catalog_xmin are valid.
>> + * Avoid "cannot get changes" wording in this
>> * errmsg because that'd be confusingly ambiguous about no changes
>> * being available.
>> */
>> - if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
>> + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)
>> + || (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
>> + && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin)))
>> ereport(ERROR,
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> errmsg("can no longer get changes from replication slot \"%s\"",
>
> Hm. Feels like we should introduce a helper like SlotIsInvalidated() instead
> of having this condition in a bunch of places.
>

Agree, LogicalReplicationSlotIsInvalid() has been added in V37.

>> + if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
>> + && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> + errmsg("cannot read from logical replication slot \"%s\"",
>> + cmd->slotname),
>> + errdetail("This slot has been invalidated because it was conflicting with recovery.")));
>> +
>
> This is a more precise error than the one in
> pg_logical_slot_get_changes_guts().
>
> I think both places should output the same error.

Agree, done in V37 attached.

> ISTM that the relevant code
> should be in CreateDecodingContext(). Imo the code to deal with the WAL
> version of this has been misplaced...
>

Looks like a good idea. I'll start a dedicated thread to move the already existing
error reporting code part of pg_logical_slot_get_changes_guts() and StartLogicalReplication() into CreateDecodingContext().

>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -3477,6 +3477,10 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
>>
>> GET_VXID_FROM_PGPROC(procvxid, *proc);
>>
>> + /*
>> + * Note: vxid.localTransactionId can be invalid, which means the
>> + * request is to signal the pid that is not running a transaction.
>> + */
>> if (procvxid.backendId == vxid.backendId &&
>> procvxid.localTransactionId == vxid.localTransactionId)
>> {
>
> I can't really parse the comment.
>

Looks like it's there since a long time ago (before I started working on this thread).
I did not pay that much attention to it, but now that you say it I'm not
sure why it as been previously added.

Given that 1) I don't get it too and 2) that this comment is the only one modification in this file then V37 just removes it.

>> @@ -500,6 +502,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>> PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>> true);
>> +
>> + if (isCatalogRel)
>> + InvalidateConflictingLogicalReplicationSlots(locator.dbOid, snapshotConflictHorizon);
>> }
>
> Might be worth checking if wal_level >= logical before the somewhat expensive
> InvalidateConflictingLogicalReplicationSlots().
>

Good point, done in V37.

>
>> @@ -3051,6 +3054,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>> case PROCSIG_RECOVERY_CONFLICT_LOCK:
>> case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>> case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>> + case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
>> + /*
>> + * For conflicts that require a logical slot to be invalidated, the
>> + * requirement is for the signal receiver to release the slot,
>> + * so that it could be invalidated by the signal sender. So for
>> + * normal backends, the transaction should be aborted, just
>> + * like for other recovery conflicts. But if it's walsender on
>> + * standby, then it has to be killed so as to release an
>> + * acquired logical slot.
>> + */
>> + if (am_cascading_walsender &&
>> + reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
>> + MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
>> + {
>> + RecoveryConflictPending = true;
>> + QueryCancelPending = true;
>> + InterruptPending = true;
>> + break;
>> + }
>>
>> /*
>> * If we aren't in a transaction any longer then ignore.
>
> Why does the walsender need to be killed? I think it might just be that
> IsTransactionOrTransactionBlock() might return false, even though we want to
> cancel. The code actually seems to only cancel (QueryCancelPending is set
> rather than ProcDiePending), but the comment talks about killing?
>

Oh right, this comment is also there since a long time ago. I think the code
is OK (as we break in that case and so we don't go through IsTransactionOrTransactionBlock()).

So, V37 just modifies the comment.

Please find attached, V37 taking care of:

0001: commit message modifications and renaming VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING
to VISIBILITYMAP_IS_CATALOG_REL (It does not touch the other remarks as they are still discussed in [1]).

0002: All your remarks mentioned above.

I'll look at the ones you've done in [2] on 0003, 0004 and 0006.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/5c5151a6-a1a3-6c38-7d68-543c9faa22f4%40gmail.com
[2]: https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet%40awork3.anarazel.de

Attachment Content-Type Size
v37-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch text/plain 7.5 KB
v37-0005-Doc-changes-describing-details-about-logical-dec.patch text/plain 2.1 KB
v37-0004-New-TAP-test-for-logical-decoding-on-standby.patch text/plain 20.4 KB
v37-0003-Allow-logical-decoding-on-standby.patch text/plain 11.5 KB
v37-0002-Handle-logical-slot-conflicts-on-standby.patch text/plain 32.4 KB
v37-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch text/plain 33.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2023-01-10 08:33:53 Re: [PATCH] random_normal function
Previous Message Ankit Kumar Pandey 2023-01-10 08:31:52 Re: Todo: Teach planner to evaluate multiple windows in the optimal order