Re: Logical decoding on standby

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>, Thom Brown <thom(at)linux(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical decoding on standby
Date: 2017-04-03 05:46:08
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On 31 March 2017 at 12:49, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>> SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>> /*
>>> + * There can be no concurrent writers to oldestCatalogXmin during
>>> + * recovery, so no need to take ProcArrayLock.
>>> + */
>>> + ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>> s/writers/writes/?
> I meant writers, i.e. nothing else can be writing to it. But writes works too.


>>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>> ReplicationSlotsComputeRequiredXmin(true);
>>> + /*
>>> + * If this is the first slot created on the master we won't have a
>>> + * persistent record of the oldest safe xid for historic snapshots yet.
>>> + * Force one to be recorded so that when we go to replay from this slot we
>>> + * know it's safe.
>>> + */
>>> + force_standby_snapshot =
>>> + !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>> s/first slot/first logical slot/. Also, the reference to master doesn't
>> seem right.
> Unsure what you mean re reference to master not seeming right.
> If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
> from the new slot so we need to make sure it gets advanced one we've
> decided on our starting catalog_xmin.

Moved to next patch, will address there.

>>> LWLockRelease(ProcArrayLock);
>>> + /* Update ShmemVariableCache->oldestCatalogXmin */
>>> + if (force_standby_snapshot)
>>> + LogStandbySnapshot();
>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
> So you prefer the prior approach with separate xl_catalog_xmin advance records?
> I don't have much preference; I liked the small code reduction of
> Simon's proposed approach, but it landed up being a bit awkward in
> terms of ordering and locking. I don't think catalog_xmin tracking is
> really closely related to the standby snapshot stuff and it feels a
> bit like it's a tacked-on afterthought where it is now.

This code moved to next patch. But we do need to agree on the best approach.

If we're not going to force a standby snapshot here, then it's
probably better to use the separate xl_catalog_xmin design instead.

>>> /*
>>> * tell the snapshot builder to only assemble snapshot once reaching the
>>> * running_xact's record with the respective xmin.
>>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>> start_lsn = slot->data.confirmed_flush;
>>> }
>>> + EnsureActiveLogicalSlotValid();
>> This seems like it should be in a separate patch, and seperately
>> reviewed. It's code that's currently unreachable (and thus untestable).
> It is reached and is run, those checks run whenever creating a
> non-initial decoding context on master or replica.

Again, moved to next patch.

>>> /*
>>> + * Update our knowledge of the oldest xid we can safely create historic
>>> + * snapshots for.
>>> + *
>>> + * There can be no concurrent writers to oldestCatalogXmin during
>>> + * recovery, so no need to take ProcArrayLock.
>> By now I think is pretty flawed logic, because there can be concurrent
>> readers, that need to be safe against oldestCatalogXmin advancing
>> concurrently.
> You're right, we'll need a lock or suitable barriers here to ensure
> that slot conflict with recovery and startup of new decoding sessions
> doesn't see outdated values. This would be the peer of the
> pg_memory_barrier() above. Or could just take a lock; there's enough
> other locking activity in redo that it should be fine.

Now takes ProcArrayLock briefly.

oldestCatalogXmin is also used in GetOldestSafeDecodingTransactionId,
and there we want to prevent it from being advanced. But on further
thought, relying on oldestCatalogXmin there is actually unsafe; on the
master, we might've already logged our intent to advance it to some
greater value of procArray->replication_slot_catalog_xmin and will do
so as ProcArrayLock is released. On standby we're also better off
relying on procArray->replication_slot_catalog_xmin since that's what
we'll be sending in feedback.

Went back to using replication_slot_catalog_xmin here, with comment

* We don't use ShmemVariableCache->oldestCatalogXmin here because another
* backend may have already logged its intention to advance it to a higher
* value (still <= replication_slot_catalog_xmin) and just be waiting on
* ProcArrayLock to actually apply the change. On a standby
* replication_slot_catalog_xmin is what the walreceiver will be sending in
* hot_standby_feedback, not oldestCatalogXmin.

>>> /*
>>> - * It's important *not* to include the limits set by slots here because
>>> + * It's important *not* to include the xmin set by slots here because
>>> * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>> * were to be included here the initial value could never increase because
>>> * of a circular dependency where slots only increase their limits when
>>> * running xacts increases oldestRunningXid and running xacts only
>>> * increases if slots do.
>>> + *
>>> + * We can include the catalog_xmin limit here; there's no similar
>>> + * circularity, and we need it to log xl_running_xacts records for
>>> + * standbys.
>>> */
>> Those comments seem to need some more heavyhanded reconciliation.
> OK. To me it seems clear that the first refers to xmin, the second to
> catalog_xmin. But after all I wrote it, and the important thing is
> what it says to people who are not me. Will adjust.

Changed to

* We can safely report the catalog_xmin limit for replication slots here
* because it's only used to advance oldestCatalogXmin. Slots'
* catalog_xmin advance does not depend on it so there's no circularity.

>>> *
>>> * Return the current slot xmin limits. That's useful to be able to remove
>>> * data that's older than those limits.
>>> + *
>>> + * For logical replication slots' catalog_xmin, we return both the effective
>> This seems to need some light editing.
> catalog_xmin => catalog_xmins I guess.


>>> /*
>>> * Record an enhanced snapshot of running transactions into WAL.
>>> *
>>> + * We also record the value of procArray->replication_slot_catalog_xmin
>>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>>> + * removing dead catalog tuples below that threshold.
>> I think needs some rephrasing. We're not necessarily about to remove
>> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.
> Agreed

* We also record the value of procArray->replication_slot_catalog_xmin
* obtained from GetRunningTransactionData here. We intend to advance
* ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
* of the new value, which will permit removal of previously-protected dead
* catalog tuples. The standby needs to know about that before any WAL
* records containing such tuple removals could possibly arrive.

>>> +static void
>>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>>> +{
>>> + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>> + if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>>> + || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
>>> + ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>>> + LWLockRelease(ProcArrayLock);
>>> +}
>> Doing TransactionIdPrecedes before ensuring
>> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
>> me as a good idea. Generally, the expression as it stands is hard to
>> understand.
> OK.
> I found other formulations to be long and hard to read. Expressing it
> as "if validity has changed or value has increased" made more sense.
> Agree order should change.

Re-ordered, otherwise left the same.

Could add a comment like

"we must set oldestCatalogXmin if its validity has changed or it is advancing"

but seems rather redundant to the code.

>>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>>> index d25a2dd..a4ecfb7 100644
>>> --- a/src/include/access/transam.h
>>> +++ b/src/include/access/transam.h
>>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>> * aborted */
>>> /*
>>> + * This field is protected by ProcArrayLock except
>>> + * during recovery, when it's set unlocked.
>>> + *
>>> + * oldestCatalogXmin is the oldest xid it is
>>> + * guaranteed to be safe to create a historic
>>> + * snapshot for. See also
>>> + * procArray->replication_slot_catalog_xmin
>>> + */
>>> + TransactionId oldestCatalogXmin;
>> Maybe it'd be better to rephrase that do something like
>> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
>> are removed. That property is used for logical decoding.". or similar?
> Fine by me.
> I'll adjust this per discussion and per a comment Simon made
> separately. Whether we use it right away or not it's worth having it
> updated while it's still freshly in mind.

OK, updated catalog_xmin logging patch attached.

Important fix included: when faking up a RunningTransactions snapshot
in StartupXLOG for replay of shutdown checkpoints, copy the
checkpoint's oldestCatalogXmin so we apply it instead of clobbering
the replica's value. It's kind of roundabout to set this once when we
apply the checkpoint and again via ProcArrayApplyRecoveryInfo, but
it's necessary if we're using xl_running_xacts to carry
oldestCatalogXmin info.

Found another issue too. We log our intention to increase
oldestCatalogXmin in LogStandbySnapshot when we write
xl_running_xacts. We then release ProcArrayLock to re-acquire it
LW_EXCLUSIVE so we can increment oldestCatalogXmin in shmem. But a
checkpoint runs and copies the old oldestCatalogXmin value after we
wrote xlog but before we updated in shmem. On the standby, redo will
apply the new value then clobber it with the old one.

To fix this, take CheckpointLock in LogStandbySnapshot (if not called
during a checkpoint) so we can't have the xl_running_xacts with the
new oldestCatalogXmin land up in WAL before a checkpoint with an older
value. Also take oldestCatalogXmin's value after we've forced
LogStandbySnapshot in a checkpoint.

Extended tests a bit to cover redo on standbys.

Personally I'm not a huge fan of how integrating this with logging
standby snapshots has turned out. It seemed to make sense initially,
but I think the way it works out is more convoluted than necessary for
little benefit. I'll prep an updated version of the
xl_advance_catalog_xmin patch with the same fixes for side by side

Craig Ringer
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
log-catalog-xmin-advances-v4.patch text/x-patch 32.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-04-03 05:53:41 Re: Allow interrupts on waiting standby
Previous Message Rushabh Lathia 2017-04-03 05:26:16 Re: [COMMITTERS] pgsql: Avoid GatherMerge crash when there are no workers.