Re: Logical decoding on standby

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>, Thom Brown <thom(at)linux(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical decoding on standby
Date: 2017-03-31 04:49:59
Message-ID: CAMsr+YHPz1V2WV87xGJAu4h4mUHD=hTpj1Tent5XhOJpOYXPPg@mail.gmail.com
Lists: pgsql-hackers

On 31 March 2017 at 01:16, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>> SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>
>> /*
>> + * There can be no concurrent writers to oldestCatalogXmin during
>> + * recovery, so no need to take ProcArrayLock.
>> + */
>> + ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>
> s/writers/writes/?

I meant "writers", i.e. nothing else can be writing to it. But
"writes" works too.

>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>
>> ReplicationSlotsComputeRequiredXmin(true);
>>
>> + /*
>> + * If this is the first slot created on the master we won't have a
>> + * persistent record of the oldest safe xid for historic snapshots yet.
>> + * Force one to be recorded so that when we go to replay from this slot we
>> + * know it's safe.
>> + */
>> + force_standby_snapshot =
>> + !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>
> s/first slot/first logical slot/. Also, the reference to master doesn't
> seem right.

I'm not sure what you mean about the reference to master not seeming
right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot, so we need to make sure it gets advanced once we've
decided on our starting catalog_xmin.
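
For context, the startup check boils down to something like this (a
sketch only; the errcode and message wording here are illustrative,
not final):

/* refuse to start decoding if no safe threshold is recorded yet */
if (!TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("cannot create decoding context: oldestCatalogXmin not yet set")));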

>> LWLockRelease(ProcArrayLock);
>>
>> + /* Update ShmemVariableCache->oldestCatalogXmin */
>> + if (force_standby_snapshot)
>> + LogStandbySnapshot();
>
> The comment and code don't quite square to me - it's far from obvious
> that LogStandbySnapshot does something like that. I'd even say it's a
> bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it ended up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really that closely related to the standby snapshot stuff, and where
it is now it feels like a tacked-on afterthought.

>> /*
>> * tell the snapshot builder to only assemble snapshot once reaching the
>> * running_xact's record with the respective xmin.
>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>> start_lsn = slot->data.confirmed_flush;
>> }
>>
>> + EnsureActiveLogicalSlotValid();
>
> This seems like it should be in a separate patch, and seperately
> reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and does run; those checks execute whenever we create a
non-initial decoding context, on master or replica.

The failure case is reachable if a replica has hot_standby_feedback
off, or isn't using a physical slot and loses its connection. After
promotion, any slot that existed on that replica (from a file-system
level copy taken when the replica was created) will fail the check. I
agree it's contrived, since we can't yet create and maintain slots on
replicas, which is the main point of the patch.

>> +/*
>> + * Test to see if the active logical slot is usable.
>> + */
>> +static void
>> +EnsureActiveLogicalSlotValid(void)
>> +{
>> + TransactionId shmem_catalog_xmin;
>> +
>> + Assert(MyReplicationSlot != NULL);
>> +
>> + /*
>> + * A logical slot can become unusable if we're doing logical decoding on a
>> + * standby or using a slot created before we were promoted from standby
>> + * to master.
>
> Neither of those is currently possible...

Right. That's because it's part of the foundations for decoding on
standby.

>> If the master advanced its global catalog_xmin past the
>> + * threshold we need it could've removed catalog tuple versions that
>> + * we'll require to start decoding at our restart_lsn.
>> + *
>> + * We need a barrier so that if we decode in recovery on a standby we
>> + * don't allow new decoding sessions to start after redo has advanced
>> + * the threshold.
>> + */
>> + if (RecoveryInProgress())
>> + pg_memory_barrier();
>
> I don't think this is a meaningful locking protocol. It's a bad idea to
> use lock-free programming without need, especially when the concurrency
> protocol isn't well defined.

Yeah. The intended interaction is with recovery conflict handling on
standby, which doesn't look likely to land in this release due to the
patch split/cleanup etc. (Not a complaint.)

Better to just take a brief shared ProcArrayLock.
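
i.e. read the threshold under the lock instead of relying on a
barrier. A sketch of what I have in mind for
EnsureActiveLogicalSlotValid:

TransactionId shmem_catalog_xmin;

/* read under ProcArrayLock so we can't race redo advancing it */
LWLockAcquire(ProcArrayLock, LW_SHARED);
shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
LWLockRelease(ProcArrayLock);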

>> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
>> index cfc3fba..cdc5f95 100644
>> --- a/src/backend/replication/walsender.c
>> +++ b/src/backend/replication/walsender.c
>> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>> * be energy wasted - the worst lost information can do here is give us
>> * wrong information in a statistics view - we'll just potentially be more
>> * conservative in removing files.
>> + *
>> + * We don't have to do any effective_xmin / effective_catalog_xmin testing
>> + * here either, like for LogicalConfirmReceivedLocation. If we received
>> + * the xmin and catalog_xmin from downstream replication slots we know they
>> + * were already confirmed there,
>> */
>> }
>
> This comment reads as if LogicalConfirmReceivedLocation had
> justification for not touching / checking catalog_xmin. But it does.

It touches it; what it doesn't do is test the value and only advance
if the new value is greater, as the xmin path does per the prior
paragraph. Will clarify the comment.
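
The guard applied on the xmin side is roughly this shape (paraphrased
from memory, not the exact code; new_xmin and slot are placeholders):

/* only ever move xmin forwards, never backwards */
if (TransactionIdIsValid(new_xmin) &&
    (!TransactionIdIsValid(slot->data.xmin) ||
     TransactionIdPrecedes(slot->data.xmin, new_xmin)))
    slot->data.xmin = new_xmin;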

>> /*
>> + * Update our knowledge of the oldest xid we can safely create historic
>> + * snapshots for.
>> + *
>> + * There can be no concurrent writers to oldestCatalogXmin during
>> + * recovery, so no need to take ProcArrayLock.
>
> By now I think is pretty flawed logic, because there can be concurrent
> readers, that need to be safe against oldestCatalogXmin advancing
> concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and the startup of new decoding
sessions don't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or we could just take a lock; there's
enough other locking activity in redo that it should be fine.
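
i.e. in the redo path, something like this, where
newOldestCatalogXmin stands in for whatever value the replayed record
carries (sketch):

/*
 * Redo is the only writer, but concurrent readers (new decoding
 * sessions, recovery conflict checks) need a consistent view.
 */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
ShmemVariableCache->oldestCatalogXmin = newOldestCatalogXmin;
LWLockRelease(ProcArrayLock);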

>> /*
>> - * It's important *not* to include the limits set by slots here because
>> + * It's important *not* to include the xmin set by slots here because
>> * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>> * were to be included here the initial value could never increase because
>> * of a circular dependency where slots only increase their limits when
>> * running xacts increases oldestRunningXid and running xacts only
>> * increases if slots do.
>> + *
>> + * We can include the catalog_xmin limit here; there's no similar
>> + * circularity, and we need it to log xl_running_xacts records for
>> + * standbys.
>> */
>
> Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin and the second
to catalog_xmin. But then, I wrote it, and the important thing is what
it says to people who are not me. Will adjust.

>> *
>> * Return the current slot xmin limits. That's useful to be able to remove
>> * data that's older than those limits.
>> + *
>> + * For logical replication slots' catalog_xmin, we return both the effective
>
> This seems to need some light editing.

catalog_xmin => catalog_xmins, I guess.

>> /*
>> * Record an enhanced snapshot of running transactions into WAL.
>> *
>> + * We also record the value of procArray->replication_slot_catalog_xmin
>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>> + * removing dead catalog tuples below that threshold.
>
> I think needs some rephrasing. We're not necessarily about to remove
> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

Agreed.
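
Something along these lines, maybe (just a draft):

/*
 * We also record procArray->replication_slot_catalog_xmin as obtained
 * from GetRunningTransactionData, so standbys know that the master may
 * advance ShmemVariableCache->oldestCatalogXmin up to this value and
 * may then remove dead catalog tuples below that threshold.
 */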

>> +static void
>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>> +{
>> + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> + if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>> + || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
>> + ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>> + LWLockRelease(ProcArrayLock);
>> +}
>
> Doing TransactionIdPrecedes before ensuring
> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
> me as a good idea. Generally, the expression as it stands is hard to
> understand.

OK.

I found the other formulations long and hard to read. Expressing it
as "if validity has changed or the value has increased" made more
sense to me. Agreed the order should change.
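
Maybe something like this keeps that shape while testing validity
before TransactionIdPrecedes (a sketch; intended to be
behaviour-equivalent):

static void
UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
{
    bool        changed;

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    /* has validity changed... */
    changed = TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin)
        != TransactionIdIsValid(pendingOldestCatalogXmin);

    /* ... or has a valid value advanced? */
    if (!changed && TransactionIdIsValid(pendingOldestCatalogXmin) &&
        TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
                              pendingOldestCatalogXmin))
        changed = true;

    if (changed)
        ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;

    LWLockRelease(ProcArrayLock);
}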

>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>> index d25a2dd..a4ecfb7 100644
>> --- a/src/include/access/transam.h
>> +++ b/src/include/access/transam.h
>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>> * aborted */
>>
>> /*
>> + * This field is protected by ProcArrayLock except
>> + * during recovery, when it's set unlocked.
>> + *
>> + * oldestCatalogXmin is the oldest xid it is
>> + * guaranteed to be safe to create a historic
>> + * snapshot for. See also
>> + * procArray->replication_slot_catalog_xmin
>> + */
>> + TransactionId oldestCatalogXmin;
>
> Maybe it'd be better to rephrase that do something like
> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
> are removed. That property is used for logical decoding.". or similar?

Fine by me.

I'll adjust this per the discussion and per a comment Simon made
separately. Whether we use it right away or not, it's worth updating
while it's still fresh in mind.
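
As a rough first draft of the reworded field comment, along the lines
you suggest:

/*
 * oldestCatalogXmin guarantees that no valid catalog tuples with
 * xmin >= oldestCatalogXmin have been removed. Logical decoding
 * relies on this to know it can safely build historic snapshots.
 * Protected by ProcArrayLock, except during recovery.
 */
TransactionId oldestCatalogXmin;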

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
