Re: snapbuild woes

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Erik Rijkers <er(at)xs4all(dot)nl>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: snapbuild woes
Date: 2017-04-15 03:18:49
Message-ID: 30242bc6-eca4-b7bb-670e-8d0458753a8c@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, here is updated patch (details inline).

On 13/04/17 20:00, Petr Jelinek wrote:
> Thanks for looking at this!
>
> On 13/04/17 02:29, Andres Freund wrote:
>> Hi,
>> On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:
>>
>>> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
>>> Date: Fri, 24 Feb 2017 21:39:03 +0100
>>> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export
>>>
>>> Otherwise the VACUUM or pruning might remove tuples still needed by the
>>> exported snapshot.
>>> ---
>>> src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
>>> index 5529ac8..57c392c 100644
>>> --- a/src/backend/replication/logical/logical.c
>>> +++ b/src/backend/replication/logical/logical.c
>>> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
>>> * the slot machinery about the new limit. Once that's done the
>>> * ProcArrayLock can be released as the slot machinery now is
>>> * protecting against vacuum.
>>> + *
>>> + * Note that we only store the global xmin temporarily in the in-memory
>>> + * state so that the initial snapshot can be exported. After initial
>>> + * snapshot is done global xmin should be reset and not tracked anymore
>>> + * so we are fine with losing the global xmin after crash.
>>> * ----
>>> */
>>> LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>>
>>> slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
>>> slot->data.catalog_xmin = slot->effective_catalog_xmin;
>>> + slot->effective_xmin = slot->effective_catalog_xmin;
>>
>>
>>> void
>>> FreeDecodingContext(LogicalDecodingContext *ctx)
>>> {
>>> + ReplicationSlot *slot = MyReplicationSlot;
>>> +
>>> if (ctx->callbacks.shutdown_cb != NULL)
>>> shutdown_cb_wrapper(ctx);
>>>
>>> + /*
>>> + * Cleanup global xmin for the slot that we may have set in
>>> + * CreateInitDecodingContext().
>>
>> Hm. Is that actually a meaningful point to do so? For one, it gets
>> called by pg_logical_slot_get_changes_guts(), but more importantly, the
>> snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
>> next command? If we rely on the snapshot magic done by ExportSnapshot()
>> it'd be worthwhile to mention that...
>>
>
> (Didn't see the patch for couple of months so don't remember all the
> detailed decisions anymore)
>
> Yes we rely on backend's xmin to be set for exported snapshot. We only
> care about global xmin for exported snapshots really, I assumed that's
> clear enough from "so that the initial snapshot can be exported" but I
> guess there should be more clear comment about this where we actually
> clean this up.
>

Okay, wrote new comment there, how is it now?

>>
>>> We do not take ProcArrayLock or similar
>>> + * since we only reset xmin here and there's not much harm done by a
>>> + * concurrent computation missing that.
>>> + */
>>
>> Hum. I was prepared to complain about this, but ISTM, that there's
>> absolutely no risk because the following
>> ReplicationSlotsComputeRequiredXmin(false); actually does all the
>> necessary locking? But still, I don't see much point in the
>> optimization.
>>
>
> Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
> why we need it here. Please enlighten me.
>

I kept this as it was, after rereading, the
ReplicationSlotsComputeRequiredXmin(false) will do shared lock while if
we wanted to avoid mutex and do the xmin update under lock we'd need to
do exclusive lock so I think it's worth the optimization...

>>
>>
>>> This patch changes the code so that stored snapshots are only used for
>>> logical decoding restart but not for initial slot snapshot.
>>
>> Yea, that's a very good point...
>>
>>> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>>
>>> return false;
>>> }
>>> - /* c) valid on disk state */
>>> - else if (SnapBuildRestore(builder, lsn))
>>> + /* c) valid on disk state and not exported snapshot */
>>> + else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
>>> + SnapBuildRestore(builder, lsn))
>>
>> Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
>> interface, where it'd strictly speaking not be required, right?
>>
>
> Good point. Maybe we should really tell snapshot builder if the snapshot
> is going to be exported or not explicitly (see the rant all the way down).
>

I added the new signaling mechanism (the new boolean option indicating
if we are building full snapshot which is only set when the snapshot is
exported or used by the transaction).

>>
>>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
>>> Date: Sun, 26 Feb 2017 01:07:33 +0100
>>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
>>
>> A bit more commentary would be good. What does that protect us against?
>>
>
> I think I explained that in the email. We might export snapshot with
> xmin smaller than global xmin otherwise.
>

Updated commit message with explanation as well.

>>
>>
>>> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
>>> Date: Wed, 22 Feb 2017 00:57:33 +0100
>>> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder
>>>
>>> Due to race condition, the xl_running_xacts might contain no longer
>>> running transactions. Previous coding tried to get around this by
>>> additional locking but that did not work correctly for committs. Instead
>>> try combining decoded commits and multiple xl_running_xacts to get the
>>> consistent snapshot.
>>
>> Needs more explanation about approach.
>>
>>
>>> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>> * simply track the number of in-progress toplevel transactions and
>>> * lower it whenever one commits or aborts. When that number
>>> * (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
>>> - * to CONSISTENT.
>>> + * to CONSISTENT. Sometimes we might get xl_running_xacts which has
>>> + * all tracked transactions as finished. We'll need to restart tracking
>>> + * in that case and use previously collected committed transactions to
>>> + * purge transactions mistakenly marked as running in the
>>> + * xl_running_xacts which exist as a result of race condition in
>>> + * LogStandbySnapshot().
>>
>> I'm not following this yet.
>>
>
> Let me try to explain:
> We get xl_running_xacts with txes 1,3,4. But the 1 already committed
> before so the decoding will never see it and we never get snapshot. Now
> at some point we might get xl_running_xact with txes 6,7,8 so we know
> that all transactions from the initial xl_running_xacts must be closed.
> We restart the tracking here from beginning as if this was first
> xl_running_xacts we've seen, with the exception that we look into past
> if we seen the 6,7,8 transactions already as finished, then we'll mark
> them as finished immediately (hence avoiding the issue with transaction
> 6 being already committed before xl_running_xacts was written).
>
>>
>>> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>> * b) first encounter of a useable xl_running_xacts record. If we had
>>> * found one earlier we would either track running transactions (i.e.
>>> * builder->running.xcnt != 0) or be consistent (this function wouldn't
>>> - * get called).
>>> + * get called). However it's possible that we could not see all
>>> + * transactions that were marked as running in xl_running_xacts, so if
>>> + * we get new one that says all were closed but we are not consistent
>>> + * yet, we need to restart the tracking while taking previously seen
>>> + * transactions into account.
>>
>> This needs to revise the preceding comment more heavily. "This is the
>> first!!! Or maybe not!" isn't easy to understand.
>>
>
> Yeah, I found it bit hard to make it sound correct and not confusing,
> even wondered if I should split this code to two because of that but it
> would lead into quite a bit of code duplication, dunno if that's better.
> Maybe we could move the "reset" code into separate function to avoid
> most of the duplication.

Rewrote and moved this comment to it's own thing.

>
>>
>>> */
>>> - else if (!builder->running.xcnt)
>>> + else if (!builder->running.xcnt ||
>>> + running->oldestRunningXid > builder->running.xmax)
>>
>> Isn't that wrong under wraparound?
>>
>
> Right, should use TransactionIdFollows.
>

Fixed.

>>
>>> {
>>> int off;
>>> + bool first = builder->running.xcnt == 0;
>>>
>>> /*
>>> * We only care about toplevel xids as those are the ones we
>>> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>> builder->running.xmin = builder->running.xip[0];
>>> builder->running.xmax = builder->running.xip[running->xcnt - 1];
>>>
>>> +
>>> /* makes comparisons cheaper later */
>>> TransactionIdRetreat(builder->running.xmin);
>>> TransactionIdAdvance(builder->running.xmax);
>>>
>>> builder->state = SNAPBUILD_FULL_SNAPSHOT;
>>>
>>> - ereport(LOG,
>>> - (errmsg("logical decoding found initial starting point at %X/%X",
>>> - (uint32) (lsn >> 32), (uint32) lsn),
>>> - errdetail_plural("%u transaction needs to finish.",
>>> - "%u transactions need to finish.",
>>> - builder->running.xcnt,
>>> - (uint32) builder->running.xcnt)));
>>> -
>>> + /*
>>> + * If this is the first time we've seen xl_running_xacts, we are done.
>>> + */
>>> + if (first)
>>> + {
>>> + ereport(LOG,
>>> + (errmsg("logical decoding found initial starting point at %X/%X",
>>> + (uint32) (lsn >> 32), (uint32) lsn),
>>> + errdetail_plural("%u transaction needs to finish.",
>>> + "%u transactions need to finish.",
>>> + builder->running.xcnt,
>>> + (uint32) builder->running.xcnt)));
>>> + }
>>> + else
>>> + {
>>> + /*
>>> + * Because of the race condition in LogStandbySnapshot() the
>>> + * transactions recorded in xl_running_xacts as running might have
>>> + * already committed by the time the xl_running_xacts was written
>>> + * to WAL. Use the information about decoded transactions that we
>>> + * gathered so far to update our idea about what's still running.
>>> + *
>>> + * We can use SnapBuildEndTxn directly as it only does the
>>> + * transaction running check and handling without any additional
>>> + * side effects.
>>> + */
>>> + for (off = 0; off < builder->committed.xcnt; off++)
>>> + SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
>>> + if (builder->state == SNAPBUILD_CONSISTENT)
>>> + return false;
>>> +
>>> + ereport(LOG,
>>> + (errmsg("logical decoding moved initial starting point to %X/%X",
>>> + (uint32) (lsn >> 32), (uint32) lsn),
>>> + errdetail_plural("%u transaction needs to finish.",
>>> + "%u transactions need to finish.",
>>> + builder->running.xcnt,
>>> + (uint32) builder->running.xcnt)));
>>> + }
>>
>> Hm, this is not pretty.
>>
>

Changed this whole thing to be 2 different code paths with common
function doing the common work.

>
>>
>>> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos(at)pjmodos(dot)net>
>>> Date: Tue, 21 Feb 2017 19:58:18 +0100
>>> Subject: [PATCH 5/5] Skip unnecessary snapshot builds
>>>
>>> When doing initial snapshot build during logical decoding
>>> initialization, don't build snapshots for transactions where we know the
>>> transaction didn't do any catalog changes. Otherwise we might end up
>>> with thousands of useless snapshots on busy server which can be quite
>>> expensive.
>>> ---
>>> src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
>>> 1 file changed, 53 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
>>> index 1a1c9ba..c800aa5 100644
>>> --- a/src/backend/replication/logical/snapbuild.c
>>> +++ b/src/backend/replication/logical/snapbuild.c
>>> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>> bool forced_timetravel = false;
>>> bool sub_needs_timetravel = false;
>>> bool top_needs_timetravel = false;
>>> + bool skip_forced_snapshot = false;
>>>
>>> TransactionId xmax = xid;
>>>
>>> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>> /*
>>> * We could avoid treating !SnapBuildTxnIsRunning transactions as
>>> * timetravel ones, but we want to be able to export a snapshot when
>>> - * we reached consistency.
>>> + * we reached consistency so we need to keep track of them.
>>> */
>>> forced_timetravel = true;
>>> elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
>>> +
>>> + /*
>>> + * It is however desirable to skip building new snapshot for
>>> + * !SnapBuildTxnIsRunning transactions as otherwise we might end up
>>> + * building thousands of unused snapshots on busy servers which can
>>> + * be very expensive.
>>> + */
>>> + if (!SnapBuildTxnIsRunning(builder, xid))
>>> + skip_forced_snapshot = true;
>>> }
>>
>> That's pretty crudely bolted on the existing logic, isn't there a
>> simpler way?
>>
>
> Agreed, however, every time I tried to make this prettier I ended up
> either producing subtle bugs (see the initial email in this thread for
> example of that), so I eventually gave up on pretty.
>

Okay, gave it one more try with fresh head, hopefully without new bugs,
what do you think?

> As a side note, my opinion after all this is that it's probably mistake
> to try to use various situational conditions to make sure we are
> building exportable snapshot. ISTM we should tell the snapbuilder
> explicitly that the snapshot will be exported and it should behave
> accordingly based on that. Because for example, we also should track
> aborted transactions in the snapshot which is to be exported because
> otherwise enough of them happening during snapshot building will result
> in export failure due to too big snapshot. But this seems like too
> invasive change to be back-portable.
>

Ended up doing this in the 0002 and also use those changes in 0005, does
not seem to be that bad.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch text/plain 2.9 KB
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch text/plain 8.6 KB
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch text/plain 1.5 KB
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch text/plain 12.5 KB
0005-Skip-unnecessary-snapshot-builds.patch text/plain 6.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2017-04-15 03:58:23 Re: Quorum commit for multiple synchronous replication.
Previous Message Erik Rijkers 2017-04-15 02:47:01 Logical replication - TRAP: FailedAssertion in pgstat.c