Re: logical decoding and replication of sequences

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences
Date: 2022-04-05 10:06:10
Message-ID: CAA4eK1K=AFtDfH5nv+8TvsUbwWKMMdi4ajLwJb7=Xw=R6et2WQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> I did some experiments over the weekend, exploring how to rework the
> sequence decoding in various ways. Let me share some WIP patches,
> hopefully that can be useful for trying more stuff and moving this
> discussion forward.
>
> I tried two things - (1) accumulating sequence increments in global
> array and then doing something with it, and (2) treating all sequence
> increments as regular changes (in a TXN) and then doing something
> special during the replay. Attached are two patchsets, one for each
> approach.
>
> Note: It's important to remember decoding of sequences is not the only
> code affected by this. The logical messages have the same issue,
> certainly when it comes to transactional vs. non-transactional stuff and
> handling of snapshots. Even if the sequence decoding ends up being
> reverted, we still need to fix that, somehow. And my feeling is the
> solutions ought to be pretty similar in both cases.
>
> Now, regarding the two approaches:
>
> (1) accumulating sequences in global hash table
>
> The main problem with regular sequence increments is that those need to
> be non-transactional - a transaction may use a sequence without any
> WAL-logging, if the WAL was written by an earlier transaction. The
> problem is the earlier transaction might have been rolled back, and thus
> simply discarded by the logical decoding. But we still need to apply
> that, in order not to lose the sequence increment.
>
> The current code just applies those non-transactional increments right
> after decoding the increment, but that does not work because we may not
> have a snapshot at that point. And we only have the snapshot when within
> a transaction (AFAICS), so this patch queues all the changes and applies
> them later.
>
> The changes need to be shared by all transactions, so queueing them in a
> global hash table works fairly well - otherwise we'd have to walk all
> transactions, in order to see if there are relevant sequence increments.
>
> But some increments may be transactional, e.g. when the sequence is
> created or altered in a transaction. To allow tracking this, this uses a
> hash table, with relfilenode as a key.
>
> There's a couple of issues with this, though. Firstly, with the changes
> stashed outside transactions, they're not included in memory accounting,
> not spilled to disk or streamed, etc. I guess fixing this is possible, but
> it's certainly not straightforward, because we mix increments from many
> different transactions.
>
> A bigger issue is that I'm not sure this actually handles the snapshots
> correctly either.
>
> The non-transactional increments affect all transactions, so when
> ReorderBufferProcessSequences gets executed, it processes all of them,
> no matter the source transaction. Can we be sure the snapshot in the
> applying transaction is the same (or "compatible") as the snapshot in
> the source transaction?
>

I don't think we can assume that. It is possible for some other
transaction's WAL to fall between the start/end LSN of a txn we decide to
send, even though that other transaction committed before the slot reached
a consistent state. Consider a case similar to the one shown in one of my
previous emails:
Session-2:
Begin;
SELECT pg_current_xact_id();

Session-1:
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
'test_decoding', false, true);

Session-3:
Begin;
SELECT pg_current_xact_id();

Session-2:
Commit;
Begin;
INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);

Session-3:
Commit;

Session-2:
Commit;

Here, we send changes from session-2 (say the insert from txn 700) because
session-3's commit happens before it. Now, consider another transaction
running in parallel to txn 700 which generates some WAL related to
sequences but commits before session-3's commit. Even though its changes
fall between the start/end LSN of txn 700, they shouldn't be sent.

I have not tried this, and it may be solvable in some way, but processing
changes from other TXNs sounds risky to me in terms of snapshot handling.

>
>
> (2) treating sequence change as regular changes
>
> This adopts a different approach - instead of accumulating the sequence
> increments in a global hash table, it treats them as regular changes.
> Which solves the snapshot issue, and issues with spilling to disk,
> streaming and so on.
>
> But it has various other issues with handling concurrent transactions,
> unfortunately, which probably make this approach infeasible:
>
> * The non-transactional stuff has to be applied in the first transaction
> that commits, not in the transaction that generated the WAL. That does
> not work too well with this approach, because we have to walk changes in
> all other transactions.
>

Why do you want to traverse other TXNs in this approach? Is it because the
current TXN might be using a sequence value that was actually WAL-logged by
another transaction which has not been sent yet? If we don't send that WAL,
the replica could end up with sequence values present in table columns
while the sequence itself has not yet reached those values, which sounds
problematic. Is that correct?

> * Another serious issue seems to be streaming - if we already streamed
> some of the changes, we can't iterate through them anymore.
>
> Also, having to walk the transactions over and over for each change, to
> apply relevant sequence increments, that's mighty expensive. The other
> approach needs to do that too, but walking the global hash table seems
> much cheaper.
>
> The other issue is the handling of aborted transactions - we need to apply
> sequence increments even from those transactions, of course. The other
> approach has this issue too, though.
>
>
> (3) tracking sequences touched by transaction
>
> This is the approach proposed by Hannu Krosing. I haven't explored this
> again yet, but I recall I wrote a PoC patch a couple months back.
>
> It seems to me most of the problems stem from trying to derive sequence
> state from decoded WAL changes, which is problematic because of the
> non-transactional nature of sequences (i.e. WAL for one transaction
> affects other transactions in non-obvious ways). And this approach
> simply works around that entirely - instead of trying to deduce the
> sequence state from WAL, we'd make sure to write the current sequence
> state (or maybe just ID of the sequence) at commit time. Which should
> eliminate most of the complexity / problems, I think.
>

That sounds promising but I haven't thought in detail about that approach.

>
> I'm not really sure what to do about this. All of those reworks seems
> like an extensive redesign of the patch, and considering the last CF is
> already over ... not great.
>

Yeah, I share the same feeling: even if we devise solutions to all the
known problems, it will require quite some time to ensure everything is
correct.

--
With Regards,
Amit Kapila.
