Re: logical decoding and replication of sequences

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences
Date: 2022-04-07 18:34:50
Message-ID: 1882b44e-8abd-8f2f-29a3-21453801db15@enterprisedb.com
Lists: pgsql-hackers

On 4/6/22 16:13, Tomas Vondra wrote:
>
>
> On 4/5/22 12:06, Amit Kapila wrote:
>> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> I did some experiments over the weekend, exploring how to rework the
>>> sequence decoding in various ways. Let me share some WIP patches,
>>> hopefully that can be useful for trying more stuff and moving this
>>> discussion forward.
>>>
>>> I tried two things - (1) accumulating sequence increments in global
>>> array and then doing something with it, and (2) treating all sequence
>>> increments as regular changes (in a TXN) and then doing something
>>> special during the replay. Attached are two patchsets, one for each
>>> approach.
>>>
>>> Note: It's important to remember decoding of sequences is not the only
>>> code affected by this. The logical messages have the same issue,
>>> certainly when it comes to transactional vs. non-transactional stuff and
>>> handling of snapshots. Even if the sequence decoding ends up being
>>> reverted, we still need to fix that, somehow. And my feeling is the
>>> solutions ought to be pretty similar in both cases.
>>>
>>> Now, regarding the two approaches:
>>>
>>> (1) accumulating sequences in global hash table
>>>
>>> The main problem with regular sequence increments is that those need to
>>> be non-transactional - a transaction may use a sequence without any
>>> WAL-logging, if the WAL was written by an earlier transaction. The
>>> problem is the earlier transaction might have been rolled back, and thus
>>> simply discarded by the logical decoding. But we still need to apply
>>> that, in order not to lose the sequence increment.
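[Editor's illustration - a hypothetical Python simulation, not PostgreSQL code: nextval() WAL-logs values in batches (SEQ_LOG_VALS is 32 in PostgreSQL), so a later transaction can consume values covered by WAL written earlier, possibly by a transaction that aborts.]

```python
# Simulate batched sequence WAL-logging: one WAL record covers a batch
# of values, so subsequent nextval() calls emit no WAL of their own.
SEQ_LOG_VALS = 32  # matches PostgreSQL's constant; batch size is illustrative

class Sequence:
    def __init__(self):
        self.value = 0             # current sequence value
        self.wal_logged_upto = 0   # highest value covered by WAL so far
        self.wal = []              # (writer_xid, logged_upto) records

    def nextval(self, xid):
        self.value += 1
        if self.value > self.wal_logged_upto:
            # Only this call writes WAL, covering the whole next batch.
            self.wal_logged_upto = self.value + SEQ_LOG_VALS - 1
            self.wal.append((xid, self.wal_logged_upto))
        return self.value

seq = Sequence()
v1 = seq.nextval(xid=700)  # txn 700 writes a WAL record covering values 1..32
v2 = seq.nextval(xid=701)  # txn 701 consumes value 2 with NO new WAL record

# Only txn 700 wrote WAL; if it aborts and decoding discards its changes,
# the state that txn 701's committed value depends on is lost.
assert v1 == 1 and v2 == 2 and seq.wal == [(700, 32)]
```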
>>>
>>> The current code applies those non-transactional increments right after
>>> decoding the increment, but that does not work because we may not have a
>>> snapshot at that point. And we only have a snapshot when within a
>>> transaction (AFAICS), so this patch queues all changes and applies them
>>> later.
>>>
>>> The changes need to be shared by all transactions, so queueing them in a
>>> global structure works fairly well - otherwise we'd have to walk all
>>> transactions to see if there are relevant sequence increments.
>>>
>>> But some increments may be transactional, e.g. when the sequence is
>>> created or altered in a transaction. To allow tracking this, this uses a
>>> hash table, with relfilenode as a key.
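[Editor's illustration - a hypothetical Python sketch of approach (1), not the actual C structures: increments are keyed by relfilenode, and treated as transactional only when the relfilenode was created by the same in-progress transaction.]

```python
# Global hash table keyed by relfilenode. An increment is transactional
# only if the sequence's relfilenode was created in the same transaction
# (e.g. by CREATE SEQUENCE or ALTER SEQUENCE); otherwise it has to be
# applied with the first transaction that commits.
global_increments = {}  # relfilenode -> pending non-transactional increments
created_in_xact = {}    # relfilenode -> xid that created it

def record_increment(relfilenode, xid, last_value):
    if created_in_xact.get(relfilenode) == xid:
        # Transactional: queue inside the creating transaction, to be
        # applied at its commit (or discarded at abort).
        return "transactional"
    global_increments.setdefault(relfilenode, []).append(last_value)
    return "non-transactional"

created_in_xact[16384] = 700  # say xid 700 ran ALTER SEQUENCE
assert record_increment(16384, 700, 40) == "transactional"
assert record_increment(16385, 701, 33) == "non-transactional"
assert global_increments == {16385: [33]}
```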
>>>
>>> There's a couple of issues with this, though. Firstly, because the changes
>>> are stashed outside transactions, they're not included in memory accounting,
>>> not spilled to disk or streamed, etc. I guess fixing this is possible, but
>>> it's certainly not straightforward, because we mix increments from many
>>> different transactions.
>>>
>>> A bigger issue is that I'm not sure this actually handles the snapshots
>>> correctly either.
>>>
>>> The non-transactional increments affect all transactions, so when
>>> ReorderBufferProcessSequences gets executed, it processes all of them,
>>> no matter the source transaction. Can we be sure the snapshot in the
>>> applying transaction is the same (or "compatible") as the snapshot in
>>> the source transaction?
>>>
>>
>> I don't think we can assume that. I think it is possible that some
>> other transaction's WAL can fall in between the start/end LSN of a txn
>> (which we decide to send) which may not finally reach a consistent state.
>> Consider a case similar to shown in one of my previous emails:
>> Session-2:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-1:
>> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
>> 'test_decoding', false, true);
>>
>> Session-3:
>> Begin;
>> SELECT pg_current_xact_id();
>>
>> Session-2:
>> Commit;
>> Begin;
>> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>>
>> Session-3:
>> Commit;
>>
>> Session-2:
>> Commit;
>>
>> Here, we send changes (say insert from txn 700) from session-2 because
>> session-3's commit happens before it. Now, consider another
>> transaction parallel to txn 700 which generates some WAL related to
>> sequences but commits before session-3's commit. So even though its
>> changes fall in between the start/end LSN of txn 700, those
>> shouldn't be sent.
>>
>> I have not tried this and also this may be solvable in some way but I
>> think processing changes from other TXNs sounds risky to me in terms
>> of snapshot handling.
>>
>
> Yes, I know this can happen. I was only really thinking about what might
> happen to the relfilenode of the sequence itself - and I don't think any
> concurrent transaction could swoop in and change the relfilenode in any
> meaningful way, due to locking.
>
> But of course, if we expect/require to have a perfect snapshot for that
> exact position in the transaction, this won't work. IMO the whole idea
> that we can have non-transactional bits in naturally transactional
> decoding seems a bit suspicious (at least in hindsight).
>
> No matter what we do for sequences, though, this still affects logical
> messages too. Not sure what to do there :-(
>
>>>
>>>
>>> (2) treating sequence change as regular changes
>>>
>>> This adopts a different approach - instead of accumulating the sequence
>>> increments in a global hash table, it treats them as regular changes.
>>> Which solves the snapshot issue, and issues with spilling to disk,
>>> streaming and so on.
>>>
>>> But it has various other issues with handling concurrent transactions,
>>> unfortunately, which probably make this approach infeasible:
>>>
>>> * The non-transactional stuff has to be applied in the first transaction
>>> that commits, not in the transaction that generated the WAL. That does
>>> not work too well with this approach, because we have to walk changes in
>>> all other transactions.
>>>
>>
>> Why do you want to traverse other TXNs in this approach? Is it because
>> the current TXN might be using some sequence value that was actually
>> WAL-logged in another transaction, but that other transaction has not
>> been sent yet? I think if we don't send that, then the replica's
>> sequence-generated columns (in some tables) may contain values the
>> sequence itself hasn't reached yet, which sounds problematic. Is that
>> correct?
>>
>
> Well, how else would you get to sequence changes in the other TXNs?
>
> Consider this:
>
> T1: begin
> T2: begin
>
> T2: nextval('s') -> writes WAL for 32 values
> T1: nextval('s') -> gets value without WAL
>
> T1: commit
> T2: commit
>
> Now, if we commit T1 without "applying" the sequence change from T2, we
> lose the sequence state. But we still write/replicate the value
> generated from the sequence.
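[Editor's illustration - a hypothetical Python sketch of the T1/T2 interleaving above: a decoder that replays only the changes belonging to the committing transaction loses the sequence state WAL-logged by the other (here, aborted) transaction, even though the committed row uses a value from it. Even if T2 later commits, the replica's sequence lags behind the replicated rows until then.]

```python
# WAL stream: T2 (xid 702) WAL-logs the sequence up to value 32,
# T1 (xid 701) inserts a row using value 1 without any sequence WAL.
wal = [
    ("seq", 702, 32),     # T2: nextval WAL-logs state up to value 32
    ("insert", 701, 1),   # T1: row using value 1 (no sequence WAL)
    ("commit", 701, None),
    ("abort", 702, None), # T2 rolls back
]

replica_seq = 0
replica_rows = []
pending = {}  # xid -> queued changes

for kind, xid, payload in wal:
    if kind in ("seq", "insert"):
        pending.setdefault(xid, []).append((kind, payload))
    elif kind == "commit":
        for k, p in pending.pop(xid, []):
            if k == "insert":
                replica_rows.append(p)
            else:
                replica_seq = max(replica_seq, p)
    else:  # abort: a naive decoder discards everything, sequence state included
        pending.pop(xid, None)

# The replicated row uses value 1, but the replica's sequence never
# advanced - the increment lived in T2's (discarded) changes.
assert replica_rows == [1] and replica_seq == 0
```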
>
>>> * Another serious issue seems to be streaming - if we already streamed
>>> some of the changes, we can't iterate through them anymore.
>>>
>>> Also, having to walk the transactions over and over for each change, to
>>> apply relevant sequence increments, that's mighty expensive. The other
>>> approach needs to do that too, but walking the global hash table seems
>>> much cheaper.
>>>
>>> The other issue is the handling of aborted transactions - we need to apply
>>> sequence increments even from those transactions, of course. The other
>>> approach has this issue too, though.
>>>
>>>
>>> (3) tracking sequences touched by transaction
>>>
>>> This is the approach proposed by Hannu Krosing. I haven't explored this
>>> again yet, but I recall I wrote a PoC patch a couple months back.
>>>
>>> It seems to me most of the problems stem from trying to derive sequence
>>> state from decoded WAL changes, which is problematic because of the
>>> non-transactional nature of sequences (i.e. WAL for one transaction
>>> affects other transactions in non-obvious ways). And this approach
>>> simply works around that entirely - instead of trying to deduce the
>>> sequence state from WAL, we'd make sure to write the current sequence
>>> state (or maybe just ID of the sequence) at commit time. Which should
>>> eliminate most of the complexity / problems, I think.
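[Editor's illustration - a hypothetical, single-process Python sketch of approach (3): each transaction tracks which sequences it touched, and at commit reads and WAL-logs their current state, rather than trying to reconstruct it from decoded increments.]

```python
seq_state = {"seq1": 0}  # current state of each sequence
touched = {}             # xid -> set of sequences the txn touched

def nextval(name, xid):
    seq_state[name] += 1
    touched.setdefault(xid, set()).add(name)
    return seq_state[name]

def commit(xid, wal):
    # At commit, log the *current* state of every touched sequence.
    # Because we log the state rather than individual increments, bumps
    # done by other (even aborted) transactions are covered too.
    for name in sorted(touched.pop(xid, ())):
        wal.append((name, seq_state[name]))

wal_stream = []
nextval("seq1", xid=700)  # some other transaction bumps the sequence
nextval("seq1", xid=701)
commit(701, wal_stream)   # logs the state including txn 700's bump

assert wal_stream == [("seq1", 2)]
```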
>>>
>>
>> That sounds promising but I haven't thought in detail about that approach.
>>
>
> So, here's a patch doing that. It's a reworked/improved version of the
> patch [1] shared in November.
>
> It seems to be working pretty nicely. The behavior is a little bit
> different, of course, because we only replicate "committed" changes, so
> if you do nextval() in an aborted transaction, that is not replicated.
> Which I think is fine, because we make no durability guarantees for
> aborted transactions anyway.
>
> But there are a couple issues too:
>
> 1) locking
>
> We have to read the sequence state before the commit, but we must not
> allow reordering (because then the state might go backwards again). I'm
> not sure how serious an impact this could have on performance.
>
> 2) dropped sequences
>
> I'm not sure what to do about sequences dropped in the transaction. The
> patch simply attempts to read the current sequence state before the
> commit, but if the sequence was dropped (in that transaction), that
> can't happen. I'm not sure if that's OK or not.
>
> 3) WAL record
>
> To replicate the state, the patch uses a LogicalMessage, but I guess a
> separate WAL record would be better. But that's a technical detail.
>
>
> regards
>
> [1]
> https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com
>
>>>
>>> I'm not really sure what to do about this. All of those reworks seem
>>> like an extensive redesign of the patch, and considering the last CF is
>>> already over ... not great.
>>>
>>
>> Yeah, I share the same feeling that even if we devise solutions to all
>> the known problems, it will require quite some time to ensure everything
>> is correct.
>>
>
> True. Let's keep working on this for a bit more time and then we can
> decide what to do.
>

I've pushed a revert of all the commits related to this - decoding of
sequences and test_decoding / built-in replication changes. The approach
combining transactional and non-transactional behavior implemented by
the patch clearly has issues, and it seems foolish to hope we'll find a
simple fix. So the changes would have to be much more extensive, and
doing that after the last CF seems like an obviously bad idea.

Attached is a rebased patch, implementing the approach based on
WAL-logging sequences at commit time.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
decoding-sequences-tracking-20220407.patch text/x-patch 576.1 KB
