Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN

From: Vladimir Gordiychuk <folyga(at)gmail(dot)com>
To: Dave Cramer <pg(at)fastcrypt(dot)com>
Cc: Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com>, List <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN
Date: 2017-04-05 22:16:39
Message-ID: CAFgjRd2ii16o-8xsE40fqJ3PuQzEkYTi1ctMSJuLy3z8ShyUQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

Patch for fix this problem
https://github.com/pgjdbc/pgjdbc/pull/801/commits/4b4a521178b5a3cab63bd586982f6191ce32c240

2017-04-05 23:58 GMT+03:00 Vladimir Gordiychuk <folyga(at)gmail(dot)com>:

> The problem on pgjdbc side. The problem affect not only parallel
> transaction but any transaction that use own implementation output plugin.
> Logical replication for onCommit callback already send LSN end transaction,
> but we still add to this LSN payload. Add payload not correct for startLSN
> for calculate endLSN because logical replication can change size of message
> from WAL and we will calculate not valid LSN(for example LSN for future
> transaction).
> I will prepare a patch for fix this problem today or tomorrow.
>
> 2017-04-04 22:24 GMT+03:00 Vladimir Gordiychuk <folyga(at)gmail(dot)com>:
>
>> Hi.
>>
>> Dave, thanks that invite me.
>>
>> Stafan, I reproduce this issue in PR https://github.com/pgjdbc/p
>> gjdbc/pull/801
>> Right know *lastReceiveLSN* calculates as *startLSN* *+ payloadSize*. We
>> add this changes after this comment https://github.com/pgj
>> dbc/pgjdbc/pull/550#issuecomment-254427102
>> I think we should analyze it comment again. Because if use
>> *lastReceiveLsn* as *startLSN* without add payloadSize your testcase
>> works as you want.
>>
>>
>> 2017-04-04 15:34 GMT+03:00 Dave Cramer <pg(at)fastcrypt(dot)com>:
>>
>>> + Vladimir as he is much more familiar with the replication protocol, etc
>>>
>>>
>>>
>>> Dave Cramer
>>>
>>> davec(at)postgresintl(dot)com
>>> www.postgresintl.com
>>>
>>> On 4 April 2017 at 08:11, Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have been evaluating the logical replication feature in the JDBC
>>>> driver v42.0.0, against postgres v9.6. One scenario I'm testing is
>>>> streaming a high volume of transactions, where each transaction contains
>>>> multiple INSERTs.
>>>>
>>>> Sometimes, two transaction COMMIT messages follow each other, one after
>>>> the other, in the DB transaction log, and so the transactions arrive one
>>>> after the other in the logical replication stream. This is expected
>>>> behaviour.
>>>>
>>>> I apply and flush the LastReceivedLSN and force an update to the server
>>>> after every COMMIT message.
>>>>
>>>> To stress-test the recovery behaviour, I'm also disconnecting and
>>>> reconnecting on every fourth INSERT received.
>>>>
>>>> In most cases I see the following expected behaviour:
>>>>
>>>> The WAL might have (with made-up LSNs for illustrative purposes):
>>>>
>>>> LSN=10 BEGIN (TXN 1)
>>>> LSN=20 BEGIN (TXN 2)
>>>> LSN=30 INSERT (TXN 1)
>>>> LSN=40 INSERT (TXN 2)
>>>> LSN=50 INSERT (TXN 1)
>>>> LSN=60 INSERT (TXN 2)
>>>> LSN=70 INSERT (TXN 1)
>>>> LSN=80 INSERT (TXN 2)
>>>> LSN=90 COMMIT (TXN 1)
>>>> LSN=100 COMMIT (TXN 2)
>>>>
>>>> And so the stream receives:
>>>>
>>>> BEGIN (TXN 1) LastReceivedLSN=19
>>>> INSERT (TXN 1) LastReceivedLSN=39
>>>> INSERT (TXN 1) LastReceivedLSN=59
>>>> INSERT (TXN 1) LastReceivedLSN=79
>>>> COMMIT (TXN 1) LastReceivedLSN=99
>>>> <JDBC APPLY & FLUSH 99, FORCE UPDATE>
>>>> BEGIN (TXN 2) LastReceivedLSN=29
>>>> INSERT (TXN 2) LastReceivedLSN=49
>>>> <JDBC RECONNECT>
>>>> BEGIN (TXN 2) LastReceivedLSN=29
>>>> INSERT (TXN 2) LastReceivedLSN=49
>>>> INSERT (TXN 2) LastReceivedLSN=69
>>>> INSERT (TXN 2) LastReceivedLSN=89
>>>> COMMIT (TXN 2) LastReceivedLSN=109
>>>>
>>>> The above behaviour makes sense since the replication slot's
>>>> confirmed_flush_lsn=99 upon reconnect.
>>>>
>>>> My issue: occasionally after reconnecting, I observe that the INSERTs
>>>> for TXN 2 are not resent and instead the stream moves on to TXN 3.
>>>>
>>>> With the same WAL as above, the stream looks like:
>>>>
>>>> BEGIN (TXN 1) LastReceivedLSN=19
>>>> INSERT (TXN 1) LastReceivedLSN=39
>>>> INSERT (TXN 1) LastReceivedLSN=59
>>>> INSERT (TXN 1) LastReceivedLSN=79
>>>> COMMIT (TXN 1) LastReceivedLSN=109 <-- This LSN is wrong!
>>>> <JDBC APPLY & FLUSH 109, FORCE UPDATE>
>>>> BEGIN (TXN 2) LastReceivedLSN=29
>>>> INSERT (TXN 2) LastReceivedLSN=49
>>>> <JDBC RECONNECT>
>>>> BEGIN (TXN 3) ...
>>>>
>>>> Debugging the JDBC driver and comparing it with the WAL (via
>>>> pg_xlogdump), it seems that on these occasions COMMIT (TXN 1) arrives with
>>>> the correct payload but with StartLSN set to the StartLSN of COMMIT (TXN
>>>> 2)! The JDBC driver computes COMMIT (TXN 1) LastReceivedLSN = COMMIT (TXN
>>>> 2) StartLSN + length of COMMIT (TXN 1) payload. This causes us to
>>>> unwittingly set confirmed_flush_lsn to the end WAL position of COMMIT (TXN
>>>> 2) before disconnecting, meaning I don't decode TXN 2 after reconnection.
>>>>
>>>> Is this a known issue? Is it caused by the JDBC driver or something in
>>>> the server?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>

In response to

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Neha Khatri 2017-04-06 04:04:22 Failure in regression tests due to absence of user 'postgres'
Previous Message Vladimir Gordiychuk 2017-04-05 20:58:29 Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN