Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN

From: Vladimir Gordiychuk <folyga(at)gmail(dot)com>
To: Dave Cramer <pg(at)fastcrypt(dot)com>
Cc: Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com>, List <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN
Date: 2017-04-04 19:24:51
Message-ID: CAFgjRd2YZ3N8-mtGPAYj_RY7t1MCxH_UPvR53sD-znn-b0V_iA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

Hi.

Dave, thanks that invite me.

Stafan, I reproduce this issue in PR https://github.com/pgjdbc/
pgjdbc/pull/801
Right know *lastReceiveLSN* calculates as *startLSN* *+ payloadSize*. We
add this changes after this comment https://github.com/
pgjdbc/pgjdbc/pull/550#issuecomment-254427102
I think we should analyze it comment again. Because if use *lastReceiveLsn*
as *startLSN* without add payloadSize your testcase works as you want.

2017-04-04 15:34 GMT+03:00 Dave Cramer <pg(at)fastcrypt(dot)com>:

> + Vladimir as he is much more familiar with the replication protocol, etc
>
>
>
> Dave Cramer
>
> davec(at)postgresintl(dot)com
> www.postgresintl.com
>
> On 4 April 2017 at 08:11, Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com>
> wrote:
>
>> Hi,
>>
>> I have been evaluating the logical replication feature in the JDBC driver
>> v42.0.0, against postgres v9.6. One scenario I'm testing is streaming a
>> high volume of transactions, where each transaction contains multiple
>> INSERTs.
>>
>> Sometimes, two transaction COMMIT messages follow each other, one after
>> the other, in the DB transaction log, and so the transactions arrive one
>> after the other in the logical replication stream. This is expected
>> behaviour.
>>
>> I apply and flush the LastReceivedLSN and force an update to the server
>> after every COMMIT message.
>>
>> To stress-test the recovery behaviour, I'm also disconnecting and
>> reconnecting on every fourth INSERT received.
>>
>> In most cases I see the following expected behaviour:
>>
>> The WAL might have (with made-up LSNs for illustrative purposes):
>>
>> LSN=10 BEGIN (TXN 1)
>> LSN=20 BEGIN (TXN 2)
>> LSN=30 INSERT (TXN 1)
>> LSN=40 INSERT (TXN 2)
>> LSN=50 INSERT (TXN 1)
>> LSN=60 INSERT (TXN 2)
>> LSN=70 INSERT (TXN 1)
>> LSN=80 INSERT (TXN 2)
>> LSN=90 COMMIT (TXN 1)
>> LSN=100 COMMIT (TXN 2)
>>
>> And so the stream receives:
>>
>> BEGIN (TXN 1) LastReceivedLSN=19
>> INSERT (TXN 1) LastReceivedLSN=39
>> INSERT (TXN 1) LastReceivedLSN=59
>> INSERT (TXN 1) LastReceivedLSN=79
>> COMMIT (TXN 1) LastReceivedLSN=99
>> <JDBC APPLY & FLUSH 99, FORCE UPDATE>
>> BEGIN (TXN 2) LastReceivedLSN=29
>> INSERT (TXN 2) LastReceivedLSN=49
>> <JDBC RECONNECT>
>> BEGIN (TXN 2) LastReceivedLSN=29
>> INSERT (TXN 2) LastReceivedLSN=49
>> INSERT (TXN 2) LastReceivedLSN=69
>> INSERT (TXN 2) LastReceivedLSN=89
>> COMMIT (TXN 2) LastReceivedLSN=109
>>
>> The above behaviour makes sense since the replication slot's
>> confirmed_flush_lsn=99 upon reconnect.
>>
>> My issue: occasionally after reconnecting, I observe that the INSERTs for
>> TXN 2 are not resent and instead the stream moves on to TXN 3.
>>
>> With the same WAL as above, the stream looks like:
>>
>> BEGIN (TXN 1) LastReceivedLSN=19
>> INSERT (TXN 1) LastReceivedLSN=39
>> INSERT (TXN 1) LastReceivedLSN=59
>> INSERT (TXN 1) LastReceivedLSN=79
>> COMMIT (TXN 1) LastReceivedLSN=109 <-- This LSN is wrong!
>> <JDBC APPLY & FLUSH 109, FORCE UPDATE>
>> BEGIN (TXN 2) LastReceivedLSN=29
>> INSERT (TXN 2) LastReceivedLSN=49
>> <JDBC RECONNECT>
>> BEGIN (TXN 3) ...
>>
>> Debugging the JDBC driver and comparing it with the WAL (via
>> pg_xlogdump), it seems that on these occasions COMMIT (TXN 1) arrives with
>> the correct payload but with StartLSN set to the StartLSN of COMMIT (TXN
>> 2)! The JDBC driver computes COMMIT (TXN 1) LastReceivedLSN = COMMIT (TXN
>> 2) StartLSN + length of COMMIT (TXN 1) payload. This causes us to
>> unwittingly set confirmed_flush_lsn to the end WAL position of COMMIT (TXN
>> 2) before disconnecting, meaning I don't decode TXN 2 after reconnection.
>>
>> Is this a known issue? Is it caused by the JDBC driver or something in
>> the server?
>>
>> Thanks.
>>
>
>

In response to

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Vladimir Gordiychuk 2017-04-05 20:58:29 Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN
Previous Message Álvaro Hernández Tortosa 2017-04-04 16:01:26 Re: RFC: Make new versions of pgjdbc Java8+