Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN

From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com>, Vladimir Gordiychuk <folyga(at)gmail(dot)com>
Cc: List <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: Logical Replication: adjacent COMMIT messages with the wrong StartLSN
Date: 2017-04-04 12:34:55
Message-ID: CADK3HHLwTM0iD8eqMn+1fkJ957H-WovUQFhP5GVqz5EgbO_B=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

+ Vladimir as he is much more familiar with the replication protocol, etc

Dave Cramer

davec(at)postgresintl(dot)com
www.postgresintl.com

On 4 April 2017 at 08:11, Stefan Smith <stefan(dot)smith(dot)work(at)gmail(dot)com> wrote:

> Hi,
>
> I have been evaluating the logical replication feature in the JDBC driver
> v42.0.0, against postgres v9.6. One scenario I'm testing is streaming a
> high volume of transactions, where each transaction contains multiple
> INSERTs.
>
> Sometimes, two transaction COMMIT messages follow each other, one after
> the other, in the DB transaction log, and so the transactions arrive one
> after the other in the logical replication stream. This is expected
> behaviour.
>
> I apply and flush the LastReceivedLSN and force an update to the server
> after every COMMIT message.
>
> To stress-test the recovery behaviour, I'm also disconnecting and
> reconnecting on every fourth INSERT received.
>
> In most cases I see the following expected behaviour:
>
> The WAL might have (with made-up LSNs for illustrative purposes):
>
> LSN=10 BEGIN (TXN 1)
> LSN=20 BEGIN (TXN 2)
> LSN=30 INSERT (TXN 1)
> LSN=40 INSERT (TXN 2)
> LSN=50 INSERT (TXN 1)
> LSN=60 INSERT (TXN 2)
> LSN=70 INSERT (TXN 1)
> LSN=80 INSERT (TXN 2)
> LSN=90 COMMIT (TXN 1)
> LSN=100 COMMIT (TXN 2)
>
> And so the stream receives:
>
> BEGIN (TXN 1) LastReceivedLSN=19
> INSERT (TXN 1) LastReceivedLSN=39
> INSERT (TXN 1) LastReceivedLSN=59
> INSERT (TXN 1) LastReceivedLSN=79
> COMMIT (TXN 1) LastReceivedLSN=99
> <JDBC APPLY & FLUSH 99, FORCE UPDATE>
> BEGIN (TXN 2) LastReceivedLSN=29
> INSERT (TXN 2) LastReceivedLSN=49
> <JDBC RECONNECT>
> BEGIN (TXN 2) LastReceivedLSN=29
> INSERT (TXN 2) LastReceivedLSN=49
> INSERT (TXN 2) LastReceivedLSN=69
> INSERT (TXN 2) LastReceivedLSN=89
> COMMIT (TXN 2) LastReceivedLSN=109
>
> The above behaviour makes sense since the replication slot's
> confirmed_flush_lsn=99 upon reconnect.
>
> My issue: occasionally after reconnecting, I observe that the INSERTs for
> TXN 2 are not resent and instead the stream moves on to TXN 3.
>
> With the same WAL as above, the stream looks like:
>
> BEGIN (TXN 1) LastReceivedLSN=19
> INSERT (TXN 1) LastReceivedLSN=39
> INSERT (TXN 1) LastReceivedLSN=59
> INSERT (TXN 1) LastReceivedLSN=79
> COMMIT (TXN 1) LastReceivedLSN=109 <-- This LSN is wrong!
> <JDBC APPLY & FLUSH 109, FORCE UPDATE>
> BEGIN (TXN 2) LastReceivedLSN=29
> INSERT (TXN 2) LastReceivedLSN=49
> <JDBC RECONNECT>
> BEGIN (TXN 3) ...
>
> Debugging the JDBC driver and comparing it with the WAL (via pg_xlogdump),
> it seems that on these occasions COMMIT (TXN 1) arrives with the correct
> payload but with StartLSN set to the StartLSN of COMMIT (TXN 2)! The JDBC
> driver computes COMMIT (TXN 1) LastReceivedLSN = COMMIT (TXN 2) StartLSN +
> length of COMMIT (TXN 1) payload. This causes us to unwittingly set
> confirmed_flush_lsn to the end WAL position of COMMIT (TXN 2) before
> disconnecting, meaning I don't decode TXN 2 after reconnection.
>
> Is this a known issue? Is it caused by the JDBC driver or something in the
> server?
>
> Thanks.
>

In response to

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Gamel Anton J. 2017-04-04 13:12:02 solved: Re: cannot install JDBC with ORACLE jdk1.8.0
Previous Message Stefan Smith 2017-04-04 12:11:23 Logical Replication: adjacent COMMIT messages with the wrong StartLSN