BUG #15633: Data loss when reading the data from logical replication slot

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: nitesh(at)datacoral(dot)co
Subject: BUG #15633: Data loss when reading the data from logical replication slot
Date: 2019-02-12 20:19:53
Message-ID: 15633-fcbb5d143a6805b6@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15633
Logged by: Nitesh Yadav
Email address: nitesh(at)datacoral(dot)co
PostgreSQL version: 9.5.10
Operating system: AWS RDS
Description:

Hi,

Postgres Server setup:

The Postgres server is running as an AWS RDS instance.
The server version is PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc
(GCC) 4.8.3 20140911 (Red Hat 4.
In the instance's parameter group, rds.logical_replication is set to 1, which
internally sets the following parameters: wal_level, max_wal_senders,
max_replication_slots, and max_connections.
We are using the test_decoding output plugin to read the WAL data through the
logical decoding mechanism.
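
For reference, a slot of this kind is created roughly as follows (a sketch;
the slot name matches the one used in the queries below):

    SELECT * FROM pg_create_logical_replication_slot(
        'pgldpublic_cdc_slot',   -- slot name referenced below
        'test_decoding'          -- output plugin
    );
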
Application setup:
1. Periodically we run a peek command to retrieve data from the slot, e.g.:

       SELECT * FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL,
           NULL, 'include-timestamp', 'on') LIMIT 200000 OFFSET 0;

2. From the result of that query, we take the location of the last transaction
   and use it to remove the consumed data from the slot, e.g.:

       SELECT location, xid FROM
           pg_logical_slot_get_changes('pgldpublic_cdc_slot', 'B92/C7394678',
           NULL, 'include-timestamp', 'on') LIMIT 1;

3. We run steps 1 and 2 in a loop, reading data in chunks of 200K records at a
   time in a given process (see the sketch below).
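
In other words, one iteration of the loop looks roughly like this (a sketch;
the LSN literal stands in for the location of the last transaction returned by
the peek in that iteration):

    -- Step 1: peek at the next chunk; this decodes changes but does NOT
    -- consume them, so the same data can be read again.
    SELECT location, xid, data
    FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL, NULL,
                                      'include-timestamp', 'on')
    LIMIT 200000 OFFSET 0;

    -- Step 2: after the chunk is safely processed, consume up to the last
    -- transaction's location; get_changes (unlike peek) advances the slot,
    -- so the consumed changes cannot be read again.
    SELECT location, xid
    FROM pg_logical_slot_get_changes('pgldpublic_cdc_slot', 'B92/C7394678',
                                     NULL, 'include-timestamp', 'on')
    LIMIT 1;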

Behavior reported (Bug):

When we have a transaction with more than 300K table changes, we see the
following symptoms:

1. A process (p1) started reading the big transaction (xid = 780807879), i.e.
   its BEGIN and 104413 table changes (DELETE/INSERT).
2. The next process (p2) read 200K records, all of them table changes
   (DELETE/INSERT) belonging to xid 780807879, but no COMMIT for xid =
   780807879.
3. The next process (p3) read 200K records containing no table changes and no
   COMMIT for the same xid = 780807879, although we do see other complete
   transactions (i.e. BEGIN & COMMIT).
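
To double-check, we can peek the slot again and filter for that xid (a
sketch; this re-decodes the slot's entire backlog, so it can be slow):

    SELECT location, xid, data
    FROM pg_logical_slot_peek_changes('pgldpublic_cdc_slot', NULL, NULL,
                                      'include-timestamp', 'on')
    WHERE xid = 780807879;   -- the transaction whose changes went missing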

BUG:
If a transaction (xid = 780807879) was started in p1 and continued in p2 (but
not finished/committed), why doesn't p3 have any records for the same
transaction id?
Did we partially lose the transaction's (xid = 780807879) data?
Did we lose other transactions around the same time?
We are using the above application to replicate production data from the
master to other analytics systems. Let us know if you need further details.
We would appreciate any help in debugging the missing transaction.

Regards,
Nitesh
