From: José Neves <rafaneves3(at)msn(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: CDC/ETL system on top of logical replication with pgoutput, custom client
Date: 2023-07-29 23:07:24
Message-ID: PR3P193MB04917D0140AD7A7ACFAA9D2F8907A@PR3P193MB0491.EURP193.PROD.OUTLOOK.COM
Lists: pgsql-hackers

Hi there, hope to find you well.

I'm attempting to develop a CDC/ETL system on top of Postgres (currently version 12, latest minor release) with a custom client, and I'm running into data loss caused by out-of-order logical replication messages.
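
For concreteness, the consumer looks roughly like the sketch below (Python with psycopg2; the DSN, the slot name "cdc_slot", the publication "cdc_pub", and handle() are placeholders, not the real code):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=mydb user=replicator",  # hypothetical DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(
    slot_name="cdc_slot",  # hypothetical slot, pre-created with pgoutput
    options={"proto_version": "1", "publication_names": "cdc_pub"},
    decode=False,  # pgoutput payloads are binary; we decode them ourselves
)

def consume(msg):
    # msg.data_start is the LSN of this message, msg.payload the raw
    # pgoutput message (Begin/Relation/Insert/Update/Delete/Commit/...).
    handle(msg.data_start, msg.payload)  # application-specific placeholder

cur.consume_stream(consume)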

The problem is as follows: Postgres streams logical replication events A, B, D, G, K, I, P. Upon an exit signal, we stop consuming new events at LSN K and wait 30s for out-of-order events. Say that we only got A (and K, of course); in the following 30s we receive B and D, but, for whatever reason, G never arrives. Since with pgoutput-based logical replication we have no way to calculate the next LSN, we have no idea that G is missing, so we assume everything arrived, commit K to the Postgres slot, and shut down. On the next run, our worker starts receiving data from K forward, and G is lost forever...
Meanwhile, Postgres moves forward with archiving, so we can't go back to check whether we lost anything. And even if we could, it would be extremely inefficient.
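
To make the failure concrete, here is a simplified sketch of that shutdown path (read_message/send_feedback are psycopg2's replication cursor methods; received_lsns and the 30s grace period mirror the description above). Note that nothing in it can detect the missing G, because nothing tells us which LSNs should exist below K:

import time

def shutdown(cur, received_lsns, stop_lsn, grace_seconds=30):
    # Keep draining for a grace period, hoping late events show up.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        msg = cur.read_message()  # non-blocking; None if nothing pending
        if msg is not None:
            received_lsns.add(msg.data_start)
        else:
            time.sleep(0.1)
    # Flawed step: we have no way to enumerate the LSNs that *should*
    # exist below stop_lsn, so a gap (G in the example) is invisible.
    cur.send_feedback(flush_lsn=stop_lsn)  # commits K; G is lost forever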

In sum, the issue stems from the fact that Postgres streams events with unordered LSNs on highly transactional systems, and that pgoutput doesn't expose enough information to calculate the next or last LSN. We therefore have no way to verify that we received all the data we were supposed to receive, and we risk committing an offset that we shouldn't, because data preceding it may not have arrived yet.
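
For reference, the only LSNs a pgoutput client ever sees inside the payloads are the ones carried by the Begin and Commit messages (protocol version 1 layout); a decoding sketch:

import struct

def extract_lsns(payload: bytes):
    # Begin:  'B' + final LSN (Int64) + commit timestamp (Int64) + xid (Int32)
    # Commit: 'C' + flags (Int8) + commit LSN (Int64) + end LSN (Int64)
    #         + commit timestamp (Int64)
    kind = payload[0:1]
    if kind == b"B":
        final_lsn, _ts, _xid = struct.unpack_from(">QqI", payload, 1)
        return ("begin", final_lsn)
    if kind == b"C":
        _flags, commit_lsn, end_lsn, _ts = struct.unpack_from(">BQQq", payload, 1)
        return ("commit", end_lsn)
    return (None, None)  # other message types carry no transaction LSNs

Neither field lets the client predict the LSN of the next message it should receive, so a gap cannot be inferred from the stream alone.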

It seems very odd to me that none of the open-source CDC projects I looked into care about this. They always assume that the next LSN received is... well, the next one, and commit it, so upon restart they are vulnerable to the same issue (see the sketch below). So either I'm missing something... or we have a generalized assumption causing data loss under certain conditions all over the place.
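
The pattern I mean is essentially this (paraphrased, not any specific project's code; process() is a placeholder):

def on_message(msg):
    process(msg.payload)  # application-specific placeholder
    # Naive assumption: everything at or below this LSN has arrived,
    # so it is "safe" to let the slot advance past it.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

If any earlier message is still in flight when that feedback is sent, a restart from the slot's confirmed_flush_lsn skips it, which is exactly the failure described above.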

Am I missing any Postgres mechanism that would allow me to at least detect that data is missing?

Thanks in advance for any clues on how to deal with this. It has been driving me nuts.


Regards,
José Neves
