RE: Time delayed LR (WAS Re: logical replication restrictions)

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: Önder Kalacı <onderkalaci(at)gmail(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "vignesh21(at)gmail(dot)com" <vignesh21(at)gmail(dot)com>, "shveta(dot)malik(at)gmail(dot)com" <shveta(dot)malik(at)gmail(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, "euler(at)eulerto(dot)com" <euler(at)eulerto(dot)com>, "m(dot)melihmutlu(at)gmail(dot)com" <m(dot)melihmutlu(at)gmail(dot)com>, "marcos(at)f10(dot)com(dot)br" <marcos(at)f10(dot)com(dot)br>, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: RE: Time delayed LR (WAS Re: logical replication restrictions)
Date: 2023-06-12 11:39:43
Message-ID: TYAPR01MB586649419227CFDC148CEF75F554A@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear hackers,

At PGcon and other places we have discussed the time-delayed logical replication,
but now we have understood that there are no easy ways. Followings are our analysis.

# Abstract

To implement the time-dealyed logical replication for more proper approach,
the worker must serialize all the received messages into permanent files.
But PostgreSQL does not have good infrastructures for the purpose so huge engineering is needed.

## Review: problem of without-file approach

In the without-file approach, the apply worker process sleeps while delaying the application.
This approach is chosen in earlier versions like [1], but it contains problems which was
shared by Sawada-san [2]. They lead the PANIC error due to the disk full.

A) WALs cannot be recycled on publisher because they are not flushed on subscriber.
B) Moreover, vacuuming cannot remove dead tuples on publisher.

## Alternative approach: serializing messages to files

To prevent any potential issues, the worker should serialize all incoming messages
to a permanent file, like what the physical walreceiver does.
Here, messages are first written into files at the beginning of transactions and then flushed at the end.
This approach could slove problem a), b), but it still has many considerations and difficulties.

### How to separate messages into files?

There are two possibilities for dividing messages into files, but neither of them is ideal.

1. Create a file per received transaction.

In this case files will be removed after the delay-period is exceeded and it is applied.
This is the simplest approach, but the number of files is bloat.

2. Use one large file or segmented file (like WAL).

This can reduce the number of files, but we must consider further things:

A) Purge – We must purge the applied transaction, but we do not have a good way
to remove one transaction from the large file.

B) 2PC – It is more likely to occur that the segment which contains the actual
transaction differs from the segment where COMMIT PREPARED.
Hence the worker must check all the segments to find the actual messages from them.

C) Streamed in-progress transactions - chunks of transactions are separated
into several segments. Hence the worker must check all the segments to find
chunks messages from them, same as above.

### Handle the case when the file exceeds the limitation

Regardless of the option chosen from the ones mentioned above, there is a possibility
that the file size could exceed the file system's limit. This can occur as the
publisher can send transactions of any length.
PostgreSQL provides a mechanism for working with such large files - BufFile data structure,
but it could not be used as-is for several reasons:

A) It only supports the buffered-I/O. A read or write of the low-level File
occurs only when the buffer is filled or emptied. So, we cannot control when it is persisted.

B) It can be used only for temporary purpose. Internally the BufFile creates
some physical files into $PGDATA/base/pgsql_tmp directories, and files in the
subdirectory will be removed when postmaster restarts.

C) It does not have mechanisms for restoring information after the restart.
BufFile contains virtual positions such as file index and offset, but these
fields are stored in a memory structure, so the BufFile will forget the ordering
of files and its initial/final position after restarts.

D) It cannot remove a part of virtual file. Even if a large file is separated
into multiple physical files and all transactions in a physical file are already
applied, BufFile cannot remove only one part.

[1]: https://www.postgresql.org/message-id/f026292b-c9ee-472e-beaa-d32c5c3a2ced%40www.fastmail.com
[2]: https://www.postgresql.org/message-id/CAD21AoAeG2+RsUYD9+mEwr8-rrt8R1bqpe56T2D=euO-Qs-GAg@mail.gmail.com

Acknowledgement:

Amit, Peter, Sawada-san
Thank you for discussing with me off-list.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-06-12 11:53:13 Re: Let's make PostgreSQL multi-threaded
Previous Message Daniel Verite 2023-06-12 09:37:58 Re: Order changes in PG16 since ICU introduction