Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2020-12-08 10:16:05
Message-ID: CAA4eK1KHwkYXj0Jy3RXsffPGevNeRjSf2KDKvfNrWumNQeSemQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Yes, I observed this same behavior.
>
> IIUC the only way for the tablesync worker to go from CATCHUP mode to
> SYNCDONE is via the call to process_sync_tables.
>
> But a side-effect of this is, when messages arrive during this CATCHUP
> phase one tx will be getting handled by the tablesync worker before
> the process_sync_tables() is ever encountered.
>
> I have created and attached a simple patch which allows the tablesync
> to detect if there is anything to do *before* it enters the apply main
> loop. Calling process_sync_tables() before the apply main loop offers
> a quick way out so the message handling will not be split
> unnecessarily between the workers.
>

Yeah, this demonstrates the idea can work, but as mentioned in my
previous email [1] it needs much more work to make the COPY and the
later fetching of changes from the publisher consistent with each
other. So, let me summarize the discussion so far. We wanted to
enhance the tablesync phase of Logical Replication to enable decoding
of prepared transactions [2]. The problem was that when we stream
prepared transactions to the tablesync worker, it will simply commit
them, due to the requirement of maintaining a single transaction for
the entire duration of the copy and the streaming of transactions
afterward. We can't simply disable the decoding of prepared xacts for
tablesync workers because that can skip some of the prepared xacts
forever on the subscriber, as explained in one of the emails above
[3]. Now, while investigating solutions to enhance tablesync to
support decoding at prepare time, I found that due to the current
design of tablesync we can see partial data of transactions on the
subscriber, which is also explained, with an example, in an email
above [4]. This visibility problem has been there since Logical
Replication was introduced in PostgreSQL, and the only answer I have
gotten till now is that there doesn't seem to be any other
alternative, which I think is not true, and I have provided one
alternative as well.

Next, we discussed three different solutions, all of which solve the
first problem (allowing the tablesync worker to decode transactions at
prepare time), and one of which also solves the second problem
(partial transaction data visibility).

Solution-1: Allow the tablesync worker to use multiple transactions.
The reason for doing it in a single transaction today is that if we
commit after the initial COPY and then crash while streaming changes
of other transactions, the state of the table won't be known after the
restart: as we are using a temporary slot, we don't know from where to
restart syncing the table.
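
To illustrate the difference, at the replication-protocol level the
two kinds of slot creation look roughly like this (the slot names
below are made up; the option keywords are ones the walsender already
accepts, and USE_SNAPSHOT requires running inside a REPEATABLE READ
transaction, as the copy does today):

    -- Current behaviour: a TEMPORARY slot is dropped on
    -- crash/disconnect, and the restart position is lost with it.
    CREATE_REPLICATION_SLOT pg_sync_tmp TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;

    -- Solution-1: a permanent slot survives a crash, and its position
    -- (visible in pg_replication_slots) tells us where to resume from.
    CREATE_REPLICATION_SLOT pg_sync_perm LOGICAL pgoutput USE_SNAPSHOT;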

IIUC, we primarily need to do two things to achieve multiple
transactions. One is to have an additional state in the catalog (say
"catchup") which will say that the initial copy is done. Then we need
a permanent slot, through which we can track the progress of the sync,
so that after a restart (due to crash, connection break, etc.) we can
start from the appropriate position. Now, this will allow us to do
less work after recovering from a crash because we will know the
restart point. As Craig mentioned, it also allows the sync slot to
advance, freeing any held upstream resources before the whole sync is
done, which is good if the upstream is busy and generating lots of
WAL. Finally, committing as we go means we won't exceed the cid
(command counter) increment limit in a single txn.
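
For the catalog side, pg_subscription_rel already tracks each table's
sync state ('i' = init, 'd' = data copy, 's' = sync done, 'r' =
ready); the idea above just adds one more on-disk state recording that
the COPY has finished. A rough sketch (the new state letter 'f' and
the subscription name 'mysub' are illustrative only):

    -- Existing catalog: per-table sync progress for a subscription.
    SELECT srrelid::regclass, srsubstate, srsublsn
      FROM pg_subscription_rel
     WHERE srsubid = (SELECT oid FROM pg_subscription
                      WHERE subname = 'mysub');

    -- Hypothetical extra state, say 'f' ("finished copy"): written once
    -- the initial COPY commits, so a restarted tablesync worker knows it
    -- can skip the copy and resume streaming from the permanent slot's
    -- position.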

Solution-2: The next solution we discussed is to make the tablesync
worker just gather messages after the COPY, in a way similar to how
the current streaming of in-progress transactions feature gathers
messages into a "changes" file, so that they can be replayed later by
the apply worker. Here, as we don't need to replay the individual
transactions in the tablesync worker in a single transaction, it will
allow us to send transactions decoded at prepare time to the
subscriber. This has some disadvantages, such as each transaction
processed by the tablesync worker needing to be durably written to a
file, and it can also lead to some apply lag later when the same is
processed by the apply worker.
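
For reference, the apply-side analogue already exists: large
in-progress transactions can be streamed, spooled to files, and
replayed at commit. That behaviour is enabled per subscription like
this (the subscription name and connection string are illustrative):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION mypub
        WITH (streaming = on);

Solution-2 would reuse that kind of spooling for the tablesync
worker's post-COPY changes.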

Solution-3: Allow the table-sync workers to just perform the initial
COPY, and then, once the COPY is done for all relations, have the
apply worker stream all the future changes. Now, surely, if large
copies are required for multiple relations, then we would delay the
(partial) replay of transactions by the apply worker a bit, but I
don't know how much that matters compared to the transaction
visibility issue, and anyway we would have achieved the maximum
parallelism by allowing the copy via multiple workers. This would
reduce the load (both CPU and I/O) on the publisher side by decoding
the data only once, instead of once for each tablesync worker and
separately for the apply worker. I think it will use fewer resources
to finish the work.

Currently, in the tablesync worker, we create a slot with the
CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the
publisher, and then we use the same snapshot for the COPY from the
publisher. After this, when we try to receive the data from the
publisher using the same slot, it will be in sync with the COPY. To
keep the same consistency between the COPY and the data we receive
from the publisher in this approach, I think we need to export the
snapshot while creating the slot in the apply worker by using
CRS_EXPORT_SNAPSHOT, and then have all the tablesync workers doing the
copy use that same snapshot. In the tablesync workers, we can use the
SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION
LEVEL REPEATABLE READ" to use the exported snapshot. That way the COPY
will use the same snapshot as is used for receiving the changes in the
apply worker, and the data will be in sync.
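
To make that concrete, a sketch of the protocol-level flow (the slot
name, table name, and the returned snapshot identifier are
illustrative):

    -- Apply worker's replication connection: create the slot and export
    -- its snapshot; the command returns a snapshot name.
    CREATE_REPLICATION_SLOT mysub_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
    -- returns, e.g., snapshot_name = '00000003-00000002-1'

    -- Each tablesync worker's connection: run the COPY under exactly
    -- that snapshot, so it lines up with the slot's change stream.
    BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-00000002-1';
    COPY public.mytable TO STDOUT;
    COMMIT;

One caveat: an exported snapshot remains importable only while the
exporting session holds it open (for a walsender, until it runs
another command or starts streaming), so the ordering between slot
creation and starting the copies matters.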

Then we also need a way to export a snapshot while the apply worker is
already receiving changes, because users can run 'ALTER SUBSCRIPTION
name REFRESH PUBLICATION', which allows new tables to be synced. I
think we need to introduce a new command in exec_replication_command()
to export the snapshot from the existing slot, which the new tablesync
worker can then use.
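
A minimal sketch of what such a command might look like (the command
name and output here are purely hypothetical; no such command exists
today):

    -- Issued on the apply worker's existing slot while it is in use:
    EXPORT_SNAPSHOT_FOR_SLOT mysub_slot;
    -- would return a snapshot name usable with SET TRANSACTION SNAPSHOT
    -- in the new tablesync worker's connection, as in the previous
    -- sketch.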

Among the above three solutions, the first two will solve the first
problem (allow the tablesync worker to decode transactions at prepare
time), and the third will solve both the first and the second problem
(partial transaction data visibility). The third solution requires
quite some redesign of how the Logical Replication work is
synchronized between the apply and tablesync workers and might turn
out to be a bigger implementation effort. I am tentatively thinking to
go with the first or second solution at this stage, and anyway, if
people later feel that we need some bigger redesign, then we can go
with something along the lines of Solution-3.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

--
With Regards,
Amit Kapila.
