Re: [HACKERS] logical decoding of two-phase transactions

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2020-11-18 07:47:39
Message-ID: CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi.

Using a tablesync debugging technique as described in another mail
thread [1][2] I have caused the tablesync worker to handle (e.g.
apply_dispatch) a 2PC PREPARE

This exposes a problem with the current 2PC logic because if/when the
PREPARE is processed by the tablesync worker then the txn will end up
being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT
PREPARED by the publisher.

For example, below is some logging (using my patch [2]) which shows
this occurring:

---

[postgres(at)CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE
SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub;"
2020-11-18 17:00:37.394 AEDT [15885] LOG: logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.394 AEDT [15885] DETAIL: There are no running transactions.
2020-11-18 17:00:37.394 AEDT [15885] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2020-11-18 17:00:37.407 AEDT [15886] LOG: logical replication apply
worker for subscription "tap_sub" has started
2020-11-18 17:00:37.407 AEDT [15886] LOG: !!>> The apply worker
process has PID = 15886
2020-11-18 17:00:37.415 AEDT [15887] LOG: starting logical decoding
for slot "tap_sub"
2020-11-18 17:00:37.415 AEDT [15887] DETAIL: Streaming transactions
committing after 0/16EF878, reading WAL from 0/16EF840.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15887] LOG: logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.415 AEDT [15887] DETAIL: There are no running transactions.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:37.421 AEDT [15889] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> The tablesync worker
process has PID = 15889
2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>>

Sleeping 30 secs. For debugging, attach to process 15889 now!

[postgres(at)CentOS7-x64 ~]$ 2020-11-18 17:00:38.431 AEDT [15886] LOG:
!!>> apply worker: LogicalRepApplyLoop
2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
-- etc.
2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.426 AEDT [15889] LOG: !!>> tablesync worker:
About to call LogicalRepSyncTableStart to do initial syncing
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
-- etc.
2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:27.454 AEDT [17456] LOG: logical decoding found
consistent point at 0/16EF878
2020-11-18 17:01:27.454 AEDT [17456] DETAIL: There are no running transactions.
2020-11-18 17:01:27.454 AEDT [17456] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub_24582_sync_16385" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2020-11-18 17:01:27.456 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:27.457 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:27.465 AEDT [15889] LOG: !!>> tablesync worker: wait
for CATCHUP state notification
2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables

#### Here, while the tablesync worker is paused in the debugger I
execute the PREPARE txn on publisher

psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1,
'foo');PREPARE TRANSACTION 'test_prepared_tab';"
PREPARE TRANSACTION

2020-11-18 17:01:54.732 AEDT [15887] LOG: !!>>
pgoutput_begin_txn
2020-11-18 17:01:54.732 AEDT [15887] CONTEXT: slot "tap_sub", output
plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0
2020-11-18 17:01:54.732 AEDT [15887] STATEMENT: START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')

#### And then in the debugger I let the tablesync worker continue...

2020-11-18 17:02:02.788 AEDT [15889] LOG: !!>> tablesync worker:
received CATCHUP state notification
2020-11-18 17:02:07.729 AEDT [15889] LOG: !!>> tablesync worker:
Returned from LogicalRepSyncTableStart
2020-11-18 17:02:16.284 AEDT [17456] LOG: starting logical decoding
for slot "tap_sub_24582_sync_16385"
2020-11-18 17:02:16.284 AEDT [17456] DETAIL: Streaming transactions
committing after 0/16EF8B0, reading WAL from 0/16EF878.
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:16.284 AEDT [17456] LOG: logical decoding found
consistent point at 0/16EF878
2020-11-18 17:02:16.284 AEDT [17456] DETAIL: There are no running transactions.
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:16.284 AEDT [17456] LOG: !!>>
pgoutput_begin_txn
2020-11-18 17:02:16.284 AEDT [17456] CONTEXT: slot
"tap_sub_24582_sync_16385", output plugin "pgoutput", in the begin
callback, associated LSN 0/16EF8B0
2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION
SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2',
publication_names '"tap_pub"')
2020-11-18 17:02:40.346 AEDT [15889] LOG: !!>> tablesync worker:
LogicalRepApplyLoop

#### The tablesync worker processes the replication messages....

2020-11-18 17:02:47.992 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'B'
2020-11-18 17:02:54.858 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'R'
2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'I'
2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker:
should_apply_changes_for_rel: true
2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker:
apply_dispatch for message kind 'P'
2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker:
called process_syncing_tables
2020-11-18 17:02:59.011 AEDT [15889] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has finished

#### SInce the tablesync was "ahead", the apply worker now needs to
skip those same messages
#### Notice should_apply_changes_for_rel() is false
#### Then apply worker just waits for next messages....

2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'B'
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'R'
2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'I'
2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker:
should_apply_changes_for_rel: false
2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker:
apply_dispatch for message kind 'P'
2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:04.081 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:04.082 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: called
process_syncing_tables
etc ...

#### At this point there is a problem because the tablesync worker has
COMMITTED that PREPARED INSERT.
#### See the subscriber node has ONE record but the publisher node has NONE!

[postgres(at)CentOS7-x64 ~]$ psql -d test_pub -c "SELECT count(*) FROM test_tab;"
count
-------
0
(1 row)

[postgres(at)CentOS7-x64 ~]$
[postgres(at)CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "SELECT
count(*) FROM test_tab;"
count
-------
1
(1 row)

[postgres(at)CentOS7-x64 ~]$

-----
[1] - https://www.postgresql.org/message-id/CAHut%2BPsprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz%2BAA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut%2BPt4PyKQCwqzQ%3DEFF%3DbpKKJD7XKt_S23F6L20ayQNxg77A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-11-18 08:06:48 Re: ResourceOwner refactoring
Previous Message Li Japin 2020-11-18 06:57:13 Re: Terminate the idle sessions