Re: BUG #17695: Failed Assert in logical replication snapbuild.

From: bowenshi <zxwsbg(at)qq(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17695: Failed Assert in logical replication snapbuild.
Date: 2022-11-25 07:44:03
Message-ID: tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406@qq.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs




Thanks for your advice! I update my test script, now it can 100% reproduce problem in my computer. 





The new script and fix code are both in the patch. However, I met new problem after adding fix code. It fails into a new trap following:




TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 593, PID: 1576)

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(ExceptionalCondition+0xb9)[0xb1ca28]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(SnapBuildInitialSnapshot+0x1b3)[0x8f3c79]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x9136b9]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(exec_replication_command+0x398)[0x91456a]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostgresMain+0x7be)[0x984e4a]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c0d5b]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8c06cd]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x8bc9de]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT(PostmasterMain+0x117a)[0x8bc2b5]

postgres: master: walsender postgres [local] CREATE_REPLICATION_SLOT[0x7bdb13]


The problem is not very hard to reproduce, and I’m fixing it now.




Regards,




--


bowenshi




---- Replied Message ----







From


Masahiko Sawada<sawada(dot)mshk(at)gmail(dot)com>




Date


11/25/2022 13:58




To


<zxwsbg(at)qq(dot)com> ,
<pgsql-bugs(at)lists(dot)postgresql(dot)org>




Subject


Re: BUG #17695: Failed Assert in logical replication snapbuild.






On Thu, Nov 24, 2022 at 7:28 PM PG Bug reporting form

<noreply(at)postgresql(dot)org> wrote:


The following bug has been logged on the website:

Bug reference:      17695

Logged by:          施博文

Email address:      zxwsbg(at)qq(dot)com

PostgreSQL version: 14.6

Operating system:   centos

Description:

In PG14 or higher version, I notice that SnapBuildRestore don't set

builder->next_phase_at=InvalidTransactionId .

But in SnapBuildSerialize function, the assert check this condition.

Assert(builder->next_phase_at == InvalidTransactionId);

This would cause some problems, and I have repeat it with the perl test case

which I would update later, the problem is following:

TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId",

File: "snapbuild.c", Line: 1604, PID: 29974)

postgres: master: walsender postgres [local]

START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]

postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]

postgres: master: walsender postgres [local]

START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]

postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]

postgres: master: walsender postgres [local]

START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]

postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]

postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]

postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]

postgres: master: walsender postgres [local]

START_REPLICATION(exec_replication_command+0x42c)[0x914593]

postgres: master: walsender postgres [local]

START_REPLICATION(PostgresMain+0x7be)[0x984ddf]

postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]

postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]

postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]

postgres: master: walsender postgres [local]

START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]

postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]

/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]

postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]

Thank you for reporting the issue!

I could reproduce this issue with a small change and the following steps:

1. Add sleep after setting SNAPBUILD_BUILDING_SNAPSHOT.

@@ -1409,6 +1409,9 @@ SnapBuildFindSnapshot(SnapBuild *builder,

XLogRecPtr lsn, xl_running_xacts *runn

running->xcnt, running->nextXid)));

SnapBuildWaitSnapshot(running, running->nextXid);

+

+       elog(LOG, "sleep 10s");

+       pg_usleep(10 * 1000000);

}

2. Create and start database instance and create a table and the

replication slot 's1'.

create table test (i int);

select pg_logical_replication_slot('s1', 'test_decoding');

3. Create another replication slot "s2" while a transaction is running.

tx-1:

begin;

insert into test values (1);

tx-2:

select pg_logical_replication_slot('s2', 'test_decoding');

checkpoint;

commit;

Note that replication slot creation will wait for 10 sec after

commiting the tx-1, please go to the next step once the slot has been

created.

4. Start decoding on slot 's2'.

select pg_logical_slot_get_changes('s2', null, null);

The logical decoding will sleep. Please go to the next step after

seeing the log "sleep 10s".

5. While the logical decoding started at step 4 is sleeping, start

decoding on slot 's1'.

select pg_logical_slot_get_changes('s1', null, null);

6. After the decoding on slot 's1' wakes up, it fails due to the

assertion check.

This scenario simulates the case where the logical decoding (re)starts

where RUNNING_XACTS record having a running transactions and it

restores the serialized snapshot when decoding the next RUNNING_XACTS

record. Since we don't reset builder->next_phase_at when restoring a

serialized snapshot, the assertion check in SnapBuildSerialize fails.

Regarding the proposed patch, I've confirmed it fixes this issue. But

I think it's better to reset builder->next_phase_at right after the

following assertion check:

/* consistent snapshots have no next phase */

Assert(ondisk.builder.next_phase_at == InvalidTransactionId);

I could not reproduce this issue with your script in my environment. I

think it's better to include the reproducible test case in the patch

but I'm not sure how to do that without adding sleep/gdb attach.

Regards,

--

Masahiko Sawada

Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
unknown_filename text/html 10.1 KB
0001-Fix-snapshot-builder-s-state-after-restoring-snapsho.patch application/octet-stream 4.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2022-11-25 09:07:56 RE: BUG #17695: Failed Assert in logical replication snapbuild.
Previous Message Masahiko Sawada 2022-11-25 05:58:40 Re: BUG #17695: Failed Assert in logical replication snapbuild.