From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639) |
Date: | 2023-06-13 04:18:31 |
Message-ID: | OS0PR01MB57162DAF28227A05581957519455A@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >
> >
> > Well, I think the issue is pretty clear - we end up with an initial
> > snapshot that's in between the ASSIGNMENT and NEW_CID, and because
> > NEW_CID has both xact and subxact XID it fails because we add two TXNs
> > with the same LSN, not realizing one of them is subxact.
> >
> > That's obviously wrong, although somewhat benign in production because
> > it only fails because of hitting an assert.
> >
>
> Doesn't this indicate that we can end up decoding a partial transaction when
> we restore a snapshot? Won't that be a problem even for production?
Yes, I think it can cause the problem that only partial changes of a transaction are streamed.
I tried to reproduce this and here are the steps. Note, to make sure the test
won't be affected by other running_xact WALs, I changed LOG_SNAPSHOT_INTERVAL_MS
to a bigger number.
session 1:
-----
create table test(a int);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----
session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----
session 3:
-----
- Create another slot isolation_slot_2, it should choose a restart_point which is
- after the changes that happened in session 2. Note, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----
session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend will
- serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)
SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----
session 2:
-----
- Insert some different data and commit the transaction.
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT
-----
session 3:
-----
- Release the process and try to stream the changes, since the restart point is
- at the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:
SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----
Results (partial streamed changes):
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.test: INSERT: a[integer]:2
table public.test: INSERT: a[integer]:3
table public.test: INSERT: a[integer]:4
COMMIT
(5 rows)
>
> > Regular builds are likely to
> > just ignore it, although I haven't checked if the COMMIT cleanup (I
> > wonder if we remove the subxact from the toplevel list on commit).
> >
> > I think the problem is we just grab an existing snapshot, before all
> > running xacts complete. Maybe we should fix that, and leave the
> > needs_full_snapshot alone.
> >
>
> It is not clear what exactly you have in mind to fix this because if there is no
> running xact, we don't even need to restore the snapshot because of a prior
> check "if (running->oldestRunningXid ==
> running->nextXid)". I think the main problem is that we started
> decoding immediately from the point where we restored a snapshot as at that
> point we could have some partial running xacts.
Best Regards,
Hou zj
From | Date | Subject | |
---|---|---|---|
Next Message | Dilip Kumar | 2023-06-13 04:26:46 | Re: index prefetching |
Previous Message | Andres Freund | 2023-06-13 04:13:35 | Re: pg_stat_io for the startup process |