From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed |
Date: | 2024-01-11 10:00:00 |
Message-ID: | 5d0e5bec-67f9-9164-36cb-c4ff5f95d1ed@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Bertrand,
10.01.2024 19:32, Bertrand Drouvot wrote:
>
>> is an absent message 'obsolete replication slot "row_removal_activeslot"'
> Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
>
> Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
>
> In case of an active slot we first call ReportSlotInvalidation with the second
> parameter set to true (to emit the "terminating" message), then SIGTERM the active
> process and then (later) we should call the other ReportSlotInvalidation()
> call with the second parameter set to false (to emit the message that we don't
> see here).
>
> So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
> call.
I've found a way to reproduce the issue without slowing down a machine or
running multiple tests in parallel. It's enough for this to add a delay to
allow BackgroundWriterMain() to execute LogStandbySnapshot():
@@ -692,2 +690,3 @@
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
With this delay, I get the failure immediately:
$ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 47/?
# Failed test 'activeslot slot invalidation is logged with on-access pruning'
# at t/035_standby_logical_decoding.pl line 227.
_primary.log contains:
2024-01-11 09:37:01.731 UTC [67656] 035_standby_logical_decoding.pl STATEMENT: UPDATE prun SET s = 'D';
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG: statement: SELECT pg_sleep(15);
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG: xlog flush request 0/404D8F0; write 0/0; flush
0/0 at character 8
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl CONTEXT: writing block 14 of relation base/16384/1247
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl STATEMENT: SELECT pg_sleep(15);
2024-01-11 09:37:01.905 UTC [67204] LOG: xlog flush request 0/404DA58; write 0/404BB00; flush 0/404BB00
2024-01-11 09:37:01.905 UTC [67204] CONTEXT: writing block 4 of relation base/16384/2673
2024-01-11 09:37:12.514 UTC [67204] LOG: INSERT @ 0/4057768: - Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2024-01-11 09:37:12.514 UTC [67206] LOG: xlog bg flush request write 0/4057768; flush: 0/4057768, current is write
0/4057730; flush 0/4057730
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG: statement: UPDATE prun SET s = 'E';
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG: INSERT @ 0/40577A8: - Heap2/PRUNE:
snapshotConflictHorizon: 768,...
Note RUNNING_XACTS here...
_standby.log contains:
2024-01-11 09:37:16.842 UTC [67606] LOG: invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL: The slot conflicted with xid horizon 768.
2024-01-11 09:37:16.842 UTC [67606] CONTEXT: WAL redo at 0/4057768 for Heap2/PRUNE: snapshotConflictHorizon: 768, ...
and no 'invalidating obsolete replication slot "pruning_activeslot"' below.
Debug logging added (see attached) gives more information:
2024-01-11 09:37:16.842 UTC [67606] LOG: invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL: The slot conflicted with xid horizon 768.
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| RS_INVAL_HORIZON, s: 0x7f7985475c10,
s->effective_xmin: 0, s->effective_catalog_xmin: 769, snapshotConflictHorizon: 768
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| conflict: 0
so the condition TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
snapshotConflictHorizon) is not satisfied, hence conflict = 0 and it breaks
the loop in InvalidatePossiblyObsoleteSlot().
Several lines above in the log we can see:
2024-01-11 09:37:12.514 UTC [67606] LOG: REDO @ 0/4057730; LSN 0/4057768: prev 0/4057700; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid 768 oldestRunningXid 769
2024-01-11 09:37:12.540 UTC [67643] 035_standby_logical_decoding.pl LOG: !!!LogicalConfirmReceivedLocation|
MyReplicationSlot: 0x7f7985475c10, MyReplicationSlot->effective_catalog_xmin: 769
and that's the first occurrence of xid 769 in the log.
The decoded stack trace for the LogicalConfirmReceivedLocation call is:
ogicalConfirmReceivedLocation at logical.c:1886:1
ProcessStandbyReplyMessage at walsender.c:2327:1
ProcessStandbyMessage at walsender.c:2188:1
ProcessRepliesIfAny at walsender.c:2121:5
WalSndWaitForWal at walsender.c:1735:7
logical_read_xlog_page at walsender.c:1068:13
ReadPageInternal at xlogreader.c:1062:12
XLogDecodeNextRecord at xlogreader.c:601:5
XLogReadAhead at xlogreader.c:976:5
XLogReadRecord at xlogreader.c:406:3
XLogSendLogical at walsender.c:3229:5
WalSndLoop at walsender.c:2658:7
StartLogicalReplication at walsender.c:1477:2
exec_replication_command at walsender.c:1985:6
PostgresMain at postgres.c:4649:10
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
adhoc-035-failure-debugging.patch | text/x-patch | 4.4 KB |
035-pruning-failure-logs.tar.gz | application/gzip | 149.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dilip Kumar | 2024-01-11 10:12:08 | Re: Synchronizing slots from primary to standby |
Previous Message | Will Mortensen | 2024-01-11 09:51:20 | Re: Exposing the lock manager's WaitForLockers() to SQL |