Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Date: 2024-01-11 10:00:00
Message-ID: 5d0e5bec-67f9-9164-36cb-c4ff5f95d1ed@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Bertrand,

10.01.2024 19:32, Bertrand Drouvot wrote:
>
>> is an absent message 'obsolete replication slot "row_removal_activeslot"'
> Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
>
> Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
>
> In case of an active slot we first call ReportSlotInvalidation with the second
> parameter set to true (to emit the "terminating" message), then SIGTERM the active
> process and then (later) we should call the other ReportSlotInvalidation()
> call with the second parameter set to false (to emit the message that we don't
> see here).
>
> So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
> call.

I've found a way to reproduce the issue without slowing down a machine or
running multiple tests in parallel. It's enough for this to add a delay to
allow BackgroundWriterMain() to execute LogStandbySnapshot():
@@ -692,2 +690,3 @@
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);

With this delay, I get the failure immediately:
$ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 47/?
#   Failed test 'activeslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 227.

_primary.log contains:
2024-01-11 09:37:01.731 UTC [67656] 035_standby_logical_decoding.pl STATEMENT:  UPDATE prun SET s = 'D';
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  statement: SELECT pg_sleep(15);
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  xlog flush request 0/404D8F0; write 0/0; flush
0/0 at character 8
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl CONTEXT:  writing block 14 of relation base/16384/1247
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl STATEMENT:  SELECT pg_sleep(15);
2024-01-11 09:37:01.905 UTC [67204] LOG:  xlog flush request 0/404DA58; write 0/404BB00; flush 0/404BB00
2024-01-11 09:37:01.905 UTC [67204] CONTEXT:  writing block 4 of relation base/16384/2673
2024-01-11 09:37:12.514 UTC [67204] LOG:  INSERT @ 0/4057768:  - Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid
768 oldestRunningXid 769
2024-01-11 09:37:12.514 UTC [67206] LOG:  xlog bg flush request write 0/4057768; flush: 0/4057768, current is write
0/4057730; flush 0/4057730
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  statement: UPDATE prun SET s = 'E';
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  INSERT @ 0/40577A8:  - Heap2/PRUNE:
snapshotConflictHorizon: 768,...

Note RUNNING_XACTS here...

_standby.log contains:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
2024-01-11 09:37:16.842 UTC [67606] CONTEXT:  WAL redo at 0/4057768 for Heap2/PRUNE: snapshotConflictHorizon: 768, ...
and no 'invalidating obsolete replication slot "pruning_activeslot"' below.

Debug logging added (see attached) gives more information:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| RS_INVAL_HORIZON, s: 0x7f7985475c10,
s->effective_xmin: 0, s->effective_catalog_xmin: 769, snapshotConflictHorizon: 768
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| conflict: 0

so the condition TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
    snapshotConflictHorizon) is not satisfied, hence conflict = 0 and it breaks
the loop in InvalidatePossiblyObsoleteSlot().
Several lines above in the log we can see:
2024-01-11 09:37:12.514 UTC [67606] LOG:  REDO @ 0/4057730; LSN 0/4057768: prev 0/4057700; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid 768 oldestRunningXid 769
2024-01-11 09:37:12.540 UTC [67643] 035_standby_logical_decoding.pl LOG:  !!!LogicalConfirmReceivedLocation|
MyReplicationSlot: 0x7f7985475c10, MyReplicationSlot->effective_catalog_xmin: 769

and that's the first occurrence of xid 769 in the log.

The decoded stack trace for the LogicalConfirmReceivedLocation call is:
ogicalConfirmReceivedLocation at logical.c:1886:1
ProcessStandbyReplyMessage at walsender.c:2327:1
ProcessStandbyMessage at walsender.c:2188:1
ProcessRepliesIfAny at walsender.c:2121:5
WalSndWaitForWal at walsender.c:1735:7
logical_read_xlog_page at walsender.c:1068:13
ReadPageInternal at xlogreader.c:1062:12
XLogDecodeNextRecord at xlogreader.c:601:5
XLogReadAhead at xlogreader.c:976:5
XLogReadRecord at xlogreader.c:406:3
XLogSendLogical at walsender.c:3229:5
WalSndLoop at walsender.c:2658:7
StartLogicalReplication at walsender.c:1477:2
exec_replication_command at walsender.c:1985:6
PostgresMain at postgres.c:4649:10

Best regards,
Alexander

Attachment Content-Type Size
adhoc-035-failure-debugging.patch text/x-patch 4.4 KB
035-pruning-failure-logs.tar.gz application/gzip 149.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2024-01-11 10:12:08 Re: Synchronizing slots from primary to standby
Previous Message Will Mortensen 2024-01-11 09:51:20 Re: Exposing the lock manager's WaitForLockers() to SQL