Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Date: 2024-01-10 14:00:01
Message-ID: 7c025095-5763-17a6-8c80-899b35bd0459@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

10.01.2024 07:26, Michael Paquier wrote:
> On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
>> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
>> 0174c2d21 should be superseded by a patch like discussed (or just have
>> autovacuum = off added)...
> Adding an extra FREEZE offers an extra insurance, so I don't see why
> it would be a problem to add it to stabilize the horizon conflicts on
> the standbys.

As 0174c2d21 added FREEZE already, I meant to add "autovacuum = off" or
apply a fix similar to what we're are discussing here.

>
>> Michael, it definitely increases stability of the test (tens of iterations
>> with 20 tests in parallel performed successfully), although I've managed to
>> see another interesting failure (twice):
>> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13      #   at t/035_standby_logical_decoding.pl line 227.
> Something I'd like to confirm here: you still see this failure with
> the patch, but without an extra FREEZE, right? If we do both, the
> test would get more stable, wouldn't it?

Yes, I tested the patch as-is, without FREEZE, but it looks like it doesn't
matter in that case. And sorry for misleading information about missing
VACUUM records in my previous message, please ignore it.

10.01.2024 12:46, Bertrand Drouvot wrote:

>> although I've managed to
>> see another interesting failure (twice):
>> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13      #   at t/035_standby_logical_decoding.pl line 227.
>>
> Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:
>
> "
> [11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
> [11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
> "
>
> That seems weird, the inactive slot has been invalidated while the active one is not.
> While it takes a bit longer to invalidate an active slot, I don't think the test can
> move on until both are invalidated (then leading to the tests 24 and 25)). I can
> see the tests are very slow to run (13.993s for 24) but still don't get how 24 could
> succeed while 25 does not.
>
> ...
>
> Would it be possible to also send the standby logs?

Yes, please look at the attached logs. This time I've build postgres with
-DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
wal_keep_size=1GB
wal_debug = on
log_autovacuum_min_duration = 0
log_statement = 'all'
log_min_messages = INFO

The archive attached contains logs from four runs:
recovery-1-ok -- an example of successful run for reference
recovery-7-pruning and recovery-19-pruning -- failures with a failed
 subtest 'activeslot slot invalidation is logged with on-access pruning'
recovery-15-vacuum_pg_class -- a failure with a failed
 subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

The distinction that I see in the failed run logs, for example in
recovery-15-vacuum_pg_class, 035_standby_logical_decoding_standby.log:
2024-01-10 11:00:18.700 UTC [789618] LOG:  REDO @ 0/4020220; LSN 0/4020250: prev 0/401FDE0; xid 753; len 20 -
Transaction/COMMIT: 2024-01-10 11:00:18.471694+00
2024-01-10 11:00:18.797 UTC [789618] LOG:  REDO @ 0/4020250; LSN 0/4020288: prev 0/4020220; xid 0; len 24 -
Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754
2024-01-10 11:00:19.013 UTC [789618] LOG:  REDO @ 0/4020288; LSN 0/40202C8: prev 0/4020250; xid 0; len 9; blkref #0: rel
1663/16384/2610, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused:
0, redirected: [], dead: [48], unused: []
2024-01-10 11:00:19.111 UTC [789618] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-01-10 11:00:19.111 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:19.111 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to release replication slot "row_removal_activeslot"
2024-01-10 11:00:29.109 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:29.109 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752,
nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel
1663/16384/2610, blk 0
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl ERROR:  canceling statement due to conflict with
recovery
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl DETAIL:  User was using a logical replication slot
that must be invalidated.
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT
"row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl LOG:  released logical replication slot
"row_removal_activeslot"

is an absent message 'obsolete replication slot "row_removal_activeslot"'
and an additional record 'Standby/RUNNING_XACTS', which can be found in
035_standby_logical_decoding_primary.log:
2024-01-10 11:00:18.515 UTC [783410] LOG:  xlog bg flush request write 0/4020250; flush: 0/4020250, current is write
0/4020220; flush 0/4020220
2024-01-10 11:00:18.646 UTC [783387] LOG:  INSERT @ 0/4020288:  - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid
753 oldestRunningXid 754
2024-01-10 11:00:18.702 UTC [790526] 035_standby_logical_decoding.pl LOG:  statement: SELECT (select
txid_snapshot_xmin(txid_current_snapshot()) - 753) > 0
2024-01-10 11:00:18.724 UTC [783410] LOG:  xlog bg flush request write 0/4020288; flush: 0/4020288, current is write
0/4020250; flush 0/4020250

So perhaps it can affect an active slot invalidation?

Best regards,
Alexander

Attachment Content-Type Size
035-failures.tar.gz application/gzip 811.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2024-01-10 14:12:28 Re: System username in pg_stat_activity
Previous Message Magnus Hagander 2024-01-10 13:59:42 Re: System username in pg_stat_activity