Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Christoph Berg <myon(at)debian(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Date: 2023-10-15 06:00:00
Message-ID: b2a1f7d0-7629-72c0-2da7-e9c4e336b010@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

13.08.2023 00:00, Andres Freund wrote:
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

Yesterday I wondered what's is the reason of other failure on skink [1],
[2], [3]:
#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
#   at t/035_standby_logical_decoding.pl line 224.

I've reproduced the failure by running several test instances in parallel
(with 15 jobs) under Valgrind and saw a difference in the primary node WAL:
The failed run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax:
752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ;
inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT
2023-10-14 13:35:21.955043 MSK; inval msgs: catcache 11 catcache 10
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: FPI_FOR_HINT ,
blkref #0: rel 1664/0/1260 fork fsm blk 2 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04041CA8, prev 0/0403FC58, desc: FPI_FOR_HINT ,
blkref #0: rel 1664/0/1260 fork fsm blk 1 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04043CF8, prev 0/04041CA8, desc: FPI_FOR_HINT ,
blkref #0: rel 1664/0/1260 fork fsm blk 0 FPW
rmgr: Heap        len (rec/tot):    225/   225, tx:          0, lsn: 0/04045D48, prev 0/04043CF8, desc: INPLACE off: 26,
blkref #0: rel 1663/16384/16406 blk 4
rmgr: Transaction len (rec/tot):     78/    78, tx:          0, lsn: 0/04045E30, prev 0/04045D48, desc: INVALIDATION ;
inval msgs: catcache 55 catcache 54 relcache 1260
rmgr: Standby     len (rec/tot):     90/    90, tx:          0, lsn: 0/04045E80, prev 0/04045E30, desc: INVALIDATIONS ;
relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 relcache 1260

A successful run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax:
752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ;
inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT
2023-10-14 14:21:47.254556 MSK; inval msgs: catcache 11 catcache 10
rmgr: Heap2       len (rec/tot):     57/    57, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: PRUNE
snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [16], unused: [], blkref #0:
rel 1664/0/1260 blk 0
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FC98, prev 0/0403FC58, desc: VACUUM ndeleted:
1, nupdated: 0, deleted: [1], updated: [], blkref #0: rel 1664/0/2676 blk 1
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FCD0, prev 0/0403FC98, desc: VACUUM ndeleted:
1, nupdated: 0, deleted: [16], updated: [], blkref #0: rel 1664/0/2677 blk 1
rmgr: Heap2       len (rec/tot):     50/    50, tx:          0, lsn: 0/0403FD08, prev 0/0403FCD0, desc: VACUUM nunused:
1, unused: [16], blkref #0: rel 1664/0/1260 blk 0

I repeated the hack Thomas did before (added
"if (rand() % 10 == 0) return false;" in ConditionalLockBufferForCleanup())
and reproduced that failure without Valgrind and parallel.
The following changes:
@@ -562 +562 @@
-  VACUUM pg_class;
+  VACUUM FREEZE pg_class;
@@ -600 +600 @@
-  VACUUM pg_authid;
+  VACUUM FREEZE pg_authid;
@@ -638 +638 @@
-  VACUUM conflict_test;
+  VACUUM FREEZE conflict_test;
fixed the issue for me

Second change should protect from a similar failure [4]:
#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
#   at t/035_standby_logical_decoding.pl line 224.

(I decided to write about failures of 035_standby_logical_decoding in this
thread because of the same VACUUM issue.)

I also confirmed that all the other tests in src/test/recovery pass reliably
with the modified ConditionalLockBufferForCleanup (except for
027_stream_regress, where plans changed).

When I had overcome the failures with pg_authid/pg_class, I stumbled upon
other even more rare one (which occurs without modifying
ConditionalLockBufferForCleanup):
t/035_standby_logical_decoding.pl .. 42/?
#   Failed test 'inactiveslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 224.
But I think that's another case, which I'm going to investigate separately.

BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
in scenario 3 is a copy-paste error.
Also, there are two "Scenario 4" in this test.

Long story short, VACUUM FREEZE should make skink greener, but still there
is another possibility for the 035 test to fail.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-05%2009%3A47%3A44
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-13%2012%3A16%3A52
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-15%2017%3A10%3A55
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-08-27%2008%3A30%3A28

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Erki Eessaar 2023-10-15 07:22:59 Re: PostgreSQL domains and NOT NULL constraint
Previous Message Andrei Lepikhov 2023-10-15 05:40:36 Re: Asymmetric partition-wise JOIN