Add missing ConditionVariableCancelSleep() in slot.c

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Add missing ConditionVariableCancelSleep() in slot.c
Date: 2024-04-13 09:04:00
Message-ID: CALj2ACUkm5kSxhTbFccVCe6k6ncGktbNDgNOTNSV295K=DfDrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

It looks like there's missing ConditionVariableCancelSleep() in
InvalidatePossiblyObsoleteSlot() after waiting for the replication
slot to be released by another process. Although prepare to sleep
cancels the sleep if the previous one wasn't canceled, the calling
process still remains in the CV's wait queue for the last cycle after
the replication slot holding process releases the slot. I'm wondering
why there's no issue detected if the calling process isn't removed
from the CV's wait queue. It may be due to the cancel sleep call in
the sigsetjmp() path for almost every process. IMO, it's a clean
practice to cancel the sleep immediately after the sleep ends like
elsewhere.

Please find the attached patch to fix the issue. Alternatively, we can
just add ConditionVariableCancelSleep() outside of the for loop to
cancel the sleep of the last cycle if any. This also looks correct
because every prepare to sleep does ensure the previous sleep is
canceled, and if no sleep at all, the cacnel sleep call exits.

PS: I've encountered an assertion failure [1] some time back
sporadically in t/001_rep_changes.pl which I've reported in
https://www.postgresql.org/message-id/CALj2ACXO6TJ4rhc3Uz3MWJGob9e4H1C71qufH-DGKJh8k4QGZA%40mail.gmail.com.
I'm not so sure if this patch fixes the assertion failure. I ran the
tests for 100 times [2] on my dev system, I didn't see any issue with
the patch.

[1]
t/001_rep_changes.pl

2024-01-31 12:24:38.474 UTC [840166]
pg_16435_sync_16393_7330237333761601891 STATEMENT:
DROP_REPLICATION_SLOT pg_16435_sync_16393_7330237333761601891 WAIT
TRAP: failed Assert("list->head != INVALID_PGPROCNO"), File:
"../../../../src/include/storage/proclist.h", Line: 101, PID: 840166
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ExceptionalCondition+0xbb)[0x55c8edf6b8f9]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x6637de)[0x55c8edd517de]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ConditionVariablePrepareToSleep+0x85)[0x55c8edd51b91]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ReplicationSlotAcquire+0x142)[0x55c8edcead6b]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(ReplicationSlotDrop+0x51)[0x55c8edceb47f]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x60da71)[0x55c8edcfba71]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(exec_replication_command+0x47e)[0x55c8edcfc96a]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(PostgresMain+0x7df)[0x55c8edd7d644]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5ab50c)[0x55c8edc9950c]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5aab21)[0x55c8edc98b21]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x5a70de)[0x55c8edc950de]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(PostmasterMain+0x1534)[0x55c8edc949db]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(+0x459c47)[0x55c8edb47c47]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f19fe629d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f19fe629e40]
postgres: publisher: walsender ubuntu postgres [local]
DROP_REPLICATION_SLOT(_start+0x25)[0x55c8ed7c4565]
2024-01-31 12:24:38.476 UTC [840168]
pg_16435_sync_16390_7330237333761601891 LOG: statement: SELECT
a.attnum, a.attname, a.atttypid, a.attnum =
ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN
pg_catalog.pg_index i ON (i.indexrelid =
pg_get_replica_identity_index(16391)) WHERE a.attnum >
0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = ''
AND a.attrelid = 16391 ORDER BY a.attnum

[2] for i in {1..100}; do make check
PROVE_TESTS="t/001_rep_changes.pl"; if [ $? -ne 0 ]; then echo "The
command failed on iteration $i"; break; fi; done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v1-0001-Add-missing-ConditionVariableCancelSleep-in-slot..patch application/octet-stream 889 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2024-04-13 09:10:00 Remove unnecessary segment number calculation after wal_removed invalidation of replication slots
Previous Message Andres Freund 2024-04-13 08:36:35 Re: Parallel CREATE INDEX for BRIN indexes