Quick Links

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
Date:	2022-12-29 08:40:52
Message-ID:	20221229084052.GA1514070@rfd.leadboat.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Jul 26, 2022 at 07:22:52PM -0400, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > ... The regular expression machinery is capable of
> > consuming a lot of CPU, and does CANCEL_REQUESTED(nfa->v->re)
> > frequently to avoid getting stuck. With the patch as it stands, that
> > would never be true.
>
> Surely that can't be too hard to fix. We might have to refactor
> the code around QueryCancelPending a little bit so that callers
> can ask "do we need a query cancel now?" without actually triggering
> a longjmp ... but why would that be problematic?

It could work. The problems are like those of making code safe to run in a
signal handler. You can use e.g. snprintf in rcancelrequested(), but you
still can't use palloc() or ereport(). I see at least these strategies:

1. Accept that recovery conflict checks run after a regex call completes.
2. Have rcancelrequested() return true unconditionally if we need a conflict
check. If there's no actual conflict, restart the regex.
3. Have rcancelrequested() run the conflict check, including elog-using
PostgreSQL code. On longjmp(), accept the leak of regex mallocs.
4. Have rcancelrequested() run the conflict check, including elog-using
PostgreSQL code. On longjmp(), escalate to FATAL.
5. Write the conflict check code to dutifully avoid longjmp().
6. Convert src/backend/regex to use palloc, so longjmp() is fine.

I would tend to pick (3). (6) could come later and remove the drawback of
(3). Does one of those unblock the patch, or not?

===

I found this thread because $SUBJECT is causing more buildfarm failures
lately. Here are just the ones with symptom "timed out waiting for match:
(?^:User was holding a relation lock for too long)":

I can reproduce that symptom reliably, on GNU/Linux, with the attached patch
adding sleeps. The key log bit:

2022-09-16 11:50:37.338 CEST [15022:4] 031_recovery_conflict.pl LOG: statement: BEGIN;
2022-09-16 11:50:37.339 CEST [15022:5] 031_recovery_conflict.pl LOG: statement: LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
2022-09-16 11:50:37.341 CEST [15022:6] 031_recovery_conflict.pl LOG: statement: SELECT 1;
2022-09-16 11:50:38.076 CEST [14880:17] LOG: recovery still waiting after 11.482 ms: recovery conflict on lock
2022-09-16 11:50:38.076 CEST [14880:18] DETAIL: Conflicting process: 15022.
2022-09-16 11:50:38.076 CEST [14880:19] CONTEXT: WAL redo at 0/34243F0 for Standby/LOCK: xid 733 db 16385 rel 16386
2022-09-16 11:50:38.196 CEST [15022:7] 031_recovery_conflict.pl FATAL: terminating connection due to conflict with recovery
2022-09-16 11:50:38.196 CEST [15022:8] 031_recovery_conflict.pl DETAIL: User transaction caused buffer deadlock with recovery.
2022-09-16 11:50:38.196 CEST [15022:9] 031_recovery_conflict.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-09-16 11:50:38.197 CEST [15022:10] 031_recovery_conflict.pl LOG: disconnection: session time: 0:00:01.041 user=nm database=test_db host=[local]
2022-09-16 11:50:38.198 CEST [14880:20] LOG: recovery finished waiting after 132.886 ms: recovery conflict on lock

The second DETAIL should be "User was holding a relation lock for too long."
The backend in question is idle in transaction. RecoveryConflictInterrupt()
for PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK won't see IsWaitingForLock(),
so it will find no conflict. However, RecoveryConflictReason remains
clobbered, hence the wrong DETAIL message. Incidentally, the affected test
contains comment "# DROP TABLE containing block which standby has in a pinned
buffer". The standby holds no pin at that moment; the LOCK TABLE pins system
catalog pages, but it drops every pin it acquires.

Attachment	Content-Type	Size
repro-race-RecoveryConflictInterrupt-v0.patch	text/plain	2.2 KB

In response to

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler? at 2022-07-26 23:22:52 from Tom Lane

Responses

Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler? at 2022-12-30 21:06:53 from Thomas Munro
Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler? at 2023-01-04 22:16:50 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2022-12-29 13:24:53	Re: Perform streaming logical transactions by background workers and parallel apply
Previous Message	Nikita Malakhov	2022-12-29 07:39:34	Re: split TOAST support out of postgres.h