| From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
|---|---|
| To: | Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com> |
| Subject: | Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore |
| Date: | 2026-04-07 00:18:05 |
| Message-ID: | b0e6a640-5aed-469f-867a-cc820b046798@iki.fi |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 17/03/2026 12:31, Jelte Fennema-Nio wrote:
> On Mon, 16 Mar 2026 at 10:57, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> +1. With a little extra effort, the cancellation can be made abortable
>> too, so that you don't need to wait for the TCP timeout. I.e when
>> ResetCancelConn() is called, the cancellation thread can immediately
>> call PQcancelReset().
>
> I agree we could do that, but I don't think we should. Then we'd be
> getting into the exact situation where psql doesn't wait for an already
> in-flight cancel request to be processed by the server before sending
> the next query. i.e. while this would be fine if there's a network
> issue, it would be bad if the server is just slow to respond to the
> cancel request (e.g. because there's a PgBouncer in the middle that
> hasn't forwarded the request yet).
True, it can get messy fast. Hmm, perhaps you could check
PQcancelStatus() and only cancel the cancellation if it hasn't sent the
CancelRequest packet yet.
>> One a different topic, is there any guarantee on which thread will
>> receive the SIGINT? It matters because psql's cancel callback sometimes
>> calls longjmp(), which assumes that the signal handler is executed in
>> the main thread.
>
> Good point, I had thought about whether this mattered, but hadn't
> considered the callbacks. Attached is v7 that makes sure the signal is
> always handled by the main thread by blocking SIGINT before creating the
> cancel thread.
The more I look into the cancel handling in psql and other client
programs, the more I want to gouge my eyes out. It was pretty messy
before this patch, too, and now also we have an extra thread to worry about.
For example, let's look at what happens at COPY IN, starting from
HandleCopyResult(). It sets SetCancelConn(pset.db), and then calls
handleCopyIn(). handleCopyIn() calls sigsetjmp(), enables the longjmp in
the signal handler with "sigint_interrupt_enabled = true". and calls
fgets().
So we rely on longjmp() to get out of the fgets(). Is that safe? I
suppose, it's worked all these years. It's scary though, I would never
do that in new code. And then you add threads to the mix... is it still
safe in the presence of threads? glibc uses internal locks on FILE
objects to make them thread-safe; if you longjmp() in the middle of
fgets(), is the lock released? I suppose it doesn't matter if only one
thread ever accesses it, but ugh.
Because the signal handler longjmp()'d out, it never calls PQcancel()
(or wakes up the thread, with the patch) in that case, even though the
cancel connection is currently armed with SetCancelConn(). Is that
intentional? If you then immediately SIGINT again, then the signal
handler will PQcancel(). I guess there's some logic to that; if the
signal arrives while you're blocked on the input, sending the CopyEnd
probably still works. Perhaps is should be documented that the callback
is called first.
The cancel handling in wait_on_slots() in parallel_slot.c is surprising
in a different way. It already uses async libpq calls and has a select()
loop, but it still relies on the signal handler to do the cancellation.
And it arbitrarily PQcancel()s only one of the connections it waits on.
The comments you added in your patches help a lot, explaining the three
different ways that cancellation can be handled, thanks for that.
psql has a global variable, 'cancel_pressed', which is set in the signal
handler callback. Is it redundant with 'CancelRequested' that's also set
in the signal handler?
Some ideas on how to make this less confusing:
- Change pg_dump to use threads on Unix too.
- Extract the cancel_pipe stuff into a helper function, and reuse it in
pg_dump and in wait_on_slots()
- Make the callback and SetCancelConn() mutually exclusive, so that when
you call SetCancelConn(), it *replaces* the callback if any was set. To
re-install it, call SetCancelCallback(). This would make them more
symmetrical, and would remove the confusion on what happens with the
active cancel connection if the callback longjmp()s. The
SetCancelCallback()/ResetCancelCallback() calls would replace the
sigint_interrupt_enabled variable.
- Instead of SetCancelConn() and ResetCancelConn(), have functions like
"StartCancelInBackground(conn)" and "WaitBackgroundCancel()". The cancel
callback would then call StartCancelInBackground()".
Not sure how much these really help, but maybe something to try out.
Then there's the idea of switching to async libpq calls and select()
everywhere...
Another more radical idea: Could we move this functionality to libpq
itself? Imagine that we had a function like PQrequestCancel(PGconn
*conn) that just set a flag on the PGconn object to indicate that
cancellation has been requested. When that flag is set, all blocking
calls like PQexec() on the original connection drive the cancellation
connection and sending the cancel request, instead of (or in addition
to) polling the main connection. The new PQrequestCancel() would be safe
to call from the signal handler, but then main thread would do all the
work. That'd be very easy to use when applicable, but there's one pretty
serious problem: it wouldn't work with async libpq calls and select()
without more API changes. The internal cancel connection that's opened
would have a different fd from the one returned by PQsocket(), so the
application wouldn't know to poll it.
- Heikki
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Melanie Plageman | 2026-04-07 00:19:06 | Re: EXPLAIN: showing ReadStream / prefetch stats |
| Previous Message | Lukas Fittl | 2026-04-07 00:06:36 | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |