From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Andres Freund" <andres(at)anarazel(dot)de>, "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com> |
Cc: | "Michael Paquier" <michael(at)paquier(dot)xyz>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Interrupts vs signals |
Date: | 2025-07-23 07:42:33 |
Message-ID: | d6cb422c-fed7-4227-bfa6-0b057b06f5b4@app.fastmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 6, 2025, at 17:43, Heikki Linnakangas wrote:
> On 06/03/2025 02:47, Heikki Linnakangas wrote:
>> Here's a new patch set. It includes the previous work, and also goes the
>> whole hog and replaces procsignals and many other signalling with
>> interrupts. It's based on Thomas's v3-0002-Redesign-interrupts-remove-
>> ProcSignals.patch much earlier in this thread.
>
> And here's yet another version. It's the same at high level, but with a
> ton of little fixes.
>
> One notable change is that I merged storage/interrupt.[ch] with
> postmaster/interrupt.[ch], so the awkwardness of having two files with
> same name is gone.
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>
> Attachments:
> * v7-0001-Replace-Latches-with-Interrupts.patch
> * v7-0002-Fix-lost-wakeup-issue-in-logical-replication-laun.patch
> * v7-0003-Use-INTERRUPT_GENERAL-for-bgworker-state-change-n.patch
> * v7-0004-Replace-ProcSignals-and-other-ad-hoc-signals-with.patch
Great work in this thread. I'll try to help, definitively benchmarking,
but will also try to load the new design into my brain, to get the
correct mental model of it, so I can hopefully help with code review as
well.
I'm trying to figure out what all possible states a backend can be in.
With the new design, it basically seems like there are two different
types of bits? An Interrupt bit type, and a Sleeping bit. This gives
four possible states, if I'm not mistaken.
I say "bit type" since there are of course many different bits, for all
the existing interrupts, previously called "reasons" in the old design.
But all those type of interrupt bits, are just one type of bit. The
other type of bit seems to be the "sleeping bit".
I note how the InterruptType is overloaded to also be used for the
sleeping bit (SLEEPING_ON_INTERRUPTS).
Would it be correct to say that a backend can be in any of these four
states?
1. Interrupt clear, Sleeping clear: The backend is running its code and
is not waiting for anything.
2. Interrupt set, Sleeping clear: Interrupt pending. A sender has set
our interrupt bit, but we were still executing code. Crucially, the
sender saw the Sleeping bit was clear and thus did not send a physical
wakeup. The pending interrupt will be serviced at the next
CHECK_FOR_INTERRUPTS() call.
3. Interrupt clear, Sleeping set: Preparing to block. This is the
critical, short-lived state that solves the lost-wakeup problem. We have
just atomically set the Sleeping bit, advertising our intent to enter a
kernel wait. Any sender that sees this state knows it is now responsible
for sending a physical wakeup signal.
4. Interrupt set, Sleeping set: Wakeup in progress. This state confirms
the race was handled. A sender caught us in state 3, flipped our
Interrupt bit, saw our Sleeping bit was set, and sent the required
physical wakeup. We will enter the kernel wait but return immediately,
service the interrupt, and reset both bits.
If my mental model of your new design is correct, I was expecting to see similar
performance gains for LISTEN/NOTIFY, thanks to not unnecessarily interrupting
listening backends.
To test, I checked out an old git hash (d611f8b) from 2025-03-06, so I
could apply the v7 patch set.
I then ran one of the benchmark scripts that I've created for my work on
async.c.
It's a great win compared to master, but for some reason, your v7 patch
set is a bit slower on all benchmark configs I tested, except -j 1 -c 1,
where it's the same.
Connections=Jobs | TPS (patch-joel) | TPS (patch-v7) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+------------------+----------------+-------------------+-----------------+----------------
1 | 151510 | 151119 | -0.26% | 923 | 577
2 | 239051 | 227925 | -4.65% | 1596 | 638
4 | 250910 | 235238 | -6.25% | 4891 | 5628
8 | 171944 | 167271 | -2.72% | 2752 | 5182
16 | 165482 | 148305 | -10.38% | 2825 | 6447
32 | 145150 | 140216 | -3.40% | 1566 | 1867
64 | 131836 | 122617 | -6.99% | 573 | 253
128 | 121333 | 113668 | -6.32% | 874 | 981
(8 rows)
This might be due to other changes since Mars, since my patch is against
master.
When you've rebased, I can run the benchmarks again, and try to find the
culprit.
Btw, I really like the idea of a "stepping stone" approach, mentioned by
Thomas Munro in the other thread [1]. That is, looking at if we can find
ways to do smaller incremental changes, with a great ratio between
performance gains and additional complexity cost, for every committable
step.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Jim Jones | 2025-07-23 08:03:54 | Re: display hot standby state in psql prompt |
Previous Message | Bertrand Drouvot | 2025-07-23 07:33:09 | Re: Custom pgstat support performance regression for simple queries |