From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Chao Li" <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize LISTEN/NOTIFY
Date: 2025-10-19 22:06:45
Message-ID: b8df8040-1b15-4a1a-b5f6-74acebbd2d0c@app.fastmail.com
Lists: pgsql-hackers
On Thu, Oct 16, 2025, at 22:16, Tom Lane wrote:
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
>> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>>> realistic here?
>
>> I agree this looks like a real problem, since I guess it's not
>> completely unthinkable someone might have
>> some kind of trigger on a table, that could fire off NOTIFY
>> for each row, possibly causing hundreds of thousands of
>> notifies in the same db txn.
>
> We already de-duplicate identical NOTIFY operations for exactly that
> reason (cf. AsyncExistsPendingNotify). However, non-identical NOTIFYs
> obviously can't be merged.
>
> I wonder whether we could adapt that de-duplication logic so that
> it produces a list of unique channel names in addition to a list
> of unique NOTIFY events. One way could be a list/hashtable of
> channels used, and for each one a list/hashtable of unique payloads,
> rather than the existing single-level list/hashtable.
Thanks for the great idea! Yes, this was indeed possible.
0002-optimize_listen_notify-v20.patch:
* Added a channelHashtab field, created and updated together with hashtab.
  If channelHashtab exists, it's used within PreCommit_Notify to
  quickly build pendingNotifyChannels.
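For illustration, the two-level de-duplication Tom described could be sketched roughly like this. This is a minimal, self-contained sketch with hypothetical names and linear arrays; the actual patch uses dynahash tables, and sizes here are arbitrary:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical two-level structure: a list of unique channels, each
 * holding its unique payloads, instead of one flat list of
 * (channel, payload) pairs.  Building the list of unique channel
 * names then becomes a simple walk over `channels`.
 */
#define MAX_CHANNELS 64
#define MAX_PAYLOADS 64

struct channel_entry
{
    const char *channel;
    const char *payloads[MAX_PAYLOADS];
    int         npayloads;
};

static struct channel_entry channels[MAX_CHANNELS];
static int nchannels = 0;

/* Returns true if the (channel, payload) pair was new and recorded. */
static bool
add_pending_notify(const char *channel, const char *payload)
{
    struct channel_entry *ce = NULL;

    /* First level: find or create the channel entry. */
    for (int i = 0; i < nchannels; i++)
        if (strcmp(channels[i].channel, channel) == 0)
            ce = &channels[i];
    if (ce == NULL)
    {
        ce = &channels[nchannels++];
        ce->channel = channel;
        ce->npayloads = 0;
    }

    /* Second level: de-duplicate payloads within the channel. */
    for (int i = 0; i < ce->npayloads; i++)
        if (strcmp(ce->payloads[i], payload) == 0)
            return false;       /* identical NOTIFY, drop it */

    ce->payloads[ce->npayloads++] = payload;
    return true;
}
```

With this shape, the unique channel names needed by PreCommit_Notify fall out of the first level for free, rather than requiring an O(N^2) scan over all pending notifications.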
In this email, I'm also responding to the feedback from Arseniy Mukhin,
and I've rebased the alt1, alt2, and alt3 .txt patches on top of v20.
On Sat, Oct 18, 2025, at 18:41, Arseniy Mukhin wrote:
> Thank you for the new version and all implementations!
Thanks for the review and the great ideas!
>> > I think we can perhaps salvage the idea if we invent a separate
>> > "advisory" queue position field, which tells its backend "hey,
>> > you could skip as far as here if you want", but is not used for
>> > purposes of SLRU truncation.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
>
> pos = QUEUE_BACKEND_POS(i);
>
> /* Direct advancement for idle backends at the old head */
> if (pendingNotifies != NULL &&
> QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
> {
> QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
>
> If we have several notifying backends, it looks like only the first
> one will be able to do direct advancement here. Next notifying backend
> will fail on QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) as we don't
> wake up the listener and pos will be the same as it was for the first
> notifying backend.
Right.
> It seems that to accumulate direct advancement from
> several notifying backends we need to compare queueHeadBeforeWrite
> with advisoryPos here.
*** 0002-optimize_listen_notify-v20-alt1.txt:
* Fixed; compare advisoryPos with queueHeadBeforeWrite instead of pos.
> And we also need to advance advisoryPos to the
> listener's position after reading if advisoryPos falls behind.
* Fixed; set advisoryPos to max(max, advisoryPos) in the PG_FINALLY block.
* Also noticed Exec_ListenPreCommit didn't set advisoryPos to max
  for the first LISTEN; now fixed.
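The alt1 advisory-position logic, with the two fixes above, might look roughly like this. Types and names here are hypothetical simplifications (queue positions flattened to plain offsets, locking elided), not the patch's actual code:

```c
/* Hypothetical simplification: queue positions as flat offsets. */
typedef long long QueuePos;

typedef struct
{
    QueuePos    pos;            /* where the listener will read next */
    QueuePos    advisoryPos;    /* "you may skip ahead to here" */
} BackendState;

/*
 * Notifying backend, after writing and while holding NotifyQueueLock:
 * comparing advisoryPos (rather than pos) with the pre-write head lets
 * several notifiers in a row keep advancing an idle listener that never
 * woke up in between.
 */
static void
try_direct_advancement(BackendState *b,
                       QueuePos headBeforeWrite, QueuePos headAfterWrite)
{
    if (b->advisoryPos == headBeforeWrite)
        b->advisoryPos = headAfterWrite;
}

/*
 * Listener, at the end of reading (the PG_FINALLY fix): never let
 * advisoryPos fall behind the position we have actually read to.
 */
static void
finish_reading(BackendState *b, QueuePos readTo)
{
    b->pos = readTo;
    if (b->advisoryPos < readTo)
        b->advisoryPos = readTo;
}
```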
> Minute of brainstorming
>
> I also thought about a workload that probably frequently can be met.
> Let's say we have sequence of notifications:
>
> F F F T F F F T F F F T
>
> Here F - notification from the channel we don't care about and T - the opposite.
> It seems that after the first 'T' notification it will be more
> difficult for notifying backends to do 'direct advancement' as there
> will be some lag before the listener reads the notification and
> advances its position. Not sure if it's a problem, probably it depends
> on the intensity of notifications.
Hmm, I realize both the advisoryPos and donePos ideas share a problem:
they both require listening backends to wake up eventually anyway,
just to advance the 'pos'.
The holy grail would be to avoid this context-switching cost entirely,
and only wake up listening backends when they are actually
interested in the queued notifications. I think the third idea,
alt3, is the most promising in achieving this goal.
> But maybe we can use a bit more
> sophisticated data structure here? Something like a list of skip
> ranges. Every entry in the list is the range (pos1, pos2) that the
> listener can skip during the reading. So instead of advancing
> advisoryPos every notifying backend should add skip range to the list.
> Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> for example. Listeners on their side should clear the list before
> reading and skip all ranges from it. What do you think? Is it
> overkill?
Hmm, maybe, but I'm a bit wary of adding too much complexity.
Hopefully there is a simpler solution that avoids the need for this,
but if we can't find one, I'm happy to try the skip-ranges idea.
>> > Alternatively, split the queue pos
>> > into "this is where to read next" and "this is as much as I'm
>> > definitively done with", where the second field gets advanced at
>> > the end of asyncQueueReadAllNotifications. Not sure which
>> > view would be less confusing (in the end I guess they're nearly
>> > the same thing, differently explained).
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
>>
>
> IMHO it's a little bit more confusing than the first option. Two
> points I noticed:
>
> 1) We have a fast path in asyncQueueReadAllNotifications()
>
> if (QUEUE_POS_EQUAL(pos, head))
> {
> /* Nothing to do, we have read all notifications already. */
> return;
> }
>
> Should we update donePos here? It looks like donePos may never be
> updated without it.
*** 0002-optimize_listen_notify-v20-alt2.txt:
* Fixed; update donePos here
> 2) In SignalBackends()
>
> /* Signal backends that have fallen too far behind */
> lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
> QUEUE_POS_PAGE(pos));
>
> if (lag >= QUEUE_CLEANUP_DELAY)
> {
> pid = QUEUE_BACKEND_PID(i);
> Assert(pid != InvalidPid);
>
> QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
> pids[count] = pid;
> procnos[count] = i;
> count++;
> }
>
> Should we use donePos here as it is responsible for queue truncation now?
* Fixed; use donePos here
>> > A different line of thought could be to get rid of
>> > asyncQueueReadAllNotifications's optimization of moving the
>> > queue pos only once, per
>> >
>> > * (We could alternatively retake NotifyQueueLock and move the position
>> > * before handling each individual message, but that seems like too much
>> > * lock traffic.)
>> >
>> > Since we only need shared lock to advance our own queue pos,
>> > maybe that wouldn't be too awful. Not sure.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
>>
>
> Hmm, it seems we still have the race when in the beginning of
> asyncQueueReadAllNotifications we read pos into the local variable and
> release the lock. IIUC to avoid the race without introducing another
> field here, the listener needs to hold the lock until it updates its
> position so that the notifying backend cannot change it concurrently.
*** 0002-optimize_listen_notify-v20-alt3.txt:
* Fixed; the shared 'pos' is now only updated if the new position is ahead.
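The alt3 fix is essentially a monotonic update: the listener takes the lock per message, but only ever moves its shared position forward, so a concurrent notifier that has already advanced it cannot be dragged backwards by a stale local value. A minimal sketch with hypothetical names, locking elided:

```c
/* Hypothetical simplification: queue positions as flat offsets. */
typedef long long QueuePos;

/*
 * Called with NotifyQueueLock held (shared mode suffices, since each
 * backend only updates its own entry).  Only moves the shared position
 * forward: if a notifying backend advanced it concurrently, a stale
 * new_pos computed from an earlier snapshot is simply a no-op.
 */
static void
update_own_queue_pos(QueuePos *shared_pos, QueuePos new_pos)
{
    if (new_pos > *shared_pos)
        *shared_pos = new_pos;
}
```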
To me, it looks like alt3 is the winner in terms of simplicity, and is
also the winner in my ping-pong benchmark, due to avoiding context
switches more effectively than alt1 and alt2.
Eager to hear your thoughts!
/Joel
Attachments:
0001-optimize_listen_notify-v20.patch (application/octet-stream, 9.3 KB)
0002-optimize_listen_notify-v20-alt1.txt (text/plain, 6.1 KB)
0002-optimize_listen_notify-v20-alt2.txt (text/plain, 5.6 KB)
0002-optimize_listen_notify-v20-alt3.txt (text/plain, 7.6 KB)