Re: Optimize LISTEN/NOTIFY

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joel Jacobson <joel(at)compiler(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize LISTEN/NOTIFY
Date: 2025-10-15 03:19:47
Message-ID: 1F7227F5-C33D-4E2C-8511-33F1468590D0@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Oct 15, 2025, at 05:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
>
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
>
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway. (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
>
> It would be good if the patch's comments made these points ...
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
>

I used to think “direct advancement” was a good idea. After reading Tom’s explanation, and reading v16 again carefully, now I also consider it’s adding complexity and could be fragile.

I just composed an example of race condition, please see if it is valid.

Because recoding queueHeadBeforeWrite and queueHeadAfterWrite happen in PreCommit_Notify() and checking them happens in AtCommit_Notify(), there is an interval in between, something may happen.

Say a listener A, it’s head pointing to 1.

And current QueueHead is 1.

Now two notifiers B and C are committing:
* B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it records headBeforeWrite = 1 and writes to 3, and records headAfterWrite = 3.
* Now QueueHead is 3.
* C enters PreCommit_Notify(), it records headBeforeWrite = 3 and writes to 5, and records headAfterWrite = 5.
* Now QueueHead is 5
* C starts to run AtCommit_Notify(), as A’s head is 1, doesn’t equal to C’s headBeforeWrite, C won’t advance A’s head.
* A starts to run AtCommit_Notify(), A’s head equals to B’s beforeHeadWrite, B will advance A’s head to 3.
* At this time, QueueHead is 5, and A’s head is 3, so “direct advancement” will never work for A until A wakes up next time.

I am brainstorming. Maybe we can use a simpler strategy. If a backend’s queue lag exceeds a threshold, then wake it up. This solution is simpler and reliable, also reducing the total wake-up count.

>
>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.
>
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
>

I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.

In PreCommit_Notify(), we can maintain a local hash table to record pending nofications’ channel names. dahash also supports hash table in local memory.

Then in SignalBackends(), we no longer need GetPendingNotifyChannels(), we can just iterate all keys of the local channel name hash.

And the local static numChannelsListeningOn is also not needed. We can get the count from the local hash.

WRT to v6, I got a few new comments:

1 - 0002
```
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
```

In this comment, you refer to “channelHash” and “the shared channel hash table”, they are the same thing, but easy to make readers to misunderstand.

2 - 0002
```
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;

/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
```

listenChannels is only used within the “if”, so it’s definition can be moved into the “if”.

3 - 0002
```
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
```

I wonder why this check is needed. If queue_length is 0, can we return immediately from SignalBackends()?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2025-10-15 03:22:59 Re: Optimize LISTEN/NOTIFY
Previous Message Aya Iwata (Fujitsu) 2025-10-15 02:48:43 RE: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE