Re: Optimize LISTEN/NOTIFY

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Matheus Alcantara" <matheusssilv97(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize LISTEN/NOTIFY
Date: 2025-10-11 06:43:51
Message-ID: 9eba307f-f2fb-48f0-9507-2e197f39ef9e@app.fastmail.com
Lists: pgsql-hackers

On Fri, Oct 10, 2025, at 20:46, Joel Jacobson wrote:
> On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
>> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
>>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>>> which surprises me a bit, and neither is the "grow the array" stanza.
>>
>>> I've added a test for the "grow the array" stanza.
>>
>>> The "already registered" case seems impossible to reach, since the
>>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>>
>> Maybe we should remove the check for "already registered" then,
>> or reduce it to an Assert? Seems pointless to check twice.
>>
>> Or thinking a little bigger: why are we maintaining the set of
>> channels-listened-to both as a list and a hash? Could we remove
>> the list form?
>
> Yes, it was indeed possible to remove the list form.
>
> Some functions got a bit more complex, but I think it's worth it since a
> single source of truth seems like an important design goal.
>
> This also made LISTEN faster when a backend is listening on plenty of
> channels, since we can now look up the channel in the hash instead of
> having to go through the list as before. The additional linear scan of
> the listenersArray didn't add any noticeable extra cost, even with
> thousands of listening backends for the channel.
>
> I also tried to keep listenersArray sorted and binary-search it, but
> even with thousands of listening backends, I couldn't measure any
> overall latency difference of LISTEN, so I kept the linear scan to keep
> it simple.
>
> In Exec_ListenCommit, I've now inlined code that is similar to
> IsListeningOn. I didn't want to use IsListeningOn, since it felt
> wasteful to do a dshash_find when we can instead just use
> dshash_find_or_insert to handle both cases.
>
> I also added a static int numChannelsListeningOn variable, to avoid the
> possibly expensive operation of going through the entire hash, so we
> can check `numChannelsListeningOn == 0` instead of the now removed
> `listenChannels == NIL`. It's of course critical to keep
> numChannelsListeningOn in sync, but I think it should be safe? It would
> of course be better to avoid this variable; maybe the extra cycles that
> would cost would be worth it?

In addition to the previously suggested optimizations, there is another
major one that seems doable, and that would be a great improvement for
workloads with large traffic differences between channels, i.e. some
low-traffic and some high-traffic.

I'm not entirely sure this approach is correct, though; I might have
misunderstood the guarantees of the heavyweight lock. My assumption is
that only one backend at a time can be running the code in
PreCommit_Notify after having acquired the heavyweight lock. If that is
not true, then this doesn't work. What worried me is the exclusive lock
we also take inside the same function; I don't see the point of it since
we're already holding the heavyweight lock, but maybe it's just there to
"allow deadlocks to be detected", like the comment says?

---

Patches:

* 0001-optimize_listen_notify-v14.patch:
Just adds additional test coverage of async.c

* 0002-optimize_listen_notify-v14.patch:
Adds the shared channel hash.
Unchanged since 0002-optimize_listen_notify-v13.patch.

* 0003-optimize_listen_notify-v14.patch:

Optimize LISTEN/NOTIFY by advancing idle backends directly

Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.

When a backend commits notifications, it now saves both the queue head
position before and after writing. In SignalBackends(), backends that
are at the old queue head and weren't marked for wakeup (meaning they
don't listen to any of the notified channels) are advanced directly to
the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and advance to the same position anyway.

The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.

---

Benchmark:

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=10
pgbench_patched: starting LISTEN/NOTIFY round-trip benchmark
pgbench_patched: round-trips per iteration: 10000
pgbench_patched: idle listeners added per iteration: 10

master:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33592.9              2278
            10              14251.1              1041
            20               9258.7              1367
            30               6144.2              2277
            40               4653.1              1690
            50               3780.7              2869
            60               3234.9              3215
            70               2818.9              3652
            80               2458.7              3219
            90               2203.1              3505
           100               1951.9              1739

0002-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33936.2               889
            10              30631.9              1233
            20              22404.7              7862
            30              19446.2              9539
            40              16013.3             13963
            50              14310.1             16983
            60              12827.0             21363
            70              11271.9             24775
            80              10764.4             28703
            90               9568.1             31693
           100               9241.3             32724

0003-optimize_listen_notify-v14.patch:

idle_listeners  round_trips_per_sec  max_latency_usec
             0              33236.8              1090
            10              34681.0              1338
            20              34530.4              1372
            30              34061.6              1339
            40              33084.5               913
            50              33847.5               955
            60              33675.8              1239
            70              28857.4             20443
            80              33324.9               786
            90              33612.3               758
           100              31259.2              7706

As we can see, with 0002 the ping-pong round-trips per second degrade
much more slowly than on master, but the wakeup of idle listening
backends still has to happen at some point. There are far fewer wakeups,
and they are staggered over time, but throughput still drops from 33k to
9k with 100 idle listening backends. With 0003, the round-trips per
second are sustained, unaffected by additional idle listening backends.

I've also attached the pgbench patch as a .txt in
pgbench-listen-notify-benchmark-patch.txt, since it's not part of this
patch set; it's just provided to help others verify the results.

/Joel

Attachment Content-Type Size
0001-optimize_listen_notify-v14.patch application/octet-stream 7.8 KB
0002-optimize_listen_notify-v14.patch application/octet-stream 29.8 KB
0003-optimize_listen_notify-v14.patch application/octet-stream 5.0 KB
pgbench-listen-notify-benchmark-patch.txt text/plain 9.3 KB
