[PATCH] Write Notifications Through WAL

From: Rishu Bagga <rishu(dot)postgres(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Joel Jacobson <joel(at)compiler(dot)org>, Arseniy Mukhin <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>
Subject: [PATCH] Write Notifications Through WAL
Date: 2025-09-25 01:12:21
Message-ID: CAK80=jjvgWJL5EdzH_Bxc37gn53PJaB5sDXB3VPfJuN7G1xL5Q@mail.gmail.com
Lists: pgsql-hackers

Hello all,

I'm creating a new thread here in continuation of [1], as the patch has
changed direction significantly from the subject of that thread.

TLDR:


The attached patch aims to improve performance of the NOTIFY code path
with concurrent notifying transactions, by eliminating the need for a
global lock to be held from PreCommit_Notify to the end of the transaction.

We accomplish this with an approach inspired by Tom Lane's suggestion
in [2]. The high level is as follows:

Write Path:

1. Write notification data into WAL in PreCommit_Notify. Store lsn in MyProc.
2. Acquire NotifyQueueLock in exclusive mode.
3. Write a compact entry into the notify queue, referencing the lsn from step 1.
4. Commit the transaction, writing the notify lsn from step 1 inside
the commit record.
5. Release NotifyQueueLock.
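
In code, the write path would look roughly like this (a minimal sketch;
the record type XLOG_NOTIFY_DATA, the helpers, and the notifyDataLSN
field are illustrative names, not necessarily what the patch uses):

/* Sketch of the write path; identifiers are illustrative. */
void
PreCommit_Notify(void)
{
    char   *data;
    int     len;

    if (pendingNotifies == NULL)
        return;

    /* 1. Serialize the pending notifications into their own WAL record. */
    data = serialize_pending_notifies(&len);    /* hypothetical helper */
    XLogBeginInsert();
    XLogRegisterData(data, len);
    MyProc->notifyDataLSN = XLogInsert(RM_ASYNC_ID, XLOG_NOTIFY_DATA);

    /* 2. Take the queue lock; it stays held until after commit. */
    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);

    /* 3. Add a compact queue entry that only references that LSN. */
    asyncQueueAddCompactEntry(MyProc->notifyDataLSN);

    /*
     * 4. RecordTransactionCommit() later embeds notifyDataLSN in the
     * commit record; 5. AtCommit_Notify() then releases NotifyQueueLock
     * and signals listening backends.
     */
}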

Read Path:

Readers scan the notification queue, where they see just the
notify_data_lsn, and then fetch the actual notification data from the WAL.
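
A listener's side could look roughly like this (again only a sketch,
using the existing xlogreader machinery from access/xlogreader.h and
access/xlogutils.h; how the payload is deserialized and delivered is
hand-waved):

/* Sketch: fetch the notification data behind one compact queue entry. */
static void
fetch_notify_payload(XLogRecPtr notify_data_lsn)
{
    XLogReaderState *reader;
    XLogRecord *record;
    char       *errormsg;

    reader = XLogReaderAllocate(wal_segment_size, NULL,
                                XL_ROUTINE(.page_read = &read_local_xlog_page,
                                           .segment_open = &wal_segment_open,
                                           .segment_close = &wal_segment_close),
                                NULL);
    XLogBeginRead(reader, notify_data_lsn);
    record = XLogReadRecord(reader, &errormsg);
    if (record == NULL)
        elog(ERROR, "could not read notification WAL record: %s",
             errormsg ? errormsg : "unknown failure");

    /*
     * Deserialize the channel/payload pairs from XLogRecGetData(reader)
     * and deliver them to the client, much as async.c does today when it
     * reads them out of the SLRU queue.
     */

    XLogReaderFree(reader);
}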

(End TLDR)

Addressing the concerns from the previous patch, and explaining
the latest patch:

On Wed, Sep 10, 2025 at 6:05 AM Matheus Alcantara
<matheusssilv97(at)gmail(dot)com> wrote:
> There is also some other tests failing, like isolation, regress and
> others.
These are passing in the new patch, after removing the
extraneous SignalBackends call.

On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
<arseniy(dot)mukhin(dot)dev(at)gmail(dot)com> wrote:
> I agree that listeners don't need to check if the notify transaction
> was committed or not anymore, but it seems that listeners still need
> to wait until the notify transaction is completed before sending its
> notifications. I believe we should continue using XidInMVCCSnapshot as
> the current version does.

Since we hold NotifyQueueLock in exclusive mode while writing the compact
entry into the notification queue, and don't release it until after
commit, it is impossible for a listener to come across a notification
that is not yet committed. If a backend were to crash after writing the
notification into the queue but before committing, we would already be in
a critical section, so even that is not a scenario where we might read an
uncommitted notification.

> 2) Now we have two calls to SignalBackends (is it intentional?). The
> first in AtCommit_Notify() which seems correct to me. The second in
> XactLogCommitRecord() seems too early, because other backends can't
> process new notifications at this point (if the point about
> XidInMVCCSnapshot is correct). And there is Assert failure (Matheus'
> report is about it).

Fixed this in the new patch (removed the SignalBackends call in xact.c).

> 3) I think we still need to check if the queue is full or not while
> adding new entries and ereport if it is (entries are smaller now, but
> we still can run out of space). And it seems to me that we should do
> this check before entering the critical section, because in the
> critical section it will result in PANIC. Master results in ERROR in
> case of the full queue, so should we do the same here?

Thanks for bringing this up - in the new patch I have added a check to make sure
we have enough space in the queue before entering the critical section.

The logic is as follows. In PreCommit_Notify, before writing the WAL
record for the notify data, we do the following (a sketch in code follows
the list):

- Take NotifyQueueLock in exclusive mode.
- Look at the current head, tail, and how many entries are already
  reserved by other backends.
- If adding one more would exceed:
  - the allowed page window (max_notify_queue_pages), or
  - the total entry cap (pages × entries per page),
  then ERROR out: “queue is full”.
- If our reserved slot would be the last on the page, pre-create the
  next page so commit won't stall.
- Increment reservedEntries (our claim) and drop the lock.
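
Something along these lines (a sketch; the entries-per-page math mirrors
what async.c already does, while the reservedEntries counter and the
helper names are illustrative):

/* Sketch of the pre-commit space check; identifiers are approximate. */
static void
asyncQueueReserveEntry(void)
{
    int64   used_pages;
    int64   max_entries;
    int64   occupied;

    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);

    used_pages = asyncQueuePagesUsed();     /* distance from tail to head */
    max_entries = (int64) max_notify_queue_pages * QUEUE_ENTRIES_PER_PAGE;
    occupied = asyncQueueEntryCount() + asyncQueueControl->reservedEntries;

    /* Refuse here, before the critical section, so we ERROR, not PANIC. */
    if (used_pages >= max_notify_queue_pages || occupied + 1 > max_entries)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("too many notifications in the NOTIFY queue")));

    /*
     * If our slot would be the last one on its page, pre-create the next
     * SLRU page now so the commit path never has to extend the queue.
     */
    if (asyncQueueSlotIsLastOnPage(occupied + 1))
        asyncQueuePrecreateNextPage();

    asyncQueueControl->reservedEntries++;   /* our claim on a slot */

    LWLockRelease(NotifyQueueLock);
}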

I've included a TAP test to ensure we error out if we exceed the maximum
number of notifications:
src/test/modules/test_listen_notify/t/002_queue_full.pl

> 4) I don't know, but It seems to me that XactLogCommitRecord is not
> the right place for adding listen/notify logic, it seems to be only
> about wal stuff.

Hmm, I understand the concern, but we do need to ensure the commit WAL
record and the notify queue entry are written atomically, and we need to
hold the lock for the shortest time possible. So I think maybe we can
make a tradeoff here, favoring performance over code compartmentalization?

> 5) Do we want to use NotifyQueueLock as a lock that provides the
> commit order? Maybe we can introduce another lock to avoid unnecessary
> contantion? For example, if we use NotifyQueueLock, we block listeners
> that want to read new notifications while we are inserting a commit
> record, which seems unnecessary.

I think we should stick to using this lock, as we write to the notify
queue while holding it. We aren't holding it for a particularly long
time, so this should be okay. Additionally, this allows listeners to
assume entries in the queue are from committed transactions.

In the current implementation we also use this lock for writing to the
queue.

> 6) We have fixed size queue entries, so I think we don't need this
> "padding" logic at the end of the page anymore, because we know how
> many entries we can have on each page.

Thanks, I've removed this in the new patch.

On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> With your patch, since the backends get the notification by reading
> WAL records do we need to prevent WAL records that potentially have
> unconsumed notification from being removed by the checkpointer? Or we
> can save unconsumed notifications in WAL records to somewhere during
> the checkpoint as we do for 2PC transactions.

Yes, we will need to prevent notify data in the WAL from being removed by
the checkpointer. I have come up with the following “WAL pinning” solution
to make sure we do not remove WAL segments that are still needed:

1. Maintain an array of notify_data LSNs, sized MAX_BACKENDS, for
uncommitted notify data.

It is possible that the LSN for the data of an uncommitted notification
is lower than the min LSN across the committed entries. If we only looked
at the committed data and removed WAL records below the min LSN there,
and the uncommitted transaction eventually committed, we would not be
able to read its notification data.

When we enter PreCommit_Notify (sketched below):
- grab NotifyQueueLock in exclusive mode,
- log the notification record and get the notify data LSN,
- find a slot in the UncommittedNotifies array, and record the notify
  data LSN along with the xid,
- if we come across any entries whose transaction is no longer in
  progress, mark their slots as not in use and set their LSN to
  InvalidXLogRecPtr,
- release NotifyQueueLock.
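
A sketch of that bookkeeping (the UncommittedNotifies layout and the
helper name are illustrative, not the patch's exact identifiers):

typedef struct UncommittedNotify
{
    bool            in_use;
    TransactionId   xid;
    XLogRecPtr      data_lsn;
} UncommittedNotify;

/* Shared array of MAX_BACKENDS entries, protected by NotifyQueueLock. */
static UncommittedNotify *UncommittedNotifies;

static void
asyncRecordUncommittedLSN(XLogRecPtr data_lsn)
{
    int     free_slot = -1;

    /* Caller already holds NotifyQueueLock in exclusive mode. */
    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        UncommittedNotify *slot = &UncommittedNotifies[i];

        /* Opportunistically reap entries whose transaction is gone. */
        if (slot->in_use && !TransactionIdIsInProgress(slot->xid))
        {
            slot->in_use = false;
            slot->data_lsn = InvalidXLogRecPtr;
        }
        if (!slot->in_use && free_slot < 0)
            free_slot = i;
    }

    Assert(free_slot >= 0);     /* each backend uses at most one slot */
    UncommittedNotifies[free_slot].in_use = true;
    UncommittedNotifies[free_slot].xid = GetCurrentTransactionId();
    UncommittedNotifies[free_slot].data_lsn = data_lsn;
}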

2. Maintain a min notify_data LSN array of length max_notify_queue_pages,
to track the minimum LSN (and the corresponding xact) needed for notify
data across the currently active pages in the notify queue (sketched
below).

- When we advance the tail past a page, set the min LSN for that page to
  InvalidXLogRecPtr.
- When we add an entry to the notify queue, check it against the current
  min for the page, and if it is lower, update the min for the page.
- Remove the current LSN / xid from the uncommitted LSNs list.
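
Roughly (again a sketch; pageMinLSN and the helper names are illustrative):

/*
 * pageMinLSN has one slot per active queue page; pageMinLSN[page] is the
 * oldest notify-data LSN still referenced from that page, or
 * InvalidXLogRecPtr when the page is unused.
 */
static void
asyncQueueNoteCommittedLSN(int page, XLogRecPtr data_lsn, TransactionId xid)
{
    /* Caller holds NotifyQueueLock in exclusive mode. */
    if (XLogRecPtrIsInvalid(pageMinLSN[page]) || data_lsn < pageMinLSN[page])
        pageMinLSN[page] = data_lsn;

    /*
     * The data is now reachable from a committed queue entry, so it no
     * longer needs to be tracked in the uncommitted-LSN array.
     */
    asyncForgetUncommittedLSN(xid);
}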

3. In RemoveOldXlogFiles (where we recycle WAL segments):

- Call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
  following:
  - grab NotifyQueueLock in exclusive mode,
  - check the min LSN across both committed and uncommitted in-flight
    notifications,
  - store the min LSN needed in notify_oldest,
  - prune any invalid entries in both lists while iterating through them.

- Then we get the segment of notify_oldest, and if it is > 0 and less
  than the currently computed cutoff segment, set
  cutoff_seg = oldest_notify_seg - 1, so we keep the segment containing
  the oldest needed notify data LSN.
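
Inside RemoveOldXlogFiles() the adjustment could look roughly like this
(a sketch; it assumes segments up to and including the cutoff are the
ones being recycled, and AsyncNotifyOldestRequiredLSN is the new function
described above):

    XLogRecPtr  notify_oldest;
    XLogSegNo   oldest_notify_seg;

    AsyncNotifyOldestRequiredLSN(&notify_oldest);
    if (!XLogRecPtrIsInvalid(notify_oldest))
    {
        /* The segment that still holds needed notify data must survive. */
        XLByteToSeg(notify_oldest, oldest_notify_seg, wal_segment_size);
        if (oldest_notify_seg > 0 && oldest_notify_seg - 1 < cutoff_seg)
            cutoff_seg = oldest_notify_seg - 1;
    }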

I've included a TAP test that ensures we pin the necessary WAL segments:
src/test/modules/test_listen_notify/t/003_wal_pin_test.pl

PERFORMANCE:

Running the following (pgbench_transaction_notify.sql is attached):

pgbench -c 100 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres

I consistently see 60k - 70k TPS. Without the patch, the TPS is ~30k.

Additionally, with

pgbench -c 99 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres

and 1 listener, we see similar TPS and no missing-WAL errors, so the
LISTEN and WAL retention logic seems correct.

> Also, could you add this patch to the next commit fest if not yet?

Will do!

Thanks for all the feedback on this patch - and apologies for the delay
in coming up with the new patch, as WAL retention was a bit more
complicated than I initially imagined it to be.

The next steps for me are to also test replication and recovery of
notifications.

Please let me know what I need to do to get this into a committable state.

Thanks,
Rishu

[1] https://www.postgresql.org/message-id/flat/CAK80%3DjiWjMAfjwJymPoG3%3DFNE%3DhHygn5dXaoax1BWQT1rJrCCw%40mail.gmail.com

[2] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us

Attachment Content-Type Size
notify-through-wal-v5.patch application/octet-stream 73.6 KB
pgbench_transaction_notify.sql application/octet-stream 94 bytes
