From: Arseniy Mukhin <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com>
To: Rishu Bagga <rishu(dot)postgres(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Joel Jacobson <joel(at)compiler(dot)org>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>
Subject: Re: [PATCH] Write Notifications Through WAL
Date: 2025-10-08 10:07:42
Message-ID: CAE7r3MKV7oewo6EGGLPMKQvv74ujPurjjDBU=K8KH8CYJStXaw@mail.gmail.com
Lists: pgsql-hackers
Hi,
On Thu, Sep 25, 2025 at 4:12 AM Rishu Bagga <rishu(dot)postgres(at)gmail(dot)com> wrote:
>
> Hello all,
> creating a new thread here in continuation of [1], as the patch
> has changed direction significantly from the subject there.
There are a lot of improvements, thank you for the new version! I'm
still in the process of reviewing it, but I would like to share
several points:
>
> On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
> <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com> wrote:
> > I agree that listeners don't need to check if the notify transaction
> > was committed or not anymore, but it seems that listeners still need
> > to wait until the notify transaction is completed before sending its
> > notifications. I believe we should continue using XidInMVCCSnapshot as
> > the current version does.
>
> Since we hold NotifyQueueLock in exclusive mode while writing the compact
> entry in the notification queue, and don’t release it until after
> commit - it is impossible for a listener to come across a notification
> that is not yet committed. In the case of
> a backend crash after writing the notification into the queue and
> before committing,
> we would be in a critical section, so even this is not a scenario
> where we might read
> an uncommitted notification.
Yes, we hold the lock while we are writing the commit record to WAL,
but right after we have written the commit record other transactions
still don't see our transaction as committed and can't see the changes
it made besides notifications (inserts/updates/deletes). A few more
things need to happen before that. IIUC the moment when a fresh
snapshot shows our transaction as 'not in progress' is right after the
procarray is updated (the ProcArrayEndTransaction() call in
CommitTransaction()). So I think we still need the XidInMVCCSnapshot()
check in the reading path. Without it we can end up in a situation
where the listener gets the notification but still can't see the
changes the notifying transaction made, because the notification was
processed too early.
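
Just to illustrate where I think the check belongs, here is a minimal
sketch of the reading path (QueueEntry and DeliverNotification() are
made-up stand-ins, not names from your patch):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "utils/snapmgr.h"

/* hypothetical compact queue entry */
typedef struct QueueEntry
{
    TransactionId xid;      /* notifying transaction */
    XLogRecPtr  data_lsn;   /* where its notification data lives in WAL */
} QueueEntry;

extern void DeliverNotification(QueueEntry *qe);    /* hypothetical */

static bool
TryProcessQueueEntry(QueueEntry *qe, Snapshot snapshot)
{
    if (XidInMVCCSnapshot(qe->xid, snapshot))
    {
        /*
         * The commit record is already on disk, but
         * ProcArrayEndTransaction() has not run yet, so our snapshot
         * still treats the xid as in progress and would not see the
         * transaction's other changes.  Stop and retry later.
         */
        return false;
    }

    DeliverNotification(qe);    /* read data from WAL and send */
    return true;
}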
>
> > 3) I think we still need to check if the queue is full or not while
> > adding new entries and ereport if it is (entries are smaller now, but
> > we still can run out of space). And it seems to me that we should do
> > this check before entering the critical section, because in the
> > critical section it will result in PANIC. Master results in ERROR in
> > case of the full queue, so should we do the same here?
>
> Thanks for bringing this up - in the new patch I have added a check to make sure
> we have enough space in the queue before entering the critical section.
> The logic is as follows - in PreCommit_Notify, before writing the WAL
> for notify
> data, we do the following:
> - Take NotifyQueueLock EXCLUSIVE.
> - Look at the current head, tail, and how many entries are already
> reserved by other backends.
> - If adding one more would exceed:
> - the allowed page window (max_notify_queue_pages), or
> - the total entry cap (pages × entries per page),
> then ERROR out: “queue is full”.
>
> - If our reserved slot would be the last on the page, pre‑create the
> next page so commit won’t stall.
>
> - Increment reservedEntries (our claim) and drop the lock.
>
> I’ve included a TAP test to ensure we error out if we exceed the max
> notifications:
> src/test/modules/test_listen_notify/t/002_queue_full.pl
I think reserving slots is a great idea. But I have a question about
checking if the queue is full. Why does PreCommit_Notify need three
checks? Isn't it enough to simply check that we don't exceed
max_total_entries?
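
To make it concrete, this is roughly the single check I have in mind
(max_total_entries, usedEntries and reservedEntries are assumed names
for the patch's shared counters, not its actual fields):

#include "postgres.h"
#include "storage/lwlock.h"

/* assumed shared-memory counters, protected by NotifyQueueLock */
extern int    max_total_entries;  /* max_notify_queue_pages * entries per page */
extern uint64 usedEntries;        /* entries currently in the queue */
extern uint64 reservedEntries;    /* entries reserved but not yet written */

static void
asyncQueueReserveEntry(void)
{
    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);

    /* one comparison covers both the page window and the entry cap */
    if (usedEntries + reservedEntries + 1 > (uint64) max_total_entries)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("too many notifications in the NOTIFY queue")));

    reservedEntries++;
    LWLockRelease(NotifyQueueLock);
}

(The ereport(ERROR) while holding the lock is fine, since abort
processing releases LWLocks.)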
> > 4) I don't know, but It seems to me that XactLogCommitRecord is not
> > the right place for adding listen/notify logic, it seems to be only
> > about wal stuff.
>
> Hmm, I understand the concern, but we do need to ensure the commit WAL
> record and the notify queue entry are done atomically, and we need to
> hold
> the lock for the shortest time possible. So I think maybe we can
> make a tradeoff
> here for performance over code compartmentalization?
>
The xl_notify.notify_lsn part looks good to me, but, IMHO, we could
move the locking and the queue-insertion logic one level up into
RecordTransactionCommit() and wrap it around the XactLogCommitRecord()
call.
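
Something along these lines is what I'm picturing; both helpers are
hypothetical names, the only point being that they wrap the existing
XactLogCommitRecord() call from xact.c:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/lwlock.h"

/* from the patch's async.c (signature assumed) */
extern void asyncQueueAddCompactEntry(TransactionId xid, XLogRecPtr notify_lsn);

void
NotifyBeforeCommitRecord(void)
{
    /* serialize the commit order of notifying transactions */
    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
}

void
NotifyAfterCommitRecord(TransactionId xid, XLogRecPtr notify_lsn)
{
    /* the commit record has been inserted; publish the compact queue entry */
    asyncQueueAddCompactEntry(xid, notify_lsn);
    LWLockRelease(NotifyQueueLock);
}

RecordTransactionCommit() would call the first helper right before
XactLogCommitRecord() (only when there are pending notifications) and
the second one right after it, so xact.c stays in charge of the
ordering and XactLogCommitRecord() remains purely about building the
WAL record.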
> > 5) Do we want to use NotifyQueueLock as a lock that provides the
> > commit order? Maybe we can introduce another lock to avoid unnecessary
> > contantion? For example, if we use NotifyQueueLock, we block listeners
> > that want to read new notifications while we are inserting a commit
> > record, which seems unnecessary.
>
> I think we should stick to using this lock, as we write to the notify
> queue while holding this lock. We aren’t holding it for a particularly
> long time,
> so this should be okay. Additionally this allows listeners to assume
> entries in the queue are from committed transactions. In the current
> implementation we also use this lock for writing to the queue.
Yeah, probably you are right, we are not holding it for long, so maybe
it's not necessary to introduce something else here.
>
> On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote
>
> > With your patch, since the backends get the notification by reading
> > WAL records do we need to prevent WAL records that potentially have
> > unconsumed notification from being removed by the checkpointer? Or we
> > can save unconsumed notifications in WAL records to somewhere during
> > the checkpoint as we do for 2PC transactions.
>
> Yes we will need to prevent notify data in the WAL from being removed by
> checkpointer. I have come up with the following “WAL pinning” solution
> to make sure we
> do not remove necessary WALs:
>
> 1. Maintain a notify_data lsn array of MAX_BACKENDS for uncommitted
> notify data.
> It is possible that the LSN for the data of an uncommitted
> notification is lower than the min lsn across the committed entries.
> If we only look at the committed data, and remove WAL records below
> the min lsn there, and we eventually commit the uncommitted
> transaction, then we won’t be able to read its notification data.
>
> When we enter PreCommit_Notify,
> - grab NotifyQueueLock in exclusive mode,
> - Log the notification record and get the notify data lsn.
> - Find a slot in theUncommittedNotifies array, and record the notify
> data lsn, along with xid.
> - If we come across any entries where the transaction is no longer in
> progress, set their slots as not in use, and the lsn as
> InvalidXlogRecPtr.
> - Release the NotifyQueueLock
>
> 2. Maintain a min notify_data LSN array of length max_notify_pages, to track the
> Minimum lsn and corresponding xact needed for notify data across
> currently active pages in the notify queue.
>
> - When we advance the tail past a page, we set the min lsn for the
> page to be InvalidXlogRecPtr.
> - When we add an entry into the notify queue, check against the current min for
> the page, and if lower, then update the min for the page.
> - Remove the current lsn / xid from the Uncommitted LSNs list.
>
> 3. In RemoveOldXlogFiles (where we recycle WAL logs)
>
> - call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
> following:
> - grab NotifiyQueueLock in exclusive mode
> - check the min lsn
> across both committed and uncommitted in flight notifications
> - store the min lsn needed in notify_oldest
> - prune any invalid entries in both lists while iterating through them.
>
> - then we get the segment of notify_oldest, and if it is > 0, and less
> than the currently
> computed cutoff segment, set cutoff_seg = oldest_notify_seg - 1, so we maintain
> the segment with the oldest needed notify data lsn.
>
> I’ve included a TAP test that ensures we pin necessary WALs:
> src/test/modules/test_listen_notify/t/003_wal_pin_test.pl.
>
I don't have enough knowledge to say whether this is the right
approach to preventing WAL truncation while there are unread
notifications. But I noticed two things about the current
implementation. Not sure if it's worth doing anything about them
before you know the approach is right in general, but anyway, here are
the two points:

1) We hold NotifyQueueLock while adding the notification data record
to WAL. IIUC one of the points of using WAL here was that it lets us
write notification data in parallel, as WAL is good for parallel
writes. But with the lock it seems like we don't take advantage of
that. IIUC we hold the lock here to prevent truncation of the new
record after we have written it and before we save its LSN. Is this
correct? Maybe we can solve it differently? For example, before
writing the notification data record we could save the current LSN
(something like XactLastRecEnd) in UncommittedNotifies, so we can be
sure that anything after it will not be truncated, including the
record we are about to write. This way we don't need to hold the lock.
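
A minimal sketch of that ordering, assuming a per-backend
UncommittedNotifies slot and with LogNotificationData() standing in
for the patch's actual WAL insert of the notification data:

#include "postgres.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "miscadmin.h"
#include "storage/lwlock.h"
#include "storage/procnumber.h"

/* assumed per-backend slot layout, not the patch's actual struct */
typedef struct UncommittedNotify
{
    TransactionId xid;
    XLogRecPtr  lsn;    /* lower bound for this backend's notify data */
} UncommittedNotify;

extern UncommittedNotify *UncommittedNotifies;  /* MaxBackends slots */
extern XLogRecPtr LogNotificationData(void);    /* stand-in for the patch's WAL insert */

static XLogRecPtr
WriteNotificationDataWithoutQueueLock(void)
{
    /*
     * Publish a conservative lower bound *before* inserting the
     * notification data record: WAL at or after this LSN must not be
     * truncated, which covers the record we are about to write.
     */
    XLogRecPtr  lower_bound = GetXLogInsertRecPtr();

    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
    UncommittedNotifies[MyProcNumber].xid = GetCurrentTransactionId();
    UncommittedNotifies[MyProcNumber].lsn = lower_bound;
    LWLockRelease(NotifyQueueLock);

    /* the WAL insert itself no longer needs NotifyQueueLock */
    return LogNotificationData();
}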
2) We iterate over all entries in UncommittedNotifies every time we
want to add or remove data. What do you think about using MyProcNumber
as an array index instead of iterating, as we do with listener
positions for instance? It seems possible because a backend can't have
more than one uncommitted LSN. Also, maybe backends could remove their
entries from UncommittedNotifies in AtAbort_Notify(), the same way
committed transactions do it in asyncQueueAddCompactEntry()? This way
AsyncNotifyOldestRequiredLSN() would not need to bother with cleaning
UncommittedNotifies and checking transaction statuses.
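
Roughly what I have in mind (again, the struct and helper names are
mine, not the patch's):

#include "postgres.h"
#include "access/transam.h"
#include "access/xlogdefs.h"
#include "miscadmin.h"
#include "storage/lwlock.h"
#include "storage/procnumber.h"

/* same assumed per-backend slot as in the previous sketch */
typedef struct UncommittedNotify
{
    TransactionId xid;
    XLogRecPtr  lsn;
} UncommittedNotify;

extern UncommittedNotify *UncommittedNotifies;  /* MaxBackends slots */

/*
 * Called from AtAbort_Notify(), and from asyncQueueAddCompactEntry()
 * once the compact entry has been published, so that
 * AsyncNotifyOldestRequiredLSN() never sees stale slots and never has
 * to check transaction status.
 */
static void
ClearMyUncommittedNotify(void)
{
    LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
    UncommittedNotifies[MyProcNumber].xid = InvalidTransactionId;
    UncommittedNotifies[MyProcNumber].lsn = InvalidXLogRecPtr;
    LWLockRelease(NotifyQueueLock);
}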
Thank you!
Best regards,
Arseniy Mukhin