| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
| Cc: | Melanie Plageman <melanieplageman(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
| Subject: | Re: Buffer locking is special (hints, checksums, AIO writes) |
| Date: | 2026-03-31 22:05:46 |
| Message-ID: | dsmizugtxgrp7zgkndnr6gl5sszy7ckbjmffsacfgstajrpjid@kjoodzobjcex |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On 2026-03-31 19:02:33 +0300, Yura Sokolov wrote:
> 27.03.2026 23:00, Andres Freund wrote:
> > On 2026-03-25 18:35:55 -0400, Andres Freund wrote:
> >> Running it through valgrind and then will work on reading through one more
> >> time and pushing them.
> >
> > And done.
> >
> > Phew, this project took way longer than I'd though it'd take.
>
> In addition to bug with BM_IO_ERROR [1] , I found race condition in
> PinBuffer in this lines of code:
>
> if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID)))
> return false;
>
> /*
> * We're not allowed to increase the refcount while the buffer
> * header spinlock is held. Wait for the lock to be released.
> */
> if (old_buf_state & BM_LOCKED)
> old_buf_state = WaitBufHdrUnlocked(buf);
>
> While we waited for buffer header for being unlocked, it may become
> invalid, isn't it?
> Therefore, check related to skip_if_not_valid have to happen after waiting.
Yea, that does seem wrong. Not sure how it ended up that way.
I think it may be better to add a continue after the WaitBufHdrUnlocked(), so
that we restart the loop, rather than moving the skip_if_not_valid check.
> ....
>
> Another question: previously we had to wait for buffer for being unlocked
> because UnlockBufHdr wrote to buf->state unconditionally, therefore our pin
> increment could be lost.
> Now UnlockBufHdr and UnlockBufHdrExt does proper atomic operations and
> preserves concurrent changes. Are we still need to wait?
Yes.
> Most of time PinBuffer is called protected by BufTable's partition LWLock,
> therefore buffer may not be changed in dramatic way.
I don't think the partition locks are sufficient protection for everything. We
have a few places in the code that want to be able to modify the buffer state
depending on whether the buffer is already pinned, and I don't think all of
them currently hold the relevant buffer mapping partition's lock. If pinning
were not to wait for an existing header lock, such checks would not easily be
doable.
Perhaps we could fix all the relevant places by acquiring the partition lock
in a few more places. But I think that'd be going in exactly the opposite
direction we should to. The partition locks are quite contended locks and we
should work on getting rid of them eventually. Building them into the
protection model seems quite unwise.
I think many of the places that currently do rely on the buffer header
spinlock can be converted to CAS loops.
I'm not really sure how much that's worth though - the WaitBufHdrLocked() in
PinBuffer() is pretty hard to hit in realistic workloads. What would be really
nice, is to be able to replace the CAS() with an atomic add (since those are
considerably faster), but that's not really possible regardless of the need
for WaitBufHdrLocked(), because we can't just add BUF_USAGECOUNT_ONE, as that
would allow increasing the usage count too far.
I would like to eventually narrow the definition of the buffer header spinlock
to just be about the "identity" of the buffer, which then would only be needed
by things like DropRelationBuffers() and when changing the buffer's identity.
> But call in ReadRecentBuffer is the exception. It is not protected by
> partition lock and have to make additional checks. That is why you
> introduced skip_if_not_valid.
>
> Does optimization of ReadRecentBuffer pays for WaitBufHdrUnlocked?
As mentioned above, I don't think it's just ReadRecentBuffer that relies on
the buffer header spinlock preventing new pins.
Greetings,
Andres Freund
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Matheus Alcantara | 2026-03-31 22:10:08 | Re: Add custom EXPLAIN options support to auto_explain |
| Previous Message | Roberto Mello | 2026-03-31 21:54:51 | Re: pg_publication_tables: return NULL attnames when no column list is specified |