Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: lazy_scan_heap() should release lock on buffer before vacuuming FSM
Date: 2023-11-15 00:15:29
Message-ID: 20231115001529.hmmibs7h7hbiazsj@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-11-14 07:46:10 -0500, Melanie Plageman wrote:
> > FreeSpaceMapVacuumRange()'s comment says:
> > * As above, but assume that only heap pages between start and end-1 inclusive
> > * have new free-space information, so update only the upper-level slots
> > * covering that block range. end == InvalidBlockNumber is equivalent to
> > * "all the rest of the relation".
> >
> > So FreeSpaceMapVacuumRange(..., blkno) will not actually process the "effects"
> > of the RecordPageWithFreeSpace() above it - which seems confusing.
>
> Ah, so shall I pass blkno + 1 as end?

I think there's no actual overflow danger, because MaxBlockNumber + 1 is
InvalidBlockNumber, which scans the rest of the relation (i.e. exactly the
intended block). Perhaps worth noting?

> > =# DELETE FROM copytest_0;
> > =# VACUUM (VERBOSE) copytest_0;
> > ...
> > INFO: 00000: table "copytest_0": truncated 146264 to 110934 pages
> > ...
> > tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention
> > ...
> >
> > A bit of debugging later I figured out that this is due to the background
> > writer. If I SIGSTOP bgwriter, the whole relation is truncated...
>
> That's a bit sad. But isn't that what you would expect bgwriter to do?

I mainly noted this so that if somebody else tries this they don't also spent
30 minutes being confused. I'm not quite sure what a good solution here would
be.

> Write out dirty buffers? It doesn't know that those pages consist of
> only dead tuples and that you will soon truncate them.

I think the issue more that it's feels wrong that a pin by bgwriter blocks
vacuum cleaning up. I think the same happens with on-access pruning - where
it's more likely for bgwriter to focus on those pages. Checkpointer likely
causes the same. Even normal backends can cause this while writing out the
page.

It's not like bgwriter/checkpointer would actually have a problem if the page
were pruned - the page is locked while being written out and neither keep
pointers into the page after unlocking it again.

At this point I started to wonder if we should invent a separate type of pin
for a buffer undergoing IO. We basically have the necessary state already:
BM_IO_IN_PROGRESS. We'd need to look at that in some places, e.g. in
InvalidateBuffer(), instead of BUF_STATE_GET_REFCOUNT(), we'd need to also
look at BM_IO_IN_PROGRESS.

Except of course that that'd not change anything -
ConditionalLockBufferForCleanup() locks the buffer conditionally, *before*
even looking at the refcount and returns false if not. And writing out a
buffer takes a content lock. Which made me realize that
"tuples missed: 5848 dead from 89 pages not removed due to cleanup lock contention"

is often kinda wrong - the contention is *not* cleanup lock specific. It's
often just plain contention on the lwlock.

Perhaps we should just treat IO_IN_PROGRESS buffers differently in
lazy_scan_heap() and heap_page_prune_opt() and wait?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-11-15 00:22:31 Re: meson documentation build open issues
Previous Message Jeff Davis 2023-11-15 00:13:49 Re: Why do indexes and sorts use the database collation?