Re: BRIN FSM vacuuming questions

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BRIN FSM vacuuming questions
Date: 2018-04-02 20:32:27
Message-ID: 20180402203227.ri6hazqtke6u7oa6@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> I noticed that several places in brin_pageops.c do this sort of thing:
>
> if (extended)
> {
> Assert(BlockNumberIsValid(newblk));
> RecordPageWithFreeSpace(idxrel, newblk, freespace);
> FreeSpaceMapVacuum(idxrel);
> }
>
> ie they put *one* page into the index's free space map and then invoke a
> scan of the index's entire FSM to ensure that that info is bubbled up to
> the top immediately. I wonder how carefully this was thought through.
> We don't generally think that it's worth doing an FSM vacuum for a single
> page worth of free space.

Yeah, there was no precedent for doing that, but it was actually
intentional: there were enough cases where we would extend the relation,
only to have the operation fail for unrelated reasons (things like a
concurrent heap insertion), so it was possible to lose free space very
fast, and since BRIN indexes are slow to grow it would become
noticeable. We could have waited for FreeSpaceMapVacuum to occur
naturally, but this would cause the index to bloat until next vacuum,
which could happen a long time later. Also, the idea behind BRIN is
that the indexed tables are very large, so vacuuming is infrequent.

I also considered the fact that we can save the newly acquired page in
relcache's "target block", but I was scared of relying just on that
because AFAIR it's easily lost on relcache invalidations.

> We could make these operations somewhat cheaper, in the wake of 851a26e26,
> by doing
>
> - FreeSpaceMapVacuum(idxrel);
> + FreeSpaceMapVacuumRange(idxrel, newblk, newblk + 1);
>
> so that only the relevant subtree of the FSM gets scanned. But it seems
> at least as plausible to just drop the FreeSpaceMapVacuum calls entirely
> and let the next VACUUM fix up the map, like the entire rest of the
> system does. This is particularly true because the code is inconsistent:
> some places that do RecordPageWithFreeSpace follow it up with an immediate
> FreeSpaceMapVacuum, and some don't.

Hmm I analyzed the callsites carefully to ensure it wasn't possible to
bail out without vacuuming in the normal corner cases. Maybe you
spotted some gap in my analysis -- or worse, maybe I had to change
nearby code for unrelated reasons and broke it afterwards inadvertently.

> Or, perhaps, there's actually some method behind this madness and there's
> good reason to think that FreeSpaceMapVacuum calls in these specific
> places are worth the cost? If so, a comment documenting the reasoning
> would be appropriate.

Range-vacuuming the FSM seems a good idea for these places -- no need to
trawl the rest of the tree. I'm not too excited about dropping the FSM
vacuuming completely and waiting till regular VACUUM. (Why do we call
this FSM operation "vacuum"? It seems pretty confusing.)

I agree that brin_getinsertbuffer should have a better comment to
explain the intention (and that routine seems to carry the most guilt in
failing to do the FSM vacuuming).

> I'm also finding this code in brin_page_cleanup a bit implausible:
>
> /* Measure free space and record it */
> freespace = br_page_get_freespace(page);
> if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf)))
> {
> RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);
> return true;
> }
>
> because of the fact that GetRecordedFreeSpace will report a rounded-down
> result owing to the limited resolution of the FSM's numbers. If we
> assume that the result of br_page_get_freespace is always maxaligned,
> then by my math there's a 75% chance (on maxalign-8 machines; more if
> maxalign is 4) that this will think there's a discrepancy even when
> there is none. It seems questionable to me that it's worth checking
> GetRecordedFreeSpace at all. If it really is, we'd need to get the
> recorded free space, invoke RecordPageWithFreeSpace, and then see if
> the result of GetRecordedFreeSpace has changed in order to decide whether
> anything really changed. But that seems overly complicated, and again
> unlike anything done elsewhere. Also, the result value is used to
> decide whether FreeSpaceMapVacuum is needed, but I'm unconvinced that
> it's a good idea to skip said call just because the leaf FSM values did
> not change; that loses the opportunity to clean up any upper-level FSM
> errors that may exist.
>
> In short, I'd suggest making this RecordPageWithFreeSpace call
> unconditional, dropping the result value of brin_page_cleanup,
> and doing FreeSpaceMapVacuum unconditionally too, back in
> brin_vacuum_scan. I don't see a reason for brin to be unlike
> the rest of the system here.

As I recall, this was just trying to save unnecessary FSM updates, and
failed to make the connection with lossiness in GetRecordedFreeSpace.
But since FSM updates are quite cheap anyway (since they're not WAL
logged), I agree with your proposed fix.

> Lastly, I notice that brin_vacuum_scan's loop is coded like
>
> for (blkno = 0; blkno < RelationGetNumberOfBlocks(idxrel); blkno++)
> {
> ...
> }
>
> which means we're incurring an lseek kernel call for *each iteration*
> of the loop. Surely that's pretty inefficient.

Oops, this is just a silly oversight.

> Is there an intention here to be sure we examine every page even in
> the face of concurrent extensions? If so, we could code it like this:
>
> nblocks = RelationGetNumberOfBlocks(idxrel);
> for (blkno = 0; blkno < nblocks; blkno++)
> {
> ...
> if (blkno == nblocks-1)
> nblocks = RelationGetNumberOfBlocks(idxrel);
> }
>
> but I'd just as soon skip the recheck unless there's a really strong
> reason for it. Even with that, there's no guarantee somebody hasn't added
> another page before we ever get out of the function, so I'm unclear
> on what the point is.

Yeah, as I recall, this code is only there to cope with the server
crashing beforehand; I don't think we need the recheck.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Anthony Iliopoulos 2018-04-02 20:38:06 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Previous Message Bruce Momjian 2018-04-02 20:24:02 Re: BUG #15112: Unable to run pg_upgrade with earthdistance extension