Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum
Date: 2022-11-09 21:42:43
Message-ID: CAH2-WznOEQCO-nyTHRn9de2J16+n1XCvbBnnt+om2ZKUpTSvDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 9, 2022 at 6:29 AM Imseih (AWS), Sami <simseih(at)amazon(dot)com> wrote:
> When a user is running a parallel vacuum and the vacuum is long running
>
> due to many large indexes, it would make sense to check for failsafe earlier.

It makes sense to prefer consistency here, I suppose. The reason why
we're not consistent is because it was easier not to be, which isn't
exactly the best reason (nor the worst).

I don't think that it's obvious that we need to call the failsafe at
any particular frequency. There is probably an argument to be made for
the idea that we're not checking frequently enough (even in the common
serial VACUUM case), just as there is an argument to be made for the
opposite idea. It's not like there is some simple linear relationship
(or any kind of relationship) between the amount of physical work
performed by one VACUUM operation, and the rate of XID consumption by
the system as a whole. And so the details of how we do it have plenty
to do with what we can afford to do.

My gut instinct is that the most important thing is to at least call
lazy_check_wraparound_failsafe() once per index scan. Multiple index
scans are disproportionately involved in VACUUMs that take far longer
than expected, which are presumably the kind of VACUUMs that tend to
be running when the failsafe actually triggers. We don't really expect
the failsafe to trigger, so presumably when it actually does things haven't
been going well for some time. (Index corruption that prevents forward
progress on one particular index is another example.)

That said, one thing that does bother me in this area occurs to me: we
call lazy_check_wraparound_failsafe() from lazy_scan_heap() (before we
get to the index scans that you're talking about) at an interval that
is based on how many heap pages we've either processed (and recorded
as a scanned_pages page) *or* have skipped over using the visibility
map. In other words we use blkno here, when we should really be using
scanned_pages instead:

if (blkno - next_failsafe_block >= FAILSAFE_EVERY_PAGES)
{
lazy_check_wraparound_failsafe(vacrel);
next_failsafe_block = blkno;
}

This code effectively treats pages skipped using the visibility map as
equivalent to pages physically scanned (scanned_pages), even though
skipping requires essentially no work at all. That just isn't logical,
and feels like something worth fixing. The fundamental unit of work in
lazy_scan_heap() is a *scanned* heap page.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-11-09 22:08:03 Re: HOT chain validation in verify_heapam()
Previous Message Tom Lane 2022-11-09 21:05:05 Re: Expand palloc/pg_malloc API