Re: should vacuum's first heap pass be read-only?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: should vacuum's first heap pass be read-only?
Date: 2022-02-25 17:15:25
Message-ID: CAH2-WzkKM2sj8TajHyTb_QEikYVzGGatZgXH913SrnZk=xzCMw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 25, 2022 at 5:06 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Based on this discussion, IIUC, we are saying that now we will do the
> lazy_scan_heap every time like we are doing now. And we will
> conditionally skip the index vacuum for all or some of the indexes and
> then based on how much index vacuum is done we will conditionally do
> the lazy_vacuum_heap_rel(). Is my understanding correct?

I can only speak for myself, but that sounds correct to me. IMO what
we really want here is to create lots of options for VACUUM: to do
the work of index vacuuming when it is most convenient, based on very
recent information about what's going on in each index. There are
some specific, obvious ways that this might help. For example, it
would be nice if the failsafe didn't really skip index vacuuming --
it could just put it off until later, after relfrozenxid has been
advanced to a safe value.
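
To show the general shape I have in mind, here is a toy standalone
sketch (invented names and thresholds, not actual VACUUM code): each
index is vacuumed now or deferred based on recent information about
it, and the second heap pass only happens once every index has been
dealt with.

#include <stdbool.h>
#include <stdio.h>

typedef struct IndexState
{
    const char *name;
    double      bloat_score;    /* recent info about this index */
} IndexState;

#define VACUUM_NOW_THRESHOLD 0.5

/* Decide, per index, whether vacuuming it right now is convenient. */
static bool
should_vacuum_index_now(const IndexState *idx, bool failsafe_active)
{
    /* The failsafe defers rather than skips: do the work later,
     * after relfrozenxid has been advanced to a safe value. */
    if (failsafe_active)
        return false;
    return idx->bloat_score >= VACUUM_NOW_THRESHOLD;
}

int
main(void)
{
    IndexState  indexes[] = {
        {"idx_uuid", 0.8}, {"idx_created_at", 0.2}, {"idx_status", 0.6}
    };
    int         nindexes = 3;
    int         nvacuumed = 0;

    for (int i = 0; i < nindexes; i++)
    {
        if (should_vacuum_index_now(&indexes[i], false))
        {
            printf("vacuum %s now\n", indexes[i].name);
            nvacuumed++;
        }
        else
            printf("defer %s: its dead TIDs go to the conveyor belt\n",
                   indexes[i].name);
    }

    /* Only once every index has been vacuumed can a second heap pass
     * (lazy_vacuum_heap_rel) mark LP_DEAD items LP_UNUSED. */
    if (nvacuumed == nindexes)
        printf("do the second heap pass\n");
    else
        printf("skip the second heap pass this time\n");
    return 0;
}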

Bear in mind that the cost of lazy_scan_heap is often vastly less
than the cost of vacuuming all indexes -- so doing a bit more work
there than is theoretically necessary is not necessarily a problem.
That's especially true if you have something like UUID indexes, where
there is no natural locality. Many tables have 10+ indexes, even
large tables.
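
To put some purely illustrative numbers on that asymmetry: a heap
scan can skip all-visible pages, so it might read, say, 10GB of a
100GB table, whereas an index vacuum pass (at least for nbtree's
ambulkdelete) reads each index in full -- with ten 10GB indexes
that's 100GB of I/O, and for a UUID index the dead TIDs are scattered
across essentially all of its leaf pages anyway.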

> IMHO, if we are doing the heap scan every time, then we are going
> to get the same dead items again that we had previously collected
> in the conveyor belt. I agree that we will not add them again to
> the conveyor belt, but why do we want to store them in the conveyor
> belt at all when we are going to redo the whole scan anyway?

I don't think we want to, exactly. Maybe it's easier to store
redundant TIDs than to avoid storing them in the first place (we can
probably just accept some redundancy). There is a trade-off to be made
there. I'm not at all sure of what the best trade-off is, though.
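
One reason accepting some redundancy seems safe (a toy standalone
illustration with invented names, not PostgreSQL code): an index pass
that deletes by TID just asks "is this TID in the dead set?", and a
duplicate entry cannot change that answer -- it only costs space.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { unsigned block; unsigned short offset; } Tid;

static int
tid_cmp(const void *a, const void *b)
{
    const Tid *x = a, *y = b;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return (int) x->offset - (int) y->offset;
}

/* The check an index bulk-delete callback would make per entry. */
static bool
tid_is_dead(const Tid *dead, size_t ndead, Tid t)
{
    return bsearch(&t, dead, ndead, sizeof(Tid), tid_cmp) != NULL;
}

int
main(void)
{
    /* The same dead TID stored twice: same answer, just more space. */
    Tid dead[] = {{1, 3}, {1, 3}, {2, 7}};
    Tid probe = {1, 3};

    qsort(dead, 3, sizeof(Tid), tid_cmp);
    printf("dead? %d\n", tid_is_dead(dead, 3, probe));
    return 0;
}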

> I think (without global indexes) the main advantage of using the
> conveyor belt is that if we skip the index scan for some of the
> indexes, we can save the dead items somewhere, so that we have them
> available to do the index vacuum sometime in the future without
> scanning the heap again.

Global indexes are important in their own right, but ISTM that their
needs are similar to everything else's anyway. Having this
flexibility matters even more with global indexes, but the underlying
concepts are the same. We want options and maximum flexibility,
everywhere.

> but if you are going to rescan the heap again next time before
> doing any index vacuuming, then why do we want to store them at
> all?

It all depends, of course. The decision needs to be made using a cost
model. I suspect it will be necessary to try it out, and see.
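
For example, a first-cut cost model might weigh the I/O of vacuuming
an index now against the expected price of storing its dead TIDs and
picking them up later (a toy sketch -- formula, names, and numbers
are all invented, not from PostgreSQL):

#include <stdbool.h>
#include <stdio.h>

static bool
defer_index_vacuum(double index_pages,     /* full ambulkdelete scan */
                   double tid_store_pages, /* conveyor-belt writes */
                   double extra_heap_pages) /* later rescan share */
{
    double vacuum_now_cost = index_pages;
    double defer_cost = tid_store_pages + extra_heap_pages;

    return defer_cost < vacuum_now_cost;
}

int
main(void)
{
    /* A big, barely-bloated index: deferring looks cheap. */
    printf("defer? %d\n", defer_index_vacuum(1000000, 500, 20000));
    return 0;
}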

--
Peter Geoghegan
