Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing

From: samay sharma <smilingsamay(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
Date: 2023-05-04 21:57:32
Message-ID: CAJxrbyxgWq3KZ3ExLbQ_S-H8zT8md5kU5=RD2bqV4LCo8chQ9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, May 3, 2023 at 2:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> Hi Samay,
>
> On Tue, May 2, 2023 at 11:40 PM samay sharma <smilingsamay(at)gmail(dot)com>
> wrote:
> > Thanks for taking the time to do this. It is indeed difficult work.
>
> Thanks for the review! I think that this is something that would
> definitely benefit from a perspective such as yours.
>

Glad to hear that my feedback was helpful.

>
> > There are things I like about the changes you've proposed and some where
> I feel that the previous section was easier to understand.
>
> That makes sense, and I think that I agree with every point you've
> raised, bar none. I'm pleased to see that you basically agree with the
> high level direction.
>
> I would estimate that the version you looked at (v2) is perhaps 35%
> complete. So some of the individual problems you noticed were a direct
> consequence of the work just not being anywhere near complete. I'll
> try to do a better job of tracking the relative maturity of each
> commit/patch in each commit message, going forward.
>
> Anything that falls under "25.2.1. Recovering Disk Space" is
> particularly undeveloped in v2. The way that I broke that up into a
> bunch of WARNINGs/NOTEs/TIPs was just a short term way of breaking it
> up into pieces, so that the structure was very approximately what I
> wanted. I actually think that the stuff about CLUSTER and VACUUM FULL
> belongs in a completely different chapter. Since it is not "Routine
> Vacuuming" at all.
>
> >> 2. Renamed "Preventing Transaction ID Wraparound Failures" to
> >> "Freezing to manage the transaction ID space". Now we talk about
> >> wraparound as a subtopic of freezing, not vice-versa. (This is a
> >> complete rewrite, as described by later items in this list).
> >
> > +1 on this too. Freezing is a normal part of vacuuming and while the
> aggressive vacuums are different, I think just talking about the worst case
> scenario while referring to it is alarmist.
>
> Strangely enough, Postgres 16 is the first version that instruments
> freezing in its autovacuum log reports. I suspect that some long term
> users will find it quite surprising to see how much (or how little)
> freezing takes place in non-aggressive VACUUMs.
>
> The introduction of page-level freezing will make it easier and more
> natural to tune settings like vacuum_freeze_min_age, with the aim of
> smoothing out the burden of freezing over time (particularly by making
> non-aggressive VACUUMs freeze more). Page-level freezing removes any
> question of not freezing every tuple on a page (barring cases where
> "removable cutoff" is noticeably held back by an old MVCC snapshot).
> This makes it more natural to think of freezing as a process that
> makes it okay to store data in individual physical heap pages, long
> term.
>
> > 1) While I agree that bundling VACUUM and VACUUM FULL is not the right
> way, moving all VACUUM FULL references into tips and warnings also seems
> excessive. I think it's probably best to just have a single paragraph which
> talks about VACUUM FULL as I do think it should be mentioned in the
> reclaiming disk space section.
>
> As I mentioned briefly already, my intention is to move it to another
> chapter entirely. I was thinking of "Chapter 29. Monitoring Disk
> Usage". The "Routine Vacuuming" docs would then link to this sect1 --
> something along the lines of "non-routine commands to reclaim a lot of
> disk space in the event of extreme bloat".
>
> > 2) I felt that the new section, "Freezing to manage the transaction ID
> space" could be made simpler to understand. As an example, I understood
> what the parameters (autovacuum_freeze_max_age, vacuum_freeze_table_age) do
> and how they interact better in the previous version of the docs.
>
> Agreed. I'm going to split it up some more. I think that the current
> "25.2.2.1. VACUUM's Aggressive Strategy" should be split in two, so we
> go from talking about aggressive VACUUMs to Antiwraparound
> autovacuums. Finding the least confusing way of explaining it has been
> a focus of mine in the last few days.
>

To be honest, this was not super simple to understand even in the previous
version. However, as our goal is to simplify this and make it easier to
understand, I'll hold this patch-set to a higher standard :).

I wish there was a simple representation (maybe even a table or something)
which would explain the differences between a VACUUM which is not
aggressive, a VACUUM which ends up being aggressive due to
vacuum_freeze_table_age and an antiwraparound autovacuum.

>
> > 4) I think we should explicitly call out that seeing an anti-wraparound
> VACUUM or "VACUUM table (to prevent wraparound)" is normal and that it's
> just a VACUUM triggered due to the table having unfrozen rows with an XID
> older than autovacuum_freeze_max_age. I've seen many users panicking on
> seeing this and feeling that they are close to a wraparound.
>
> That has also been my exact experience. Users are terrified, usually
> for no good reason at all. I'll make sure that this comes across in
> the next revision of the patch series.
>

Thinking about it a bit more, I wonder if there's value in changing the
"(to prevent wraparound)" to something else. It's understandable why people
who just see that in pg_stat_activity and don't read docs might assume they
are close to a wraparound.

Regards,
Samay

>
> > Also, we should be more clear about how it's different from VACUUMs
> triggered due to the scale factors (cancellation behavior, being triggered
> when autovacuum is disabled etc.).
>
> Right. Though I think that the biggest point of confusion for users is
> how *few* differences there really are between antiwraparound
> autovacuum, and any other kind of autovacuum that happens to use
> VACUUM's aggressive strategy. There is really only one important
> difference: the autocancellation behavior. This is an autovacuum
> behavior, not a VACUUM behavior -- so the "VACUUM side" doesn't know
> anything about that at all.

> > 5) Can we use a better name for the XidStopLimit mode? It seems like a
> very implementation centric name. Maybe a better version of "Running out of
> the XID space" or something like that?
>
> Coming up with a new user-facing name for xidStopLimit is already on
> my TODO list (it's surprisingly hard). I have used that name so far
> because it unambiguously refers to the exact thing that I want to talk
> about when discussing the worst case. Other than that, it's a terrible
> name.
>
> > 6) In the XidStopLimit mode section, it would be good to explain briefly
> why you could get to this scenario. It's not something which should happen
> in a normal running system unless you have a long running transaction or
> inactive replication slots or a badly configured system or something of
> that sort.
>
> I agree that that's important. Note that there is already something
> about "removable cutoff" being held back at the start of the
> discussion of freezing -- that will prevent freezing in exactly the
> same way as it prevents cleanup of dead tuples.
>
> That will become a WARNING box in the next revision. There should also
> be a similar, analogous WARNING box (about "removable cutoff" being
> held back) much earlier on in the docs -- this should appear in
> "25.2.1. Recovering Disk Space". Obviously this structure suggests
> that there is an isomorphism between freezing and removing bloat. For
> example, if you cannot "freeze" an XID that appears in some tuple's
> xmax, then you also cannot remove that tuple because VACUUM only sees
> it as a recently dead tuple (if xmax is >= OldestXmin/removable
> cutoff, and from a deleter that already committed).
>
> I don't think that we need to spell the "isomorphism" point out to the
> reader directly, but having a subtle cue that that's how it works
> seems like a good idea.
>
> > If you got to this point, other than running VACUUM to get out of the
> situation, it's also important to figure out what got you there in the
> first place as many VACUUMs should have attempted to advance the
> relfrozenxid and failed.
>
> It's also true that problems that can lead to the system entering
> xidStopLimit mode aren't limited to cases where doing required
> freezing is fundamentally impossible due to something holding back
> "removable cutoff". It's also possible that VACUUM simply can't keep
> up (though the failsafe has helped with that problem a lot).
>
> I tend to agree that there needs to be more about this in the
> xidStopLimit subsection (discussion of freezing being held back by
> "removable cutoff" is insufficient), but FWIW that seems like it
> should probably be treated as out of scope for this patch. It is more
> the responsibility of the other patch [1] that aims to put the
> xidStopLimit documentation on a better footing (and remove that
> terrible HINT about single user mode).
>
> Of course, that other patch is closely related to this patch -- the
> precise boundaries are unclear at this point. In any case I think that
> this should happen, because I think that it's a good idea.
>
> > There are a few other small things I noticed along the way but my goal
> was to look at the overall structure.
>
> Thanks again! This is very helpful.
>
> [1]
> https://www.postgresql.org/message-id/flat/CAJ7c6TM2D277U2wH8X78kg8pH3tdUqebV3_JCJqAkYQFHCFzeg%40mail.gmail.com
> --
> Peter Geoghegan
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-05-04 22:11:49 Re: A bug with ExecCheckPermissions
Previous Message Jonathan S. Katz 2023-05-04 19:41:57 Re: pg_stat_io not tracking smgrwriteback() is confusing