Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing

From: Ryan Booz <ryan(at)softwareandbooz(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: samay sharma <smilingsamay(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
Date: 2023-05-12 17:36:27
Message-ID: CAMjhCZoPiOyGMmhGnxBHLgvqnJjJnEDY+-jJ3z2WNwUhfcGbjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the continued work, Peter. I hate to be the guy that starts this way,
but this is my first ever response on pgsql-hackers. (insert awkward
smile face).
Hopefully I've followed etiquette well, but please forgive any
missteps, and I'm
happy for any help in making better contributions in the future.

On Thu, May 11, 2023 at 9:19 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Thu, May 4, 2023 at 3:18 PM samay sharma <smilingsamay(at)gmail(dot)com> wrote:
> > What do you think about the term "Exhaustion"? Maybe something like "XID allocation exhaustion" or "Exhaustion of allocatable XIDs"?
>
> I use the term "transaction ID exhaustion" in the attached revision,
> v4. Overall, v4 builds on the work that went into v2 and v3, by
> continuing to polish the overhaul of everything related to freezing,
> relfrozenxid advancement, and anti-wraparound autovacuum.

Just to say on the outset, as has been said earlier in the tread by others,
that this is herculean work. Thank you for putting the effort you have thus far.
There's a lot of good from where I sit in the modification efforts.
It's a heavy,
dense topic, so there's probably never going to be a perfect way to
get it all in,
but some of the context early on, especially, is helpful for framing.

>
> It would be nice if it was possible to add an animation/diagram a
> little like this one: https://tuple-freezing-demo.angusd.com (this is
> how I tend to think about the "transaction ID space".)

Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But,
this or something akin to the "clock" image I've seen elsewhere when
describing the transaction ID space would probably be helpful if it were ever
possible. In fact, there's just a lot about the MVCC stuff in general that
would benefit from diagrams. But alas, I guess that's why we have some
good go-to community talks/slide decks. :-)

> v4 also limits use of the term "wraparound" to places that directly
> discuss anti-wraparound autovacuums (plus one place in xact.sgml,
> where discussion of "true unsigned integer wraparound" and related
> implementation details has been moved). Otherwise we use the term
> "transaction ID exhaustion", which is pretty much the user-facing name
> for "xidStopLimit". I feel that this is a huge improvement, for the
> reason given to Greg earlier. I'm flexible on the details, but I feel
> strongly that we should minimize use of the term wraparound wherever
> it might have the connotation of "the past becoming the future". This
> is not a case of inventing a new terminology for its own sake. If
> anybody is skeptical I ask that they take a look at what I came up
> with before declaring it a bad idea. I have made that as easy as
> possible, by once again attaching a prebuilt routine-vacuuming.html.

Thanks again for doing this. Really helpful for doc newbies like me that
want to help but are still working through the process. Really helpful
and appreciated.

>
>
> Other changes in v4, compared to v3:
>
> * Improved discussion of the differences between non-aggressive and
> aggressive VACUUM.

This was helpful for me and not something I've previously put much thought
into. Helpful context that is missing from the current docs.

> * Explains "catch-up freezing" performed by aggressive VACUUMs directly.
>
> "Catch-up" freezing is the really important "consequence" -- something
> that emerges from how each type of VACUUM behaves over time. It is an
> indirect consequence of the behaviors. I would like to counter the
> perception that some users have about freezing only happening during
> aggressive VACUUMs (or anti-wraparound autovacuums). But more than
> that, talking about catch-up freezing seems essential because it is
> the single most important difference.
>

Similarly, this was helpful overall context of various things
happening with freezing.

> * Much improved handling of the discussion of anti-wraparound
> autovacuum, and how it relates to aggressive VACUUMs, following
> feedback from Samay.
>
> There is now only fairly minimal overlap in the discussion of
> aggressive VACUUM and anti-wraparound autovacuuming. We finish the
> discussion of aggressive VACUUM just after we start discussing
> anti-wraparound autovacuum. This transition works well, because it
> enforces the idea that anti-wraparound autovacuum isn't really special
> compared to any other aggressive autovacuum. This was something that
> Samay expressed particularly concern about: making anti-wraparound
> autovacuums sound less scary. Though it's also a concern I had from
> the outset, based on practical experience and interactions with people
> that have much less knowledge of Postgres than I do.

Agree. This flows fairly well and helps the user understand that each
"next step"
in the vacuum/freezing process has a distinct job based on previous work.

>
> * Anti-wraparound autovacuum is now mostly discussed as something that
> happens to static or mostly-static tables....
> ...This moves discussion of anti-wraparound av in the direction of:
> "Anti-wraparound autovacuum is a special type of autovacuum. Its
> purpose is to ensure that relfrozenxid advances when no earlier VACUUM
> could advance it in passing — often because no VACUUM has run against
> the table for an extended period."
>

Again, learned something new here, at least in how I think about it and talk
with others. In total, I do think these changes make wraparound/exhaustion
seem less "the sky is falling".

> * Added a couple of "Tips" about instrumentation that appears in the
> server log whenever autovacuum reports on a VACUUM operation.
>
> * Much improved "Truncating Transaction Status Information" subsection.
>
> My explanation of the ways in which autovacuum_freeze_max_age can
> affect the storage overhead of commit/abort status in pg_xact is much
> clearer than it was in v3 -- pg_xact truncation is now treated as
> something loosely related to the global config of anti-wraparound
> autovacuum, which makes most sense.
>
This one isn't totally sinking in with me yet. Need another read.

> It took a great deal of effort to find a structure that covered
> everything, and that highlighted all of the important relationships
> without going too far, while at the same time not being a huge mess.
> That's what I feel I've arrived at with v4.

In most respects I agree with the overall flow of changes w.r.t the current doc.
Focusing on all of this as something that should normally just be happening
as part of autovacuum is helpful. Working through it as an order of operations
(and I'm just assuming this is the general order) feels like it ties
things together
a lot more. I honestly come away from this document with more of a "I understand
the process" feel than I did previously.

For now, I'd add the following few comments on the intro section,
2.5.1 and 2.5.2. I
haven't gotten to the bottom sections yet for much feedback.

Intro Comments:
1) "The autovacuum daemon automatically schedules maintenance tasks based on
workload requirements." feels at tension with "Autovacuum scheduling
is controlled
via threshold settings."

Owing to the lingering belief that many users have whereby hosting providers
have magically enabled Postgres to do all of this for you, there is
still a need to
actively deal with these thresholds based on load. That is, as far as
I understand,
Postgres doesn't automatically adjust based on load. Someone/thing
still has to modify
the thresholds as load and data size changes.

If the "workload requirements" is pointing towards aggressive
freezing/wraparound
tasks that happen regardless of thresholds, then for me at least that
isn't clear
in that sentence and it feels like there's an implication that
Postgres/autovacuum
is going to magically adjust overall vacuum work based on database workload.

2) "The intended audience is database administrators that wish to
perform more advanced
autovacuum tuning, with any of the following goals in mind:"

I love calling out the audience in some way. That's really helpful, as are the
stated goals in the bullet list. However, as someone feeling pretty novice
after reading all of this, I can't honestly connect how the content on this page
helps me to more advanced tuning. I have a much better idea how freezing,
in particular, works (yay!), but I'm feeling a bit dense how almost anything
here helps me tune vacuum, at least as it relates to the bullets.

I'm sure you have a connection in mind for each, and certainly understanding the
inner workings of what's happening under the covers is tremendously beneficial,
but when I search for "response" or "performance" in this document, it refers
back to another page (not included in this patch) that talks about the
thresholds.

It might be as simple as adding something to the end of each bullet to draw
that relationship, but as is, it's hard for me to do it mentally (although I can
conjecture a few things on my own)

That said, I definitely appreciate the callout that tuning is an
iterative process
and the minor switch from "creates a substantial amount of I/O
traffic" to "may create...".

** Section 2.5.1 - Recovering Disk Space **

3). "The space dead tuples occupy must eventually be reclaimed for reuse
by new rows, to avoid unbounded growth of disk space requirements. Reclaiming
space from dead rows is VACUUM's main responsibility."

It feels like one connection you could make to the bullet list above
is in this area
and not mentioned. By freeing up space and reducing the number of pages that
need to be read for satisfying a query, vacuum and recovering disk space
(theoretically) improves query performance. Not 100% how to add it in context
of these first two paragraphs.

4) Caution: "It may be a good idea to add monitoring to alert you about this."
I hate to be pedantic about it, but I think we should spell out
"this". Do we have
a pointer in documentation to what kinds of things to monitor for? Am monitoring
long-running transactions or some metric that shows me that VACUUM is being
"held back"? I know what you mean, but it's not clear to me how to do the right
thing in my environment here.

5) The plethora of tips/notes/warnings.
As you and others have mentioned, as presented these really have no context
for me. Individually they are good/helpful information, but it's
really hard to make
a connection to what I should "do" about it.

It seems to me that this would be a good place to put a subsection which is
something like, "A note about reclaiming disk space" or something. In my
experience, most people hear about and end up using VACUUM FULL because
things got out of control and they want to get into a better spot (I have been
in that boat). I think with a small section that says, in essence,
"hey, now that
you understand why/how vacuum reclaims disk resources normally, if you're
in a position where things aren't in a good state, this is what you need to know
if you want to reclaim space from a really inefficient table"

For me, at least, I think it would be easier to read/grok what you're sharing in
these callouts.

6) One last missing piece that very well might be in another page not referenced
(I obviously need to get the PG16 docs pulled and built locally so
that I can have
better overall reference. My apologies).

In my experience, one of the biggest issues with the thresholds and recovering
space is the idea of tuning individual tables, not just the entire
database. 5/10/20%
might be fine for most tables, but it's usually the really active ones
that need the
tuning, specifically lowering the thresholds. That doesn't come across to me in
this section at all. Again, maybe I've missed something on another page and
it's all good, but it felt worth calling out.

Plus, it may provide an opportunity to bring in the threshold formulas again if
they aren't referenced elsewhere (although they probably are).

Hope that makes sense.

** Section 2.5.2: Freezing to manage... **
As stated above, the effort here overall is great IMO. I like the flow
and reduction
in alarmist tone for things like wraparound, etc. I understand more
about freezing,
aggressive and otherwise, than I did before.

7) That said, totally speaking as a non-contributor, this section is
obviously very long
for good reason. But, by the time I've gotten down to 25.2.2.3, my
brain is a bit
bewildered on where we've gotten to. That's more a comment on my capability
to process it all, but I wonder if a slightly more explicit intro
could help set the
stage at least.

"One side-effect of vacuum and transaction ID management at the row level is
that PostgreSQL would normally need to inspect each row for every query to
ensure it is visible to each requesting transaction. In order to
reduce the need to
read and inspect excessive amounts of data at query time or when normal vacuum
maintenance kicks in, VACUUM has a second job called freezing, which
accomplishes three goals: (attempting to tie in the three sections)
* speeding up queries and vacuum operations by...
* advancing the transaction ID space on generally static tables...
* ensure there are always free transaction IDs available for normal
operation...
"

Maybe totally worthless and too much, but something like that might set a reader
up for just a bit more context. Then you could take most of what comes before
"2.5.2.2.1 Aggressive Vacuum" as a subsection (would require a renumber below)
with something like "2.5.2.2.1 Normal Freezing Activity"

8) Note "In PostgreSQL versions before 16..."
Showing my naivety, somehow this isn't connecting with me totally. If
it's important
to call out, then maybe we need a connecting sentence. Based on the content
above, I think you're pointing to "It's also why VACUUM will freeze all eligible
tuples from a heap page once the decision to freeze at least one tuple
is taken:"
If that's it, it's just not clear to me what's totally changed. Sorry,
more learning. :-)

---
Hope something in there is helpful.

Ryan Booz

>
> --
> Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ryan Booz 2023-05-12 17:37:34 Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing
Previous Message Tom Lane 2023-05-12 17:28:04 Re: psql tests hangs