Re: do only critical work during single-user vacuum?

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: do only critical work during single-user vacuum?
Date: 2022-02-03 15:49:00
Message-ID: CAFBsxsHGRqsLFZsJ7TLsNwsyZJFDV2L-XWw9W9A0E7hX6BaLHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 3, 2022 at 3:14 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

> + The only other option that may be combined with
> <literal>VERBOSE</literal>, although in single-user mode no client
> messages are
> + output.
>
> Given VERBOSE with EMERGENCY can work only in multi-user mode, why
> only VERBOSE can be specified with EMERGENCY? I think the same is true
> for other options like PARALLEL; PARALLEL can work only in multi-user
> mode.

You are right; it makes sense to allow options that would be turned
off automatically in single-user mode. Even if we don't expect it to
be used in normal mode, the restrictions should make sense. Also,
maybe documenting the allowed combinations is a distraction in the
main entry and should be put in the notes at the bottom.

> + It performs a database-wide vacuum on tables, toast tables, and
> materialized views whose
> + xid age or mxid age is older than 1 billion.
>
> Do we need to allow the user to specify the threshold or need a higher
> value (at least larger than 1.6 billion, default value of
> vacuum_failsafe_age)? I imagined a case where there are a few very-old
> tables (say 2 billion old) and many tables that are older than 1
> billion. In this case, VACUUM (EMERGENCY) would take a long time to
> complete.

I still don't think fine-tuning is helpful here. Shutdown vacuum
should be just as trivial to run as it is now, but with better
behavior. I believe a user knowledgeable enough to come up with the
best number is unlikely to get in this situation in the first place.
I'm also not sure a production support engineer would (or should)
immediately figure out a better number than a good default. That said,
the 1 billion figure was a suggestion from Peter G. upthread, and a
higher number could be argued.

> But to minimize the downtime, we might want to run VACUUM
> (EMERGENCY) on only the very-old tables, start the cluster in
> multi-user mode, and run vacuum on multiple tables in parallel.

That's exactly the idea. Also, back in normal mode, we can start
streaming WAL again. However, we don't want to go back online so close
to the limit that we risk shutdown again. People have a reasonable
expectation that if you fix an emergency, it's now fixed and the
application can go back online. Falling down repeatedly, or worrying
if it's possible, is very bad.

> + if (params->options & VACOPT_EMERGENCY)
> + {
> + /*
> + * Only consider relations able to hold unfrozen XIDs (anything else
> + * should have InvalidTransactionId in relfrozenxid anyway).
> + */
> + if (classForm->relkind != RELKIND_RELATION &&
> + classForm->relkind != RELKIND_MATVIEW &&
> + classForm->relkind != RELKIND_TOASTVALUE)
> + {
> + Assert(!TransactionIdIsValid(classForm->relfrozenxid));
> + Assert(!MultiXactIdIsValid(classForm->relminmxid));
> + continue;
> + }
> +
> + table_xid_age = DirectFunctionCall1(xid_age,
> classForm->relfrozenxid);
> + table_mxid_age = DirectFunctionCall1(mxid_age,
> classForm->relminmxid);
> +
>
> I think that instead of calling xid_age and mxid_age for each
> relation, we can compute the thresholds for xid and mxid once, and
> then compare them to relation's relfrozenxid and relminmxid.

That sounds like a good idea if it's simple to implement, so I will
try it. If it adds complexity, I don't think it's worth it. Scanning a
few thousand rows in pg_class along with the function calls is tiny
compared to the actual vacuum work.

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-03 16:04:33 Re: libpq async duplicate error results
Previous Message Ajin Cherian 2022-02-03 15:26:28 Re: row filtering for logical replication