Re: do only critical work during single-user vacuum?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: do only critical work during single-user vacuum?
Date: 2022-02-04 01:35:39
Message-ID: 20220204013539.qdegpqzvayq3d4y2@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-02-03 17:02:15 -0500, Robert Haas wrote:
> On Thu, Feb 3, 2022 at 4:50 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I wonder if we shouldn't add some exceptions to the xid allocation
> > prevention. It makes sense that we don't allow random DML. But it's e.g. often
> > more realistic to drop / truncate a few tables with unimportant content,
> > rather than spend the time vacuuming those. We could e.g. allow xid
> > consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
> > level for longer than we allow it for anything else.
>
> True, although we currently don't start refusing XID allocation
> altogether until only 1 million remain, IIRC. And that's cutting it
> really close if we need to start consuming 1 XID per table we need to
> drop. We might need to push out some of the thresholds a bit.

Yea, I'd have no problem leaving the "hard" limit somewhere close to 1
million (although 100k should work just as well), while introducing a softer
"only vacuum/drop/truncate" limit a good bit before that.
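To illustrate what such tiering could look like from an administrator's
perspective (the threshold numbers below are invented for the sketch, not
PostgreSQL's actual limits):

```sql
-- Hypothetical sketch only: the cutoffs are made up to illustrate the
-- proposal.  age(datfrozenxid) approaches 2^31 as a database nears
-- wraparound.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       CASE
           WHEN age(datfrozenxid) > 2^31 - 100000
               THEN 'hard limit: refuse xid allocation'
           WHEN age(datfrozenxid) > 2^31 - 10000000
               THEN 'soft limit: allow only VACUUM/DROP/TRUNCATE'
           ELSE 'ok'
       END AS hypothetical_state
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```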

> For the most part, I think that there's no reason why autovacuum
> shouldn't be able to recover from this situation automatically, as
> long as old replication slots and prepared transactions are cleaned up
> and any old transactions are killed off.

To address the "as long as" part: I think that describing better what is
holding back the horizon would be a significant usability improvement.

Imagine that instead of the generic hints in these messages:
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
                    oldest_datname),
             errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
                     "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
and
    ereport(WARNING,
            (errmsg("oldest xmin is far in the past"),
             errhint("Close open transactions soon to avoid wraparound problems.\n"
                     "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));

we'd actually tell the user a bit more about what is causing the
problem.

We can compute the following:
1) oldest slot by xmin, with name
2) oldest walsender by xmin, with pid
3) oldest prepared transaction id by xid / xmin, with name
4) oldest in-progress transaction id by xid / xmin, with name
5) oldest database datfrozenxid, with database name
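
Much of that information is already reachable from SQL today; a rough
sketch of such a query (assuming a superuser or monitoring-capable role,
and glossing over details like walsenders without an xmin):

```sql
-- Sketch: approximate the horizon holders above, oldest first.
-- age() measures the distance from the current transaction id.
SELECT * FROM (
    SELECT 'replication slot' AS holder, slot_name::text AS id,
           age(xmin) AS xid_age
    FROM pg_replication_slots WHERE xmin IS NOT NULL
  UNION ALL
    SELECT 'walsender', pid::text, age(backend_xmin)
    FROM pg_stat_replication WHERE backend_xmin IS NOT NULL
  UNION ALL
    SELECT 'prepared transaction', gid, age(transaction)
    FROM pg_prepared_xacts
  UNION ALL
    SELECT 'backend', pid::text, age(backend_xmin)
    FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
  UNION ALL
    SELECT 'database datfrozenxid', datname::text, age(datfrozenxid)
    FROM pg_database
) AS horizons
ORDER BY xid_age DESC
LIMIT 10;
```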

If 1-4) are close to 5), there's no point in trying to vacuum aggressively;
it won't help. So we can instead say that the xmin horizon (with a better
name) is held back by the oldest of these, with enough identifying
information for the user to actually know where to look.

In contrast, if 5) is older than 1-4), then we can tell the user which
database is the problem, as we do right now, but we can stop mentioning the
"You might also need to commit ..." bit.

Also, adding an SRF providing the above in a useful format would be great for
monitoring and for "remote debugging" of problems.

Greetings,

Andres Freund
