Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Subject: Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Date: 2023-04-03 12:33:28
Message-ID: CAJ7c6TPuMSyxy=BoipJaa4H4N2fp4up0ZEvq3J3eXXdC_ym3OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

Many thanks for all the great feedback!

> Okay, the changes look good. To go further, I think we need to combine into two patches, one with 0001-0003 and one with 0004:
>
> 1. Correct false statements about "shutdown" etc. This should contain changes that can safely be patched all the way to v11.
> 2. Change bad advice (single-user mode) into good advice. We can target head first, and then try to adopt as far back as we safely can (and test).

Done.

> > > In swapping this topic back in my head, I also saw [2] where Robert suggested
> > >
> > > "that old prepared transactions and stale replication
> > > slots should be emphasized more prominently. Maybe something like:
> > >
> > > HINT: Commit or roll back old prepared transactions, drop stale
> > > replication slots, or kill long-running sessions.
> > > Ensure that autovacuum is progressing, or run a manual database-wide VACUUM."
> >
> > It looks like the hint regarding replication slots was added at some
> > point. Currently we have:
> >
> > ```
> > errhint( [...]
> > "You might also need to commit or roll back old prepared
> > transactions, or drop stale replication slots.")));
> > ```
>
> Yes, the exact same text as it appeared in the [2] thread above, which prompted Robert's comment I quoted. I take the point to mean: All of these things need to be taken care of *first*, before vacuuming, so the hint should order things so that it is clear.
>
> > Please let me know if you think
> > we should also add a suggestion to kill long-running sessions, etc.
>
> +1 for also adding that.

OK, done. I included this change as a separate patch. It can be
squashed with another one if necessary.

While on it, I noticed that multixact.c still talks about a
"shutdown". I made corresponding changes in 0001.

> - errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
> + errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"",
>
> I'm not quite on board with the new message, but welcome additional opinions. For one, it's a bit longer and now ambiguous. I also bet that "generate XIDs" doesn't really communicate anything useful. The people who understand exactly what that means, and what the consequences are, are unlikely to let the system get near wraparound in the first place, and might even know enough to ignore the hint.
>
> I'm thinking we might need to convey something about "writes". While it's less technically correct, I imagine it's more useful. Remember, many users have it drilled in their heads that they need to drop immediately to single-user mode. I'd like to give some idea of what they can and cannot do.

This particular wording was chosen for consistency with multixact.c:

```
errmsg("database is not accepting commands that generate new
MultiXactIds to avoid wraparound data loss in database \"%s\"",
```

The idea of using "writes" is sort of OK, but note that the same
message will appear for a query like:

```
SELECT pg_current_xact_id();
```

... which doesn't do writes. The message will be misleading in this case.

On top of that, although a PostgreSQL user may not be aware of
MultiXactIds, arguably there are many users that are aware of XIDs.
Not to mention the fact that XIDs are well documented.

I didn't make this change in v4 since it seems to be controversial and
probably not the highest priority at the moment. I suggest we discuss
it separately.

> I propose for discussion that 0004 should show in the docs all the queries for finding prepared xacts, repl slots etc. If we ever show the info at runtime, we can dispense with the queries, but there seems to be no urgency for that...

Good idea.

> + Previously it was required to stop the postmaster and VACUUM the database
> + in a single-user mode. There is no need to use a single-user mode anymore.
>
> I think we need to go further and actively warn against it: It's slow, impossible to monitor, disables replication and disables safeguards against wraparound. (Other bad things too, but these are easily understandable for users)
>
> Maybe mention also that it's main use in wraparound situations is for a way to perform DROPs and TRUNCATEs if that would help speed up resolution.

Fixed.

--
Best regards,
Aleksander Alekseev

Attachment Content-Type Size
v4-0001-Correct-the-docs-and-messages-about-preventing-XI.patch application/octet-stream 8.3 KB
v4-0002-Don-t-recommend-running-VACUUM-in-a-single-user-m.patch application/octet-stream 4.0 KB
v4-0003-Modify-the-hints-about-preventing-XID-wraparound.patch application/octet-stream 10.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-04-03 12:40:41 Re: Sketch of a fix for that truncation data corruption issue
Previous Message Tomas Vondra 2023-04-03 12:25:59 Re: hio.c does visibilitymap_pin()/IO while holding buffer lock