From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Date: 2025-06-23 05:44:05
Message-ID: CANWCAZaRzYu4uNa-Xo8mWe1xp9zC9zjiY_UVCqMJMfTPmLQDOQ@mail.gmail.com
Lists: pgsql-hackers
On Fri, Jun 20, 2025 at 9:45 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> So, I think I figured out why I was seeing the test hang waiting for pg_stat_progress_vacuum to report an index count.
>
> Using auto_explain, I determined that the cursor was using an index-only scan with lower row counts. That meant it pinned an index leaf page instead of a heap page and the first round of index vacuuming couldn't complete because btree index vacuuming requires we acquire a cleanup lock on every leaf page.
>
> I solved this by disabling all index scans in the cursor's session.
Interesting find! I wondered how it would look if the cursor referred
to an extra non-indexed column, but the above seems fine, and the whole
thing is probably easier to reason about with only a single column.
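For reference, the usual way to keep a session off index scans is the
planner GUCs; a minimal sketch of the cursor side (the exact GUC list,
table, and column names here are my assumptions, not taken from the
patch):

```sql
-- Sketch: force the cursor's session onto a sequential scan so it
-- pins a heap page rather than a btree leaf page. (Which of these
-- settings the patch actually uses is an assumption on my part.)
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SET enable_bitmapscan = off;

BEGIN;
-- hypothetical table/column names for illustration
DECLARE c CURSOR FOR SELECT col1 FROM vac_horizon_floor_table;
FETCH FORWARD 1 FROM c;  -- holds a pin on a heap page
```

With the index plan types disabled, the cursor's pin lands on a heap
page, so btree vacuuming's cleanup-lock requirement on leaf pages no
longer blocks the first round of index vacuuming.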
> I attached the updated patch which passes for me on 32 and 64-bit builds. We've managed to reduce the row count so low (1000-2000 rows) that I'm not sure it matters if we have a 64-bit and 32-bit case. However, since we have the large block comment about the required number of rows, I figured we might as well have the two different nrows.
It seems backwards to have the comment influence the code -- the
comment should document decisions around the code. 2000 is already
100x smaller than pg16, so it's not really buying us much to be
platform-aware. The first two sentences in the comment seem fine.
After that, I'd just say:
# We choose the number of rows to make sure we exceed
maintenance_work_mem on all platforms we support, but we also want it
to be small so that the test runtime is short.
Last I checked, our regression tests fail on block sizes other than
8kB. If we ever need to cater to that, this comment covers that
eventuality as well.
> I'll have to do some more research on 14-16 to see if this could be a problem there.
>
> I also disabled prefetching, concurrent IO, and read combining for vacuum -- it didn't cause a problem in my local tests, but I could see it interfering with the test and potentially causing flakes/failures on some machines/configurations. That means I'll have to do a slightly different patch for 17 than 18 (17 doesn't have io_combine_limit).
>
> Finally, I disabled parallelism as a future-proofing for having heap vacuum parallelism -- wouldn't want a mysterious failure in this test in the future.
+ (PARALLEL 0 is a future-proofing measure in case we adopt
+ # parallel heap vacuuming)
Maybe it's possible to phrase this so it's true regardless of whether
we adopt that or not?
"PARALLEL 0 shouldn't be necessary, but guards against the possibility
of parallel heap vacuuming"
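Putting the pieces together, the vacuum side of the test might look
something like this sketch (the GUC values, option list, and table
name are assumptions on my part, not taken from the patch; per
upthread, io_combine_limit isn't available on 17, so that line would
be dropped there):

```sql
-- Disable prefetching and read combining so vacuum's heap scan
-- doesn't read ahead past the page the cursor has pinned.
SET maintenance_io_concurrency = 0;
SET io_combine_limit = 1;

-- PARALLEL 0 shouldn't be necessary today, but guards against the
-- possibility of parallel heap vacuuming in the future.
VACUUM (FREEZE, PARALLEL 0) vac_horizon_floor_table;
```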
--
John Naylor
Amazon Web Services