Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Date: 2025-06-20 14:45:29
Message-ID: CAAKRu_Z-qjOgUjJqG_ScQFF_E9aEm1+uWPUJKXPwJ0QC27pKOQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 18, 2025 at 12:33 AM John Naylor <johncnaylorls(at)gmail(dot)com>
wrote:

> Here's what numbers I'm looking at as well:
>
> *** normal
> meson options: -Dcassert=true -Ddebug=true -Dc_args='-Og'
> -Dc_args='-fno-omit-frame-pointer'
>
> $ meson test -q --print-errorlogs --suite setup && meson test -q
> --print-errorlogs recovery/048_vacuum_horizon_floor
>
> $ grep finished
>
> build-debug/testrun/recovery/048_vacuum_horizon_floor/log/048_vacuum_horizon_floor_primary.log
>
> 2025-06-18 10:08:19.072 +07 [5730] 048_vacuum_horizon_floor.pl INFO:
> finished vacuuming "test_db.public.vac_horizon_floor_table": index
> scans: 12
>

Thanks for providing this! I actually did see the same number of index
scans as you with 9000 rows.

So, I think I figured out why I was seeing the test hang while waiting for
pg_stat_progress_vacuum to report a nonzero index vacuum count.
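
For reference, the wait in question is essentially a poll on
pg_stat_progress_vacuum. A minimal sketch of what that looks like in TAP
terms -- the node handle name is illustrative, not necessarily what the
patch uses:

# Wait for the concurrent VACUUM to report at least one completed round of
# index vacuuming; this is the poll that was hanging.
$node_primary->poll_query_until('test_db', qq[
	SELECT index_vacuum_count > 0
	FROM pg_stat_progress_vacuum
	WHERE relid = 'vac_horizon_floor_table'::regclass;
]) or die "timed out waiting for an index vacuuming pass";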

Using auto_explain, I determined that, at the lower row counts, the cursor
was using an index-only scan. That meant it pinned an index leaf page
instead of a heap page, and the first round of index vacuuming couldn't
complete, because btree index vacuuming requires acquiring a cleanup lock
on every leaf page.

I solved this by disabling all index scans in the cursor's session.
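
Concretely, that means turning off the index-scan-related planner GUCs in
the session holding the cursor before declaring it, so the cursor's plan
pins a heap page rather than a btree leaf page. A rough sketch (the session
handle and cursor are illustrative, not copied from the patch):

# The cursor session must not use an index, index-only, or bitmap scan;
# otherwise it keeps a pin on a btree leaf page and the first round of
# index vacuuming blocks waiting for a cleanup lock on that page.
my $cursor_session = $node_primary->background_psql('test_db');
$cursor_session->query_safe(qq[
	SET enable_indexscan = off;
	SET enable_indexonlyscan = off;
	SET enable_bitmapscan = off;
	BEGIN;
	DECLARE c1 CURSOR FOR SELECT * FROM vac_horizon_floor_table;
	FETCH 1 FROM c1;
]);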

I've attached the updated patch, which passes for me on 32- and 64-bit
builds. We've managed to reduce the row count so low (1000-2000 rows) that
I'm not sure it still matters whether we keep separate 64-bit and 32-bit
cases. However, since we have the large block comment about the required
number of rows, I figured we might as well keep the two different nrows
values.
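
The 32-bit/64-bit split itself is cheap to express with Perl's Config
module; here's the shape of it, with placeholder row counts rather than the
values in the patch:

use Config;

# Pick the row count per pointer size; the numbers here are placeholders,
# not the ones from the attached patch.
my $nrows = $Config{ptrsize} == 8 ? 2000 : 1000;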

I'll have to do some more research on 14-16 to see if this could be a
problem there.

I also disabled prefetching, concurrent IO, and read combining for vacuum
-- they didn't cause a problem in my local tests, but I could see them
interfering with the test and potentially causing flakes or failures on
some machines/configurations. That means I'll have to do a slightly
different patch for 17 than for 18 (17 doesn't have io_combine_limit).
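
Assuming the knobs in play are maintenance_io_concurrency and
io_combine_limit (my reading of which GUCs cover prefetching and read
combining -- the patch itself is authoritative), the configuration is
something like:

# Disable read-ahead and read combining for vacuum so its I/O pattern
# stays predictable for the test; io_combine_limit is only set on the
# branches that have it.
$node_primary->append_conf('postgresql.conf', qq[
maintenance_io_concurrency = 0
io_combine_limit = 1
]);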

Finally, I disabled parallelism as future-proofing against heap vacuum
gaining parallelism -- I wouldn't want a mysterious failure in this test in
the future.
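
That part can be as simple as pinning the worker count in the VACUUM
invocation itself -- something along these lines (table and node names as
above; the actual VACUUM options come from the patch):

# Explicitly request zero parallel workers so a future parallel heap
# vacuum can't change this test's behavior.
$node_primary->safe_psql('test_db',
	'VACUUM (PARALLEL 0) vac_horizon_floor_table;');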

- Melanie

Attachment Content-Type Size
v2-0001-Test-that-vacuum-removes-tuples-older-than-Oldest.patch text/x-patch 13.9 KB
