Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date: 2026-03-06 23:33:10
Message-ID: CAAKRu_a+hO4PCptyaPR7AMZd7FjcHfOFKKJT8ouU3KedMud0tQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the review! Attached is v36. I've pushed some of the early
patches in the set and this is what is left. I've also done some of
the performance evaluation and microbenchmarking of the "worst case"
scenario promised in my earlier reply to Andres [1].

I used the following to test the worst-case performance of my patch:

pgbench -n -r -t 9 -f - <<'SQL'
checkpoint;
DROP TABLE IF EXISTS foo;
CREATE TABLE foo(a int, cnt int, data text) WITH (autovacuum_enabled =
false, fillfactor = 10);
ALTER TABLE foo ALTER COLUMN data SET STORAGE PLAIN;
CREATE INDEX ON foo(a);
INSERT INTO foo
SELECT i, 1, repeat(' ', 8192/10)
FROM generate_series(1,100000) i;
vacuum (freeze) foo;
update foo set cnt = cnt + 1;
select * from foo offset 100000000;
update foo set cnt = cnt + 1;
SQL

What I see is an expected slowdown for the SELECT * FROM foo OFFSET --
because it emits slightly more WAL and pins and dirties a few more
buffers. And a slight slowdown for the UPDATE following the SELECT
because it then must clear those VM bits. (This is no different than
if you had run a vacuum before doing the update).

These slowdowns are expected since this microbenchmark is designed to
be a worst case. Every buffer has a single tuple and the SELECT needs
to access no tuples because of the OFFSET. This minimizes all other
overheads to magnify the overhead of setting and clearing the VM.

I also tested if unconditionally pinning the VM even when we don't set
it had any impact on performance of on-access pruning for logged
tables. I used the setup above but patched the code to not set the VM
on-access. I found that there is no negative performance impact to the
SELECT * OFFSET. If foo is an unlogged table I do see a very slight
overhead for the SELECT * OFFSET.

And in all cases, with the patch, the vacuum above is faster because
of using the combined WAL record.

I believe I've addressed all of your review feedback. Below are
combined inline remarks to all three of your emails:

On Wed, Mar 4, 2026 at 4:00 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> * Caller must have pin on the buffer, and must *not* have a lock on it.
> */
> void
> -heap_page_prune_opt(Relation relation, Buffer buffer)
> +heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
> ```
>
> I don’t see why vmbuffer has to be of pointer type. Buffer type is underlying int, I checked the last commit, vmbuffer only passes in data into the function without passing out anything.

We want to save the vmbuffer in the scan descriptor so we can use it
across calls to heap_page_prune_opt(). Therefore we have to pass it by
reference. We pin the VM in heap_page_prune_opt() and if we don't save
a reference to it, we'll have to pin it again on the next call (see
visibilitymap_pin() code).

> As we add the new parameter vmbuffer, though it’s not used in this commit, I think it’d be better to update the header commit to explain what this parameter will do.

Thanks, I've updated the header comment.

> + * heap_fix_vm_corruption() makes changes to the VM and, potentially, the heap
> + * page, but it does not need to be done in a critical section because
> + * clearing the VM is not WAL-logged.
> + */
> +static void
> +heap_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum)
>
> Nit: why the last paragraph of the header comments uses the function name instead of “this function”? Looks like a copy-pasto.

Fixed.

> 6 - 0006
> ```
> + if (prstate->lpdead_items > 0)
> + {
> + ereport(WARNING,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("LP_DEAD item found on page marked as all-visible"),
> + errdetail("relation \"%s\", page %u, tuple %u",
> + RelationGetRelationName(prstate->relation),
> + prstate->block, offnum)));
> + }
> + else
> + {
> + ereport(WARNING,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("tuple not visible to all found on page marked as all-visible"),
> + errdetail("relation \"%s\", page %u, tuple %u",
> + RelationGetRelationName(prstate->relation),
> + prstate->block, offnum)));
> + }
> ```
>
> I recently just learned that a detail message should use complete sentences, and end each with a period, and capitalize the first word of sentences. See https://www.postgresql.org/docs/current/error-style-guide.html.

Ah thanks for noticing. I've gone ahead and changed them to errcontext
instead of errdetail. I think the messages are more compliant now.

> + else if (prstate->vmbits & VISIBILITYMAP_VALID_BITS)
> + {
> + /*
> + * As of PostgreSQL 9.2, the visibility map bit should never be set if
> + * the page-level bit is clear. However, it's possible that the bit
> + * got cleared after heap_vac_scan_next_block() was called, so we must
> + * recheck with buffer lock before concluding that the VM is corrupt.
> + */
> + ereport(WARNING,
> + (errcode(ERRCODE_DATA_CORRUPTED),
> + errmsg("page %u in \"%s\" is not marked all-visible but visibility map bit is set",
> + prstate->block,
> + RelationGetRelationName(prstate->relation))));
> + }
>
> The comment says “we must recheck with buffer lock before…”, but it only log a warning message. Is the comment stale?

We have the buffer lock here. The comment means that we need to check
now -- a time when we have the buffer lock because when we checked in
heap_vac_scan_next_block() we did not have the buffer lock. I've
updated the comment to try to make that more clear.

> +static void
> +heap_page_bypass_prune_freeze(PruneState *prstate, PruneFreezeResult *presult)
> +{
>
> * Given this function has done PageIsEmpty(page), that that is true, we don’t need to count live_tuples, right? That could be a tiny optimization.

Okay, I've tried this. I didn't want a lot of indentation, so I
reorganized the code. I'm not sure if it is more error-prone now,
though...

> * I see heap_page_bypass_prune_freeze() is only called in one place and immediately after prune_freeze_setup() and heap_fix_vm_corruption(), so prstate->vmbits must be 0, so do we need to do presult->vmbits = prstate->vmbits;?

Actually vmbits can't be zero, otherwise we won't reach the fast path
code. Or do you mean something else?

> * Do we need to set all_visible and all_frozen to presult?

I memset to 0 the other fields, so it isn't needed.

On Thu, Mar 5, 2026 at 3:53 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> 9 - 0009
> ···
> + * Currently, only VACUUM performs freezing, but other callers may in the
> + * future. Other callers must initialize prstate.all_frozen to false,
> ···
>
> Nit: prstate.all_frozen -> prstate.set_all_frozen
>
> I saw you have fixed this in 0010, but I think it’s better also fix it here.

Done.

> 10 - 0010
> ```
> + * Whether or not the page was newly set all-visible and all-frozen during
> + * phase I of vacuuming.
> */
> - uint8 vmbits;
> + BlockNumber new_all_visible_pages;
> + BlockNumber new_all_visible_frozen_pages;
> + BlockNumber new_all_frozen_pages;
> ```
>
> These 3 fields are actually counts rather than pointers to blocks, using type BlockNumber are quite confusing, though underlying BlockNumber is uint32. I think they can be just int type.

Covered in [2].

> + BlockNumber new_all_visible_pages;
> + BlockNumber new_all_visible_frozen_pages;
> + BlockNumber new_all_frozen_pages;
>
> I don’t see where these 3 fields are initialized. In lazy_scan_prune(), presult is defined as:
> PruneFreezeResult presult;
> So, those fields will hold random values.

Yes, thank you. I've fixed that.

> 13 - 0012 - bufmask.c
> ```
> + * we don't mark the page all-visible. See heap_xlog_prune_and_freeze()
> + * for more details.
> ```
>
> I don’t find a function named heap_xlog_prune_and_freeze().

Fixed in both places (-> heap_xlog_prune_freeze()).

On Thu, Mar 5, 2026 at 9:41 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> 15 - 0014 - execMain.c
> ```
> @@ -3027,6 +3035,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
> rcestate->es_range_table_size = parentestate->es_range_table_size;
> rcestate->es_relations = parentestate->es_relations;
> rcestate->es_rowmarks = parentestate->es_rowmarks;
> + rcestate->es_modified_relids = parentestate->es_modified_relids;
> ```
>
> Here it just assigns the BMS pointer to rcestate->es_modified_relids. I am not sure if further bms_add_member() will still happen, if yes, it might be safer to do bms_copy(parentestate->es_modified_relids), because a further bms_add_member() may cause a new memory allocated and the old pointer stale.

Yes, it's at least a bit of future proofing. Done in v36.

> 16 - 0014 - execUtils.c
> for (rti = 1; rti <= estate->es_range_table_size; rti++)
>
> Nit: I have seen several recent commits that performed cleanups to switch to use for loop var like:
> for (Index rti = 1; rti <= estate->es_range_table_size; rti++)

Updated.

> 17 - 0015
>
> The commit message subject line says “Make begin_scan() functions take a flags argument”, where begin_scan() seems inaccurate, for example, table_index_fetch_begin() is not “begin scan”.
>
> Otherwise 0015 LGTM.

I've rewritten the commit message.

> 20 - 0018
> In heap_insert(), you do:
> + if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN))
> + PageSetPrunable(page, xid);
>
> But in heap_multi_insert(), you do:
> + if (!all_frozen_set && TransactionIdIsNormal(xid))
> + PageSetPrunable(page, xid);
>
> Is the option check " !(options & HEAP_INSERT_FROZEN))” also needed by heap_multi_insert?

heap_multi_insert() incorporates that into the variable
all_frozen_set, so it is not needed.

I've now also added setting prune hint for the new page on updates --
which I forgot before.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_a1V7TUUYM7qO2c5Z-JyTKOsrryQBrk7Eu69ESzhqgd9w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BFpmFdrM%3DL5f%3De7%2BwqOkFkYK6r_S%3DTdKrHQ5qPbTNaoVG9PUA%40mail.gmail.com#e3420a65f192d59b72a33e56a8c2a50f

Attachment Content-Type Size
v36-0001-Use-the-newest-to-be-frozen-xid-as-the-conflict-.patch text/x-patch 6.5 KB
v36-0002-Save-vmbuffer-in-heap-specific-scan-descriptors-.patch text/x-patch 6.2 KB
v36-0003-Fix-visibility-map-corruption-in-more-cases.patch text/x-patch 18.6 KB
v36-0004-Add-pruning-fast-path-for-all-visible-and-all-fr.patch text/x-patch 4.5 KB
v36-0005-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch text/x-patch 11.4 KB
v36-0006-Keep-newest-live-XID-up-to-date-even-if-page-not.patch text/x-patch 14.8 KB
v36-0007-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch text/x-patch 27.1 KB
v36-0008-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch text/x-patch 2.6 KB
v36-0009-Remove-XLOG_HEAP2_VISIBLE-entirely.patch text/x-patch 25.0 KB
v36-0010-Initialize-missing-fields-in-CreateExecutorState.patch text/x-patch 1.0 KB
v36-0011-Track-which-relations-are-modified-by-a-query.patch text/x-patch 5.5 KB
v36-0012-Thread-flags-through-begin-scan-APIs.patch text/x-patch 21.5 KB
v36-0013-Pass-down-information-on-table-modification-to-s.patch text/x-patch 8.0 KB
v36-0014-Allow-on-access-pruning-to-set-pages-all-visible.patch text/x-patch 9.9 KB
v36-0015-Avoid-BufferGetPage-calls-in-heap_update.patch text/x-patch 5.6 KB
v36-0016-Set-pd_prune_xid-on-insert.patch text/x-patch 10.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2026-03-07 00:01:03 Re: Don't use the deprecated and insecure PQcancel in our frontend tools anymore
Previous Message KAZAR Ayoub 2026-03-06 23:31:19 Re: Speed up COPY FROM text/CSV parsing using SIMD