Re: Expanding HOT updates for expression and partial indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Greg Burd <greg(at)burd(dot)me>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Expanding HOT updates for expression and partial indexes
Date: 2025-11-21 15:25:06
Message-ID: CAEze2WgMTBBeRf5y9JU9jtPK=oHRn7uW2JQWWp4OkFyJ58hkag@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 19 Nov 2025 at 19:00, Greg Burd <greg(at)burd(dot)me> wrote:
>
> Attached are rebased (d5b4f3a6d4e) patches with the only changes
> happening in the last patch in the series.

Here's a high-level review of the patchset, extending on what I shared
offline. I haven't looked too closely at the code changes.

Re: Perf testing

Apart from the workload we've discussed offline, there's another
workload to consider: Right now, we only really consider HOT when we
know there's space on the page. This patch, however, will front-load a
lot more checks before we have access to the page, and that will
~always impact update performance.
I'm a bit worried that the cost of off-page updates (when the page of
the old tuple can't fit the new tuple) will be too significantly
increased, especially considering that we have a default fillfactor of
100 -- if a table's tuples only grow, it's quite likely the table
frequently can't apply HOT regardless of the updated columns. So, a
workload that's tuned to only update tuples in a way that excercises
'can we HOT or not' code for already-full pages would be appreciated.
A solution to the issue might be needed if we lose too much
performance on expensive checks; a solution like passing page space of
the old tuple to the checks, and short-circuiting the non-HOT path if
that page space is too small for the new tuple.

0001:

I'm not sure I understand why the code is near completely duplicated
here. Maybe you can rename the resulting heap_update to
heap_update_ext, and keep a heap_update around which wraps this
heap_update_ext, allowing old callers to keep their signature until
they need to use _ext's features? Then you can introduce the
duplication if and when needed in later patches -- though I don't
expect a lot of this duplication to be strictly necessary, though that
may need some new helper functions.

I also see that for every update we're now copying, passing 5
bitmapsets individually and then freeing those bitmaps just a moment
later. I'd like to avoid that overhead and duplication if possible.
Maybe we can store these in an 'update context' struct, passed by
reference down to table_tuple_update() from the calling code, and then
onward to heap_update? That might then also be a prime candidate to
contain the EState * of ExecCheckIndexedAttrsForChanges.

0002:

This patch seems to have some formatting updates to changes you made
in 0001, without actually changing the code (e.g. at heap_update's
definition). When updating code, please put it in the expected
formatting in the same patch.

---
In the patch subject:
> For instance,
> indexes with collation information allowing more HOT updates when the
> index is specified to be case insensitive.

It is incorrect to assume that indexed "btree-equal" datums allow HOT
updates. The user can not be returned an old datum in index-only
scans, even if it's sorted the same as the new datum -- after all, a
function on the datum may return different results even if the datums
are otherwise equal. Think: count_uppercase(string). See also below at
"HOT, datum compare, etc".

---
With the addition of rd_indexedattr we now have 6 bitmaps in
RelationData, which generally only get accessed through
RelationGetIndexAttrBitmap by an enum value. Maybe it's now time to
bite the bullet and change that to a more general approach with
`Bitmapset *rd_bitmaps[NUM_INDEX_ATTR_BITMAP]`? That way, the hot
path in RelationGetIndexAttrBitmap would not depend on the compiler to
determine that the fast path can do simple offset arithmatic to get
the requested bitmap.

0003:
This looks like it's a cleanup patch for 0002, and doesn't have much
standing on its own. Maybe the changes for ri_ChangedIndexedCols can
all be moved into 0003? I think that gives this patch more weight and
standing.

0004:
This does two things:
1. Add index expression evaluation to the toolset to determine which
indexes were unchanged, and
2. Allow index access methods to say "this value has not changed"
even if the datum itself may have changed.

Could that be split up into two different patches?

(aside: 2 makes a lot of sense in some cases, like trgm indexes
strings if no new trigrams are added/removed, so I really like the
idea behind this change)

HOT, datum compare, etc.:

Note that for index-only scans on an index to return correct results,
you _must_ update the index (and thus, do a non-HOT update) whenever a
value changes its binary datum, even if the value has the same btree
sort location as the old value. Even for non-IOS-supporting indexes,
the index may need more information than what's used in btree
comparisons when it has a btree opclass.
As SP/GIST have IOS support, they also need to compare the image of
the datum and not use ordinary equality as defined in nbtree's compare
function: the value must be exactly equal to what the table AM
would've provided.

Primary example: Btree compares the `numeric` datums of `1.0` and
`1.00` as equal, but for a user there is an observable difference; the
following SQL must return `1.0` in every valid plan:

BEGIN;
INSERT INTO mytab (mynumeric) VALUES ('1.00');
UPDATE mytab SET mynumeric = '1.0';
SELECT mynumeric FROM mytab;

So, IMO, the default datum compare in ExecCheckIndexedAttrsForChanges
and friends should just use datumIsEqual, and not this new
tts_attr_equal.
Indexes without IOS support might be able to opt into using a more lax
datum comparator, but 1.) it should never be the default, as it'd be a
loaded footgun for IndexAM implementers, and 2.) should not depend on
another AM's understanding of attributes, as that is a very leaky
abstraction.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-11-21 15:35:16 Re: more C99 cleanup
Previous Message Andrew Dunstan 2025-11-21 14:48:53 Re: Speed up COPY FROM text/CSV parsing using SIMD