| From: | Greg Burd <greg(at)burd(dot)me> |
|---|---|
| To: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Expanding HOT updates for expression and partial indexes |
| Date: | 2025-11-22 21:30:39 |
| Message-ID: | AECB4D3E-5D35-48A9-8633-CAF4837C2056@greg.burd.me |
| Lists: | pgsql-hackers |
On Nov 21 2025, at 10:25 am, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> On Wed, 19 Nov 2025 at 19:00, Greg Burd <greg(at)burd(dot)me> wrote:
>>
>> Attached are rebased (d5b4f3a6d4e) patches with the only changes
>> happening in the last patch in the series.
>
> Here's a high-level review of the patchset, extending on what I shared
> offline. I haven't looked too closely at the code changes.
Matthias, thanks for spending a bit of time writing up your thoughts and
for chatting a bit with me before doing so. I really appreciate your
point of view.
> Re: Perf testing
>
> Apart from the workload we've discussed offline, there's another
> workload to consider: Right now, we only really consider HOT when we
> know there's space on the page.
Yes, and no. In master, every UPDATE triggers a call into
HeapDetermineColumnsInfo() from heap_update(). That function's job is to
examine all the indexed attributes, checking each one for changes.
The set of indexed attributes is built by combining several
Bitmapsets fetched (copied) from the relcache:
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_HOT_BLOCKING);
sum_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_SUMMARIZED);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
id_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_IDENTITY_KEY);
interesting_attrs = NULL;
interesting_attrs = bms_add_members(interesting_attrs, hot_attrs);
interesting_attrs = bms_add_members(interesting_attrs, sum_attrs);
interesting_attrs = bms_add_members(interesting_attrs, key_attrs);
interesting_attrs = bms_add_members(interesting_attrs, id_attrs);
These "interesting attributes" are each checked comparing the newly
formed updated HeapTuple passed in from the executor against a HeapTuple
read from the page. This happens after taking the buffer lock on that
page. The comparison is done with heap_attr_compare() which calls
datumIsEqual() which boils down to a memcmp(). The function returns the
"modified_attrs".
That per-attribute comparison is the primary "work" that happens before
considering HOT, and it's what I've moved outside the buffer lock and into
the executor. The rest of the HOT decision is 1) will the new tuple fit
(possibly after being TOASTed), and 2) do the HOT-blocking attributes overlap
with the modified_attrs? If they do, you can't go HOT unless the changed
attributes belong only to summarizing indexes.
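Roughly, the remainder boils down to something like this (paraphrased, not
the actual heap_update() code; in reality the fit check happens under the
buffer lock):

    if (!bms_overlap(modified_attrs, hot_attrs) && new_tuple_fits_on_old_page)
        use_hot_update = true;

    /* summarizing indexes (e.g. BRIN) don't block HOT; changes to their
     * columns are tracked separately */
    if (bms_overlap(modified_attrs, sum_attrs))
        summarized_update = true;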
The net new work with my approach is only related to the more complicated
checks that have to be done when there are expression indexes, partial
indexes, or an index that implements the new index AM API for comparing
datums. In those cases there is a bit more work, especially for expression
indexes. Evaluating a partial index predicate twice rather than once is a
bit more overhead. Using a type-specific comparison when the index doesn't
support index-only scans is a tad more. Overall, not a whole lot of new
work. It is certainly a more complex code path, and maybe that'll show up
in performance tests; I don't know yet.
> This patch, however, will front-load a
> lot more checks before we have access to the page, and that will
> ~always impact update performance.
Disagree: 90% of that work happens today on that same path, after the
page lock. Even in the case where the HeapTuple can't fit into the page,
that work still happens, so there's not much net new "wasted" effort.
Concurrency under load may actually be better because buffer locks will be
held for less time.
> I'm a bit worried that the cost of off-page updates (when the page of
> the old tuple can't fit the new tuple) will be too significantly
> increased, especially considering that we have a default fillfactor of
> 100 -- if a table's tuples only grow, it's quite likely the table
> frequently can't apply HOT regardless of the updated columns. So, a
> workload that's tuned to only update tuples in a way that exercises
> 'can we HOT or not' code for already-full pages would be appreciated.
> A solution to the issue might be needed if we lose too much
> performance on expensive checks; a solution like passing page space of
> the old tuple to the checks, and short-circuiting the non-HOT path if
> that page space is too small for the new tuple.
Here's the problem: there's not much that can be known about the ultimate
size the HeapTuple will take, or whether the page can hold it, until after
the lock is taken.
Second, I feel that this signal would be specific to the heap, its
particular MVCC implementation, and its optimization (HOT). I really
wanted this solution to be non-heap-specific, but heap-enabling.
For me that meant that, in the general case, the executor should be
concerned with updating only the set of indexes that it must and no
more. So performing this work in the executor, ahead of calling into the
table AM or index AM, makes sense.
It is only due to heap's "special" model that we have other concerns
related to HOT. Sure, everyone/everything today uses heap, so we should
pay attention to this, but I set out not to create yet another thing
that depends on heap's specific operational model. I think I've achieved that.
> 0001:
>
> I'm not sure I understand why the code is near completely duplicated
> here. Maybe you can rename the resulting heap_update to
> heap_update_ext, and keep a heap_update around which wraps this
> heap_update_ext, allowing old callers to keep their signature until
> they need to use _ext's features? Then you can introduce the
> duplication if and when needed in later patches -- though I don't
> expect a lot of this duplication to be strictly necessary, though that
> may need some new helper functions.
This is just a split where I move the top portion of heap_update() into
the two paths that use it. Sure, I could have pulled that into another
function, but in this series the next step is to obliterate one half, and
then (if my other patch in cf-6221 gets in) I can completely remove
HeapDetermineColumnsInfo() and vastly simplify simple_heap_update().
> I also see that for every update we're now copying, passing 5
> bitmapsets individually and then freeing those bitmaps just a moment
> later. I'd like to avoid that overhead and duplication if possible.
We do this today; nothing new here. They are passed into heap_update() by
reference, not by value.
> Maybe we can store these in an 'update context' struct, passed by
> reference down to table_tuple_update() from the calling code, and then
> onward to heap_update? That might then also be a prime candidate to
> contain the EState * of ExecCheckIndexedAttrsForChanges.
I've explored using UpdateContext before; not a bad idea, but again this
is just a setup commit. It could be cleaner on its own, but it doesn't
really take a step backward on any dimension.
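For reference, something along these lines is what I understand the
suggestion to be (hypothetical sketch, the names are made up and this isn't
code from any of the patches):

    /* hypothetical 'update context' bundling the bitmapsets (and possibly
     * the EState) so they can be passed by a single pointer down through
     * table_tuple_update() to heap_update() */
    typedef struct TupleUpdateContext
    {
        Bitmapset  *hot_attrs;
        Bitmapset  *sum_attrs;
        Bitmapset  *key_attrs;
        Bitmapset  *id_attrs;
        Bitmapset  *modified_attrs;
        EState     *estate;     /* for ExecCheckIndexedAttrsForChanges */
    } TupleUpdateContext;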
> 0002:
>
> This patch seems to have some formatting updates to changes you made
> in 0001, without actually changing the code (e.g. at heap_update's
> definition). When updating code, please put it in the expected
> formatting in the same patch.
I'll find/fix those after v23 attached, sorry for the noise.
> ---
> In the patch subject:
>> For instance,
>> indexes with collation information allowing more HOT updates when the
>> index is specified to be case insensitive.
>
> It is incorrect to assume that indexed "btree-equal" datums allow HOT
> updates. The user can not be returned an old datum in index-only
> scans, even if it's sorted the same as the new datum -- after all, a
> function on the datum may return different results even if the datums
> are otherwise equal. Think: count_uppercase(string). See also below at
> "HOT, datum compare, etc".
Thanks for pointing out the oversight for index-only scans (IOS);
you're right that the code in v22 doesn't handle that correctly. I'll
fix that. I still think that indexes that don't support IOS can and
should use the type-specific equality checks. This opens the door to
HOT with custom types that have unusual equality rules (see BSON).
> With the addition of rd_indexedattr we now have 6 bitmaps in
> RelationData, which generally only get accessed through
> RelationGetIndexAttrBitmap by an enum value. Maybe it's now time to
> bite the bullet and change that to a more general approach with
> `Bitmapset *rd_bitmaps[NUM_INDEX_ATTR_BITMAP]`? That way, the hot
> path in RelationGetIndexAttrBitmap would not depend on the compiler to
> determine that the fast path can do simple offset arithmetic to get
> the requested bitmap.
More than a few of those bitmaps in RelationData are purely for the HOT
tests, yes. It had occurred to me too; I'll review them, see if I can
shrink the set meaningfully, and consider ways to consolidate what's left.
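Something like what you're describing, I think (sketch only; the field and
macro names here are invented):

    /* in RelationData, replacing the individual rd_*attr Bitmapset fields */
    Bitmapset  *rd_attrbitmaps[INDEX_ATTR_BITMAP_NUM];  /* by IndexAttrBitmapKind */

    /* then the fast path in RelationGetIndexAttrBitmap() becomes simple
     * offset arithmetic into the array */
    if (relation->rd_attrbitmapsvalid)
        return bms_copy(relation->rd_attrbitmaps[attrKind]);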
> 0003:
> This looks like it's a cleanup patch for 0002, and doesn't have much
> standing on its own. Maybe the changes for ri_ChangedIndexedCols can
> all be moved into 0003? I think that gives this patch more weight and
> standing.
The goal of this patch was to show that we could eliminate the redundant
set work of index_unchanged_by_update() in execIndexing's update path.
Separating it out was done to make it easier to review and to prove
that before/after tests passed, making the change safe.
> 0004:
> This does two things:
> 1. Add index expression evaluation to the toolset to determine which
> indexes were unchanged, and
> 2. Allow index access methods to say "this value has not changed"
> even if the datum itself may have changed.
Yes, that's what it does. :)
> Could that be split up into two different patches?
Maybe; I might be able to add the index AM piece first and then the
expression piece.
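If it helps to picture the index AM piece, it's roughly a callback along
these lines (the signature here is illustrative, not necessarily what 0004
actually defines; the name matches the amcomparedatums() mentioned further
down):

    /* returns true if, from this index AM's point of view, the new datum
     * does not require an index entry update even though the bytes differ */
    typedef bool (*amcomparedatums_function) (Relation indexRelation,
                                              int attno,
                                              Datum old_datum, bool old_isnull,
                                              Datum new_datum, bool new_isnull);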
> (aside: 2 makes a lot of sense in some cases, like trgm indexes
> strings if no new trigrams are added/removed, so I really like the
> idea behind this change).
Nice, I appreciate that.
> HOT, datum compare, etc.:
>
> Note that for index-only scans on an index to return correct results,
> you _must_ update the index (and thus, do a non-HOT update) whenever a
> value changes its binary datum, even if the value has the same btree
> sort location as the old value.
Yes, I get this now. I also wasn't testing INCLUDE (non-key)
columns. I've fixed both of those in v23.
> Even for non-IOS-supporting indexes,
> the index may need more information than what's used in btree
> comparisons when it has a btree opclass.
It's not always just the btree opclass for equality, but that is common.
> As SP/GIST have IOS support, they also need to compare the image of
> the datum and not use ordinary equality as defined in nbtree's compare
> function: the value must be exactly equal to what the table AM
> would've provided.
In v23 I've changed the logic to use datumIsEqual() for any index that
supports IOS and doesn't supply a custom amcomparedatums() function.
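To restate that rule as pseudo-code (the helper names here are placeholders,
not what's in the patch):

    if (index_has_comparedatums_callback)
        equal = amcomparedatums(...);       /* the AM decides */
    else if (index_supports_index_only_scans)
        equal = datumIsEqual(oldval, newval, attbyval, attlen);  /* byte equality */
    else
        equal = type_specific_equality(oldval, newval);  /* e.g. opclass equality */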
> Primary example: Btree compares the `numeric` datums of `1.0` and
> `1.00` as equal, but for a user there is an observable difference; the
> following SQL must return `1.0` in every valid plan:
>
> BEGIN;
> INSERT INTO mytab (mynumeric) VALUES ('1.00');
> UPDATE mytab SET mynumeric = '1.0';
> SELECT mynumeric FROM mytab;
I'll try to reproduce this and add a test if I can.
> So, IMO, the default datum compare in ExecCheckIndexedAttrsForChanges
> and friends should just use datumIsEqual, and not this new
> tts_attr_equal.
Sure, and that's the case in v23. tts_attr_equal() still has value in
other cases so it's not gone.
> Indexes without IOS support might be able to opt into using a more lax
> datum comparator, but 1.) it should never be the default, as it'd be a
> loaded footgun for IndexAM implementers, and 2.) should not depend on
> another AM's understanding of attributes, as that is a very leaky
> abstraction.
Agree, which is why the default for IOS indexes is datumIsEqual() now.
> Kind regards,
>
> Matthias van de Meent
> Databricks (https://www.databricks.com)
Thanks again for the time and renewed interest in the patch! I've also
added a lot more tests to the heap_hot_updates.sql regression suite,
likely too many, but for now it's good to be testing the corners.
v23 attached; changes are all in 0004.
Best,
-greg
| Attachment | Content-Type | Size |
|---|---|---|
| v23-0001-Reorganize-heap-update-logic.patch | application/octet-stream | 47.6 KB |
| v23-0002-Track-changed-indexed-columns-in-the-executor-du.patch | application/octet-stream | 34.3 KB |
| v23-0003-Replace-index_unchanged_by_update-with-ri_Change.patch | application/octet-stream | 8.3 KB |
| v23-0004-Enable-HOT-updates-for-expression-and-partial-in.patch | application/octet-stream | 191.5 KB |