Re: New access method for b-tree.

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Alexandre Felipe <alexandre(dot)felipe(at)tpro(dot)io>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New access method for b-tree.
Date: 2026-02-03 22:25:36
Message-ID: b0c06e87-1c0a-43d6-8432-5704fbf131ab@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/3/26 17:01, Matthias van de Meent wrote:
> On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> Hello Felipe,
>>
>> On 2/1/26 11:02, Alexandre Felipe wrote:
>>> Hello Hackers,
>>>
>>> Please check this out,
>>>
>>> It is an access method to scan a table sorting by the second column of
>>> an index, filtered on the first.
>>> Queries like
>>> SELECT x, y FROM grid
>>> WHERE x in (array of Nx elements)
>>> ORDER BY y, x
>>> LIMIT M
>>>
>>> Can execute streaming the rows directly from disk instead of loading
>>> everything.
>
> +1 for the idea, it does sound interesting. I haven't looked in depth
> at the patch, so no comments on the execution yet.
>
>> So how does this compare to skip scan in practice? It's hard to compare,
>> as the patch does not implement an actual access path, but I tried this:
> [...]
>> 1) master
>>
>> SELECT * FROM merge_test_int
>> WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
> [...]
>> 2) merge scan
>>
>> SELECT * FROM test_btree_merge_scan_int(
> [...]
>> ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
> [...]
>> And with explain analyze we get this:
>>
>> 1) Buffers: shared hit=26 read=25
>> 2) Buffers: shared hit=143 read=17
>
> (FYI; your first query was missing "2" from it's IN list while it was
> present in the merge scan input; this makes the difference worse by a
> few pages)
>
>> So it seems to access many more buffers, even if the number of reads is
>> lower. Presumably the merge scan is not always better than skip scan,
>> probably depending on number of prefixes in the query etc. What is the
>> cost model to decide between those two?
>
> Skip scan always returns data in index order, while this merge scan
> would return tuples a suffix order. The cost model would thus weigh
> the cost of sorting the result of an index skipscan against the cost
> of doing a merge join on n_in_list_items distinct (partial) index
> scans.
>

Makes sense.

> As for when you would benefit in buffers accessed: The merge scan
> would mainly benefit in number of buffers accessed when the selected
> prefix values are non-sequential, and the prefixes cover multiple
> pages at a time, and when there is a LIMIT clause on the scan. Normal
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjecent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
>

Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?

> (*) It is theoretically possible to reuse an earlier index descent if
> the SAOP entry's key range of the last descent starts and ends on the
> leaf page that the next SAOP entry's key range also starts on
> (applying the ideas of 5bf748b86b to this new multi-headed index scan
> mode), but that infrastructure doesn't seem to be in place in the
> current patch. That commit is also why your buffer access count for
> master is so low compared to the merge scan's; if your chosen list of
> numbers was multiples of 5 (so that matching tuples are not all
> sequential) you'd probably see much more comparable buffer access
> counts.
>
>> If you had to construct the best case and worst cases (vs. skip scan),
>> what would that look like?
>
> Presumably the best case would be:
>
> -- mytable.a has very few distinct values (e.g. bool or enum);
> mytable.b many distinct values (e.g. uuid)
> SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
>
> which the index's merge scan would turn into an index scan that
> behaves similar to the following, possibly with the merge join pushed
> down into the index:
>
> SELECT * FROM (
> SELECT ... FROM mytable WHERE a = 1
> UNION
> SELECT ... FROM mytable WHERE a = 2
> ) ORDER BY b.
>
>
> The worst case would be the opposite:
>
> -- mytable.a has many distinct values (random uuid); mytable.b few
> (e.g. boolean; enum)
> SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b
>
> As the merge scan maintains one internal indexscan head per SAOP array
> element, it'd have significant in-memory and scan startup overhead,
> while few values are produced for each of those scan heads.
>

OK. It'll be interesting to see how this performs in practice for the
whole gamut between the best and worst case.

>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>
> I'm not sure, but it seems like it might be common/useful in
> queue-like access patterns:
>
> With an index on (state, updated_timestamp) you're probably interested
> in all messages in just a subset of states, ordered by recent state
> transitions. An index on (updated_timestamp, state) might be
> considered more optimal, but won't be able to efficiently serve
> queries that only want data on uncommon states: The leaf pages would
> mainly contain data on common states, reducing the value of those leaf
> pages.
>
> Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
> using UNION, or add an index for each percievable IN list, but it'd be
> great if the user didn't have to rewrite their query or create
> n_combinations indexes with their respective space usage to get this
> more efficient query execution.
>

I think the examples presented by Ants (with timeline view) are quite
plausible in practice.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2026-02-03 22:29:46 Re: Decoupling our alignment assumptions about int64 and double
Previous Message Mark Hill 2026-02-03 22:23:03 PostgreSQL 19 on AIX?