Re: New access method for b-tree.

From: Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, michael(at)paquier(dot)xyz, tgl(at)sss(dot)pgh(dot)pa(dot)us, "peter(at)eisentraut(dot)org" <peter(at)eisentraut(dot)org>
Cc: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <alexandre(dot)felipe(at)tpro(dot)io>, Michał Kłeczek <michal(at)kleczek(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: New access method for b-tree.
Date: 2026-03-17 12:37:47
Message-ID: CAE8JnxM06S_B+2N-VU7LqwMhbf4pp94nnUTB=whZ2ou56++zVg@mail.gmail.com
Lists: pgsql-hackers

Happy St. Patrick's Day!

Based on what I said in previous emails, I see three alternative
proposals:

#1 Make it simpler by not changing the index access methods.

#2 Make it optimal in some sense by not using generic index searches
and not keeping multiple open index scans.

and

#3 Follow the pragmatic approach: the objective is to minimize the
number of heap fetches, staying as high level as possible and reusing
existing functions instead of writing custom code where possible.

Ants Aasma & Tomas Vondra wrote:

> > The workarounds I have proposed to users have been either to rewrite the
> > query as a UNION ALL of a set of single value prefix queries wrapped
> > in an order by limit. This gives the exact needed merge append plan
> > shape. But repeating the query N times can get unwieldy when the
> > number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> > SELECT * FROM posts
> > WHERE user_id = id
> > ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >

> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>

An issue with this query is generality: if it is joined with other
relations, we cannot determine the limit in advance.
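
To illustrate with a sketch (a hypothetical users table and is_active
column on top of the posts/:friends example quoted above, not from the
patch): once something above the per-prefix subqueries can discard
rows, no inner LIMIT can be chosen safely.

```sql
-- Hedged sketch: posts and :friends as in the quoted query, plus an
-- assumed users(id, is_active) table. The join filter may discard
-- posts, so we cannot know in advance how many rows each per-friend
-- scan must produce to satisfy the outer LIMIT.
SELECT p.*
FROM unnest(:friends) AS f(id)
CROSS JOIN LATERAL (
    SELECT * FROM posts
    WHERE user_id = f.id
    ORDER BY tstamp DESC
    LIMIT 100              -- possibly too small once rows are dropped
) p
JOIN users u ON u.id = p.user_id AND u.is_active
ORDER BY p.tstamp DESC
LIMIT 100;
```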

> > The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > once the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.

That is true, but I think we can trust the planner; this problem
scales similarly for a memoize node.
Is ~24kB for each open index scan a good guess?

ALTERNATIVE #1 - More efficient

Alternatively, to avoid having N open index scans we could (??):
(1) find the index page for the head of each prefix.
(2) for each prefix:
    (2.a) load the tuples from its head page.
    (2.b) if we consume the last tuple in a page, save a pointer
          to the next page.
    (2.c) check whether tuples for the next prefix are in the same page.
    (2.d) release the page.
(3) produce tuples in suffix order:
    (3.a) when the tuples for a prefix are exhausted, load the
          page saved in (2.b).

Step (2.a) could waste time extracting tuples that are not needed, and
memory storing them. I am not sure how efficient this can be compared
to keeping an open index scan.

Matthias van de Meent wrote (Feb 3):
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.

This could also be addressed if we do a custom descent. I did not
bother with the depth factor because, with a few random prefixes, a
shared descent would probably save accesses only at the top levels of
the tree.

I would prefer to start with a very conceptual implementation that can
already provide a 1000x speedup, but if you think this way is better, I
am open to trying it. I think this can be done without affecting the
planner logic or the PrefixJoin node.

> I'm afraid the proposed batches execution will be rather complex, so
> I'd say v1 should simply have a threshold, and do the full scan + sort
> for more items.

Do you mean an executor node that performs the query as if it were written

ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
    result += (SELECT FROM table
               WHERE prefix = _prefix AND qual(*)
               ORDER BY suffix
               LIMIT N)
return SELECT * FROM result
       ORDER BY suffix
       LIMIT N
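
As plain SQL (a rough sketch against the grid table from the
experiment quoted below, not what the executor node would literally
do), that would be:

```sql
-- One ordered, limited scan per prefix, then a final top-N sort over
-- the concatenated batches.
SELECT *
FROM unnest(ARRAY[2,3,5,8,13,21,34,55]) AS p(a)
CROSS JOIN LATERAL (
    SELECT * FROM grid
    WHERE grid.a = p.a AND b >= 0
    ORDER BY b
    LIMIT 100            -- N rows fetched per prefix
) g
ORDER BY g.b
LIMIT 100;               -- final top-N over up to N * len(prefixes) rows
```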

This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1; with
N = 100 and 8 prefixes that is 800 rows versus 107.

Compared with the previous results:
> | Method | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge | 13 | 119 | 13 ms |
> | IndexScan | 15,308 | 525,310 | 3,409 ms |

This Prefix Batch Scan approach gives:
hit=62 read=773, Execution Time: 80.815 ms
With 8 prefixes, the execution time increased by a factor of
80.8/13 ~ 6.21, and the number of buffers by 835/132 ~ 6.32
(62+773 = 835 buffers versus 13+119 = 132).

> > I can imagine that this would really nicely benefit from
> > ReadStream'ification.
>
> Not sure, maybe.
>

Actually, as I was watching the index prefetch development, I was
quite uncertain about how this would interact with it, but we can
probably simply give each stream a budget.

> > One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> > JOIN posts p ON f.followed_id = p.user_id
> > WHERE f.follower_id = :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and = ANY subqueries.
> >

> Automating that transformation seems quite non-trivial (to me).

Well, not trivial. To give a rough idea:

wc -l *.patch
113 v2-0001-Test-the-baseline.patch
614 v2-0002-Access-method.patch
850 v2-0003-Planner-integration.patch
1958 v2-0004-Multi-column.patch
2439 v2-0005-Joins.patch

It is missing some important details, like prefix deduplication, but
for the scenario where the values in the other table are known to be
unique it works well.

The multi-column support accepts things like A IN (...) AND B IN (...),
computing the cartesian product of the lists, as well as (A, B) IN (...).
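
For example (a sketch with a hypothetical table tbl indexed on
(a, b, c)):

```sql
-- Shape 1: separate IN lists; the scan heads are the cartesian
-- product {1,2} x {10,20}: (1,10), (1,20), (2,10), (2,20).
SELECT * FROM tbl
WHERE a IN (1, 2) AND b IN (10, 20)
ORDER BY c LIMIT 100;

-- Shape 2: row-wise IN; the scan heads are exactly the listed pairs.
SELECT * FROM tbl
WHERE (a, b) IN ((1, 10), (2, 20))
ORDER BY c LIMIT 100;
```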

Regards,
Alexandre

On Mon, Feb 23, 2026 at 10:08 PM Alexandre Felipe <
o(dot)alexandre(dot)felipe(at)gmail(dot)com> wrote:

> Hi Hackers,
>
> Do you think that MERGE-SCAN was a terrible name? I wanted a name that
> wouldn't require much explanation. I named it like this because it
> relies on a k-way merge to combine several segments of an index in one
> result. But we already have a MERGE statement. Even in the example plan
> above we can see an external merge that has nothing to do with the new
> feature, and now as I am doing joins, I started doing it on the
> NestedLoop, trying to follow the same conditions that lead to a
> memoize. But I added so many fields to the NestedLoop state that I
> think it is good to have a separate structure, and maybe a separate
> node, and MergeScan of course is taken hehe. I was thinking of
> IndexPrefixMerge. We could use the Ants nickname TimeLineScan, but of
> course it is not limited to time lines (even though realistically,
> that will probably be the most common use of this). Another one I
> considered was TransposedIndexScan, because it orders output on
> (suffix, prefix) instead of (prefix, suffix).
>
> On Fri, Feb 6, 2026 at 10:52 AM Alexandre Felipe <
> o(dot)alexandre(dot)felipe(at)gmail(dot)com> wrote:
>
>> Hello again hackers!
>>
>> +pt(at)bowt(dot)ie <pt(at)bowt(dot)ie>: that seems to be the person
>> probably most familiar with the index scan code (based on the commits).
>> +michael(at)paquier(dot)xyz <michael(at)paquier(dot)xyz> , +tgl(at)sss(dot)pgh(dot)pa(dot)us
>> <tgl(at)sss(dot)pgh(dot)pa(dot)us> , +peter(at)eisentraut(dot)org <peter(at)eisentraut(dot)org> as
>> the top 3 committers to nbtree over the last ~6 months.
>>
>> I have made substantial progress on adding a few features. I have
>> questions, but I will let you go first :)
>>
>> Motivation:
>> *In technical terms:* this proposal is to take advantage of a btree
>> index when the query is filtered by a few distinct prefixes and ordered by
>> a suffix and has a limit.
>> *In non-technical terms:* this could help to efficiently render a social
>> network feed, where each user can select a list of users whose posts they
>> want to see, and the posts must be ordered from newest to oldest.
>>
>>
>> *Performance Comparison*
>> I did a test with a toy table, please find more details below.
>>
>> With limit 100
>>
>> | Method | Shared Hit | Shared Read | Exec Time |
>> |------------|-----------:|------------:|----------:|
>> | Merge | 13 | 119 | 13 ms |
>> | IndexScan | 15,308 | 525,310 | 3,409 ms |
>>
>> With limit 1,000,000
>>
>> | Method     | Shared Hit | Shared Read | Temp Read | Temp Write | Exec Time |
>> |------------|-----------:|------------:|----------:|-----------:|----------:|
>> | Merge      |    980,318 |      19,721 |         0 |          0 |  2,128 ms |
>> | Sequential |     15,208 |     525,410 |    20,207 |     35,384 |  3,762 ms |
>> | Bitmap     |        629 |     113,759 |    20,207 |     35,385 |  5,487 ms |
>> | IndexScan  |  7,880,619 |     126,706 |    20,945 |     35,386 |  5,874 ms |
>>
>> Sequential scans and bitmap scans in this case significantly reduce
>> the number of accessed buffers because the table has only four integer
>> columns, and these methods can read all the rows on a given page at a
>> time.
>>
>> However, that comes at the cost of resorting to an on-disk sort method.
>> For the query with limit 100 we get no temp files, as we are using a
>> top-100 sort.
>>
>> make check passes
>>
>>
>> *Experiment details*
>>
>> Consider a 100M row table formed by (a, b, c, d) \in 100 x 100 x 100 x 100
>>
>>
>> ```sql
>> CREATE TABLE grid AS (
>> SELECT a, b, c, d FROM
>> generate_series(1, 100) AS a,
>> generate_series(1, 100) AS b,
>> generate_series(1, 100) AS c,
>> generate_series(1, 100) AS d
>> );
>>
>> CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
>> ANALYSE grid;
>> ```
>>
>> Now let's say that we need to find a certain number of rows filtered
>> by a and ordered by b:
>> ```sql
>> PREPARE grid_query(int) AS
>> SELECT sum(d) FROM (
>> SELECT * FROM grid
>> WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
>> ORDER BY b
>> LIMIT $1) t;
>> ```
>>
>> ---
>>
>>
>> Now with limit 100, with index merge scan (notice Index Prefixes in the
>> plan).
>>
>> ```sql
>> SET enable_indexmergescan = on;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>> Buffers: shared hit=13 read=119
>> -> Limit (cost=0.57..87.29 rows=100 width=16) (actual
>> time=5.528..12.999 rows=100.00 loops=1)
>> Buffers: shared hit=13 read=119
>> -> Index Scan using grid_a_b_c_idx on grid (cost=0.57..93.36
>> rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
>> Index Cond: (b >= 0)
>> *Index Prefixes:* (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>> Index Searches: 8
>> Buffers: shared hit=13 read=119
>> Planning:
>> Buffers: shared hit=59 read=23
>> Planning Time: 4.619 ms
>> Execution Time: 13.055 ms
>> ```
>>
>>
>> ```sql
>> SET enable_indexmergescan = off;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>> Aggregate (cost=1603588.06..1603588.07 rows=1 width=8) (actual
>> time=3406.624..3408.710 rows=1.00 loops=1)
>> Buffers: shared hit=15308 read=525310
>> -> Limit (cost=1603575.17..1603586.81 rows=100 width=16) (actual
>> time=3406.601..3408.702 rows=100.00 loops=1)
>> Buffers: shared hit=15308 read=525310
>> -> Gather Merge (cost=1603575.17..2514342.92 rows=7819999
>> width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=15308 read=525310
>> -> Sort (cost=1602575.14..1610720.98 rows=3258333
>> width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
>> Sort Key: grid.b
>> Sort Method: top-N heapsort Memory: 32kB
>> Buffers: shared hit=15308 read=525310
>> Worker 0: Sort Method: top-N heapsort Memory: 32kB
>> Worker 1: Sort Method: top-N heapsort Memory: 32kB
>> -> *Parallel Seq Scan* on grid
>> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
>> rows=2666666.67 loops=3)
>> Filter: ((b >= 0) AND (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>> Rows Removed by Filter: 30666667
>> Buffers: shared hit=15234 read=525310
>> Planning Time: 0.370 ms
>> Execution Time: 3409.134 ms
>> ```
>>
>> Now queries with limit 1,000,000
>>
>> ```sql
>> SET enable_indexmergescan = on;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>>
>> Query executed with the proposed access method. Notice in the plan Index
>> Prefixes and Index Cond.
>> ```text
>> Buffers: shared hit=980318 read=19721
>> -> Limit (cost=0.57..867259.84 rows=1000000 width=16) (actual
>> time=2.854..2103.438 rows=1000000.00 loops=1)
>> Buffers: shared hit=980318 read=19721
>> -> Index Scan using grid_a_b_c_idx on grid
>> (cost=0.57..867265.91 rows=1000007 width=16) (actual time=2.852..2066.205
>> rows=1000000.00 loops=1)
>> Index Cond: (b >= 0)
>> *Index Prefixes:* (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>> Index Searches: 8
>> Buffers: shared hit=980318 read=19721
>> Planning Time: 0.328 ms
>> Execution Time: 2127.811 ms
>> ```
>>
>> If we disable enable_indexmergescan, we naturally fall back to a
>> sequential scan.
>>
>> ```sql
>> SET enable_indexmergescan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>> Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
>> -> Limit (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
>> time=3467.012..3712.044 rows=1000000.00 loops=1)
>> Buffers: shared hit=15208 read=525410, temp read=20207
>> written=35384
>> -> Gather Merge (cost=1942895.64..2853663.39 rows=7819999
>> width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=15208 read=525410, temp read=20207
>> written=35384
>> -> Sort (cost=1941895.62..1950041.45 rows=3258333
>> width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
>> Sort Key: grid.b
>> Sort Method: *external merge Disk: 47016kB*
>> Buffers: shared hit=15208 read=525410, temp
>> read=20207 written=35384
>> Worker 0: Sort Method: external merge Disk: 46976kB
>> Worker 1: Sort Method: external merge Disk: 47000kB
>> -> *Parallel Seq Scan* on grid
>> (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
>> rows=2666666.67 loops=3)
>> Filter: ((b >= 0) AND (a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>> Rows Removed by Filter: 30666667
>> Buffers: shared hit=15134 read=525410
>> Planning Time: 0.332 ms
>> Execution Time: 3761.866 ms
>> ```
>>
>> If we disable sequential scans, then we get a bitmap scan
>>
>> ```sql
>> SET enable_seqscan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>> Buffers: shared hit=629 read=113759 written=2, temp read=20207
>> written=35385
>> -> Limit (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
>> time=5170.456..5453.433 rows=1000000.00 loops=1)
>> Buffers: shared hit=629 read=113759 written=2, temp read=20207
>> written=35385
>> -> Gather Merge (cost=1998199.78..2908967.53 rows=7819999
>> width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=629 read=113759 written=2, temp
>> read=20207 written=35385
>> -> Sort (cost=1997199.75..2005345.59 rows=3258333
>> width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
>> Sort Key: grid.b
>> Sort Method: external merge Disk: 47032kB
>> Buffers: shared hit=629 read=113759 written=2, temp
>> read=20207 written=35385
>> Worker 0: Sort Method: external merge Disk: 47280kB
>> Worker 1: Sort Method: external merge Disk: 46680kB
>> -> Parallel Bitmap Heap Scan on grid
>> (cost=107691.54..1533348.13 rows=3258333 width=16) (actual
>> time=299.891..4489.787 rows=2666666.67 loops=3)
>> Recheck Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Rows *Removed by Index Recheck*: 2410242
>> Heap Blocks: exact=13100 lossy=22639
>> Buffers: shared hit=615 read=113759 written=2
>> Worker 0: Heap Blocks: exact=13077 lossy=22755
>> Worker 1: Heap Blocks: exact=13036 lossy=22421
>> -> *Bitmap Index Scan* on grid_a_b_c_idx
>> (cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
>> rows=8000000.00 loops=1)
>> Index Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Index Searches: 7
>> Buffers: shared hit=13 read=7293
>> written=2
>> Planning Time: 0.165 ms
>> Execution Time: 5487.213 ms
>> ```
>>
>> If we disable bitmap scans we finally get an index scan
>>
>> ```sql
>> SET enable_bitmapscan = off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```
>> Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
>> -> Limit (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
>> time=4414.478..4674.400 rows=1000000.00 loops=1)
>> Buffers: shared hit=7883221 read=124111, temp read=20699
>> written=35385
>> -> Gather Merge (cost=7201203.08..8111970.83 rows=7819999
>> width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
>> Workers Planned: 2
>> Workers Launched: 2
>> Buffers: shared hit=7883221 read=124111, temp read=20699
>> written=35385
>> -> Sort (cost=7200203.05..7208348.88 rows=3258333
>> width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
>> Sort Key: grid.b
>> Sort Method: *external merge Disk: 47304kB*
>> Buffers: shared hit=7883221 read=124111, temp
>> read=20699 written=35385
>> Worker 0: Sort Method: external merge Disk: 47304kB
>> Worker 1: Sort Method: external merge Disk: 46384kB
>> -> *Parallel Index Scan* using grid_a_b_c_idx on
>> grid (cost=0.57..6736351.43 rows=3258333 width=16) (actual
>> time=46.925..3796.915 rows=2666666.67 loops=3)
>> Index Cond: ((a = ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
>> Index Searches: 7
>> Buffers: shared hit=7883208 read=124110
>> Planning Time: 0.385 ms
>> Execution Time: 4713.325 ms
>> ```
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
>> o(dot)alexandre(dot)felipe(at)gmail(dot)com> wrote:
>>
>>> Thank you for looking into this.
>>>
>>> Now we can execute a, still narrow, family of queries!
>>>
>>> Maybe it helps to see this as a *social network feed*. Imagine a
>>> social network: you have a few friends, or follow a few people, and
>>> you want to see their updates ordered by date. For each user we have
>>> a different combination of users whose posts we have to display. But
>>> even with hundreds of users, maybe we will only show the first 10.
>>>
>>> There is low-hanging fruit in the skip scan: if we need N rows, and
>>> one group already has M rows, we could stop there.
>>> If Nx is the number of friends, and M is the number of posts to show,
>>> this runs with (Nx * M) row accesses, followed by an (Nx * M) sort,
>>> instead of (Nx * N) accesses followed by an (Nx * N) sort.
>>> Where M = 10 and N is 1000 this is a significant improvement.
>>> But if M ~ N, the merge scan runs with only M + Nx row accesses and
>>> (M + Nx) heap operations.
>>> If everything is on the same page the skip scan would win.
>>>
>>> The cost estimation is probably far off.
>>> I am also not considering the filters applied after this operator,
>>> and I don't know if the planner infrastructure is able to adjust it
>>> by itself. This is where I would like reviewers' feedback. I think
>>> the planner costs are something to be determined experimentally.
>>>
>>> Next I will make it slightly more general, handling:
>>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>>> ORDER BY b LIMIT N (ignoring s...)
>>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>>> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
>>> (a, b, c)
>>>
>>> ---
>>> Kind Regards,
>>> Alexandre
>>>
>>> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <michal(at)kleczek(dot)org>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 3 Feb 2026, at 22:42, Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> wrote:
>>>>
>>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>>
>>>> I'm also wondering how common is the targeted query pattern? How common
>>>> it is to have an IN condition on the leading column in an index, and
>>>> ORDER BY on the second one?
>>>>
>>>>
>>>> I have seen this pattern multiple times. My nickname for it is the
>>>> timeline view. Think of the social media timeline, showing posts from
>>>> all followed accounts in timestamp order, returned in reasonably sized
>>>> batches. The naive SQL query will have to scan all posts from all
>>>> followed accounts and pass them through a top-N sort. When the total
>>>> number of posts is much larger than the batch size this is much slower
>>>> than what is proposed here (assuming I understand it correctly) -
>>>> effectively equivalent to running N index scans through Merge Append.
>>>>
>>>>
>>>> The workarounds I have proposed to users have been either to rewrite the
>>>> query as a UNION ALL of a set of single value prefix queries wrapped
>>>> in an order by limit. This gives the exact needed merge append plan
>>>> shape. But repeating the query N times can get unwieldy when the
>>>> number of values grows, so the fallback is:
>>>>
>>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>>> SELECT * FROM posts
>>>> WHERE user_id = id
>>>> ORDER BY tstamp DESC LIMIT 100)
>>>> ORDER BY tstamp DESC LIMIT 100;
>>>>
>>>> The downside of this formulation is that we still have to fetch a
>>>> batch worth of items from scans where we otherwise would have only had
>>>> to look at one index tuple.
>>>>
>>>>
>>>> GIST can be used to handle this kind of query as it supports multiple
>>>> sort orders.
>>>> The only problem is that GIST does not support ORDER BY column.
>>>> One possible workaround is [1] but as described there it does not play
>>>> well with partitioning.
>>>> I’ve started drafting support for ORDER BY column in GIST - see [2].
>>>> I think it would be easier to implement and maintain than a new IAM
>>>> (but I don’t have enough knowledge and experience to implement it myself)
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>>>> [2]
>>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>>>
>>>> —
>>>> Kind regards,
>>>> Michal
>>>>
>>>
