Re: New access method for b-tree.

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Alexandre Felipe <alexandre(dot)felipe(at)tpro(dot)io>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New access method for b-tree.
Date: 2026-02-03 21:42:26
Message-ID: CANwKhkNd85u+4joaKR3YHoDOQSMg5SmJmsYJGo-tMyW=XVXTew@mail.gmail.com
Lists: pgsql-hackers

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size, this is much slower
than what is proposed here (assuming I understand it correctly), which
is effectively equivalent to running N index scans through a Merge Append.
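
For concreteness, using the same table and column names as the LATERAL
rewrite further below (posts, user_id, tstamp, with an index on
(user_id, tstamp) assumed), the naive form is something like:

SELECT * FROM posts
WHERE user_id = ANY (:friends)
ORDER BY tstamp DESC LIMIT 100;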

The workarounds I have proposed to users have been either to rewrite
the query as a UNION ALL of single-value prefix queries, each wrapped
in an ORDER BY ... LIMIT. This gives exactly the needed Merge Append
plan shape (sketched after the LATERAL query below). But repeating the
query N times can get unwieldy when the number of values grows, so the
fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
    SELECT * FROM posts
    WHERE user_id = id
    ORDER BY tstamp DESC LIMIT 100) p
ORDER BY tstamp DESC LIMIT 100;
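
For reference, the UNION ALL shape mentioned above has to be written
out per query; with made-up ids 1, 2, 3 standing in for the followed
accounts it looks roughly like this:

SELECT * FROM (
    (SELECT * FROM posts WHERE user_id = 1 ORDER BY tstamp DESC LIMIT 100)
    UNION ALL
    (SELECT * FROM posts WHERE user_id = 2 ORDER BY tstamp DESC LIMIT 100)
    UNION ALL
    (SELECT * FROM posts WHERE user_id = 3 ORDER BY tstamp DESC LIMIT 100)
) u
ORDER BY tstamp DESC LIMIT 100;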

The downside of the LATERAL formulation is that we still have to fetch
a batch worth of items from each scan where we otherwise would have
only had to look at one index tuple.

The main problem I can see is that at planning time the cardinality of
the prefix array might not be known, and in theory could be in the
millions. Having millions of index scans open at the same time is not
viable, so the method needs to somehow degrade gracefully. The idea I
had is to pick some limit, based on work_mem and/or benchmarking, and
once the limit is hit, populate the first batch and then run the next
batch of index scans, merging with the first result. Or something like
that; I can imagine a few different ways to handle it with different
tradeoffs.
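
As a rough SQL-level illustration of that batch-and-merge idea (not of
the executor mechanism), with the prefix array pre-split into two
hypothetical chunks :friends_1 and :friends_2 of at most the chosen
limit each:

SELECT * FROM (
    SELECT p.* FROM unnest(:friends_1) id, LATERAL (
        SELECT * FROM posts WHERE user_id = id
        ORDER BY tstamp DESC LIMIT 100) p
    ORDER BY tstamp DESC LIMIT 100) batch_1
UNION ALL
SELECT * FROM (
    SELECT p.* FROM unnest(:friends_2) id, LATERAL (
        SELECT * FROM posts WHERE user_id = id
        ORDER BY tstamp DESC LIMIT 100) p
    ORDER BY tstamp DESC LIMIT 100) batch_2
ORDER BY tstamp DESC LIMIT 100;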

I can imagine that this would really nicely benefit from ReadStream'ification.

One other connection I see is with block nested loops. In a perfect
future, PostgreSQL could run the following as a set of merged index
scans that terminate early:

SELECT p.*
FROM follows f
JOIN posts p ON f.followed_id = p.user_id
WHERE f.follower_id = :userid
ORDER BY p.tstamp DESC LIMIT 100;

In practice this is not a huge issue - it's not that hard to transform
this to array_agg and = ANY subqueries.
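
For reference, that transform on the same tables would be roughly:

SELECT p.*
FROM posts p
WHERE p.user_id = ANY ((SELECT array_agg(followed_id)
                        FROM follows
                        WHERE follower_id = :userid))
ORDER BY p.tstamp DESC LIMIT 100;

(The doubled parentheses make the subquery a scalar expression
yielding an array, so = ANY uses its array form rather than the
subquery form.)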

Regards,
Ants Aasma
