| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> |
| Cc: | Alexandre Felipe <alexandre(dot)felipe(at)tpro(dot)io>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: New access method for b-tree. |
| Date: | 2026-02-03 22:41:57 |
| Message-ID: | 4be96b19-1e19-4000-b65a-eada001e5a9a@vondra.me |
| Lists: | pgsql-hackers |
On 2/3/26 22:42, Ants Aasma wrote:
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I'm also wondering how common the targeted query pattern is. How
>> common is it to have an IN condition on the leading column of an
>> index, and an ORDER BY on the second one?
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size, this is much
> slower than what is proposed here (which, assuming I understand it
> correctly, is effectively equivalent to running N index scans through
> a Merge Append).
>
Makes sense. I guess filtering products by category + order by price
could also produce this query pattern.
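
For the archives, the general shape of the pattern (with a made-up
timeline schema, assuming an index on posts (user_id, tstamp)) would
be something like:

    SELECT *
      FROM posts
     WHERE user_id = ANY (:friends)  -- IN condition on the leading column
     ORDER BY tstamp DESC            -- ORDER BY on the second column
     LIMIT 100;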
> The workarounds I have proposed to users have been to rewrite the
> query as a UNION ALL of single-value prefix queries, each wrapped in
> an ORDER BY ... LIMIT. This gives exactly the needed Merge Append
> plan shape. But repeating the query N times gets unwieldy as the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
>     SELECT * FROM posts
>     WHERE user_id = id
>     ORDER BY tstamp DESC LIMIT 100) p
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch's worth of items from scans where we otherwise would only have
> had to look at one index tuple.
>
True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.
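
To spell out the UNION ALL formulation you mention (my sketch, for two
values, same assumed schema):

    (SELECT * FROM posts WHERE user_id = 1
      ORDER BY tstamp DESC LIMIT 100)
    UNION ALL
    (SELECT * FROM posts WHERE user_id = 2
      ORDER BY tstamp DESC LIMIT 100)
    ORDER BY tstamp DESC LIMIT 100;

Each branch is a single-prefix index scan reading at most LIMIT tuples,
which is what makes the Merge Append plan shape possible.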
> The main problem I can see is that at planning time the cardinality of
> the prefix array might not be known, and in theory could be in the
> millions. Having millions of index scans open at the same time is not
> viable, so the method needs to somehow degrade gracefully. The idea I
> had is to pick some limit, based on work_mem and/or benchmarking, and
> once the limit is hit, populate the first batch and then run the next
> batch of index scans, merging with the first result. Or something like
> that, I can imagine a few different ways to handle it with different
> tradeoffs.
>
Doesn't the proposed merge scan have a similar issue? It will also have
to keep all the index scans open (even if only internally). Indeed, it
needs to degrade gracefully in some way. I'm afraid the proposed
batched execution will be rather complex, so I'd say v1 should simply
have a threshold, and do the full scan + sort for more items.
> I can imagine that this would really nicely benefit from ReadStream'ification.
>
Not sure, maybe.
> One other connection I see is with block nested loops. In a perfect
> future PostgreSQL could run the following as a set of merged index
> scans that terminate early:
>
> SELECT p.*
> FROM follows f
> JOIN posts p ON f.followed_id = p.user_id
> WHERE f.follower_id = :userid
> ORDER BY p.tstamp DESC LIMIT 100;
>
> In practice this is not a huge issue - it's not that hard to transform
> this to array_agg and = ANY subqueries.
>
Automating that transformation seems quite non-trivial (to me).
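
For reference, the manual rewrite I'd expect here is roughly (a sketch,
untested, same assumed schema):

    SELECT p.*
      FROM posts p
     WHERE p.user_id = ANY (
               -- the extra parentheses make this a scalar array
               -- expression rather than a sub-SELECT, so the planner
               -- sees a ScalarArrayOpExpr, like with an explicit IN list
               (SELECT array_agg(followed_id)
                  FROM follows
                 WHERE follower_id = :userid))
     ORDER BY p.tstamp DESC
     LIMIT 100;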
regards
--
Tomas Vondra