| From: | Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com> |
|---|---|
| To: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pt(at)bowt(dot)ie, michael(at)paquier(dot)xyz, tgl(at)sss(dot)pgh(dot)pa(dot)us, "peter(at)eisentraut(dot)org" <peter(at)eisentraut(dot)org> |
| Cc: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <alexandre(dot)felipe(at)tpro(dot)io>, Michał Kłeczek <michal(at)kleczek(dot)org>, pg(at)bowt(dot)ie |
| Subject: | Re: New access method for b-tree. |
| Date: | 2026-02-06 10:52:01 |
| Message-ID: | CAE8JnxM5GDEWdvEckjgG60OwPK04pZ9dSyxYm2+-PuyKCpmo-w@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello again hackers!
+pt(at)bowt(dot)ie <pt(at)bowt(dot)ie>: based on the commit history, you seem to be the
person most familiar with the index scan code.
+michael(at)paquier(dot)xyz <michael(at)paquier(dot)xyz> , +tgl(at)sss(dot)pgh(dot)pa(dot)us
<tgl(at)sss(dot)pgh(dot)pa(dot)us> , +peter(at)eisentraut(dot)org <peter(at)eisentraut(dot)org>: the
top three committers to nbtree over the last ~6 months.
I have made substantial progress on adding a few features. I have
questions, but I will let you go first :)
Motivation:
*In technical terms:* this proposal takes advantage of a btree index when
the query filters on a few distinct prefix values, orders by a suffix
column, and has a limit.
*In non-technical terms:* this could help render a social network feed
efficiently, where each user follows a list of users whose posts they want to
see, and the posts must be shown from newest to oldest.
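As a rough sketch of that query shape (the posts table, its columns, and the
index here are hypothetical, purely for illustration):
```sql
-- Hypothetical schema: posts(user_id int, created_at timestamptz, ...)
-- CREATE INDEX ON posts (user_id, created_at DESC);
SELECT *
FROM posts
WHERE user_id IN (42, 101, 7)   -- accounts this user follows (prefix values)
ORDER BY created_at DESC        -- index suffix, newest first
LIMIT 10;
```
Each prefix value corresponds to an already-ordered run in the
(user_id, created_at) index, so the runs can be merged up to the limit
instead of sorting every matching row.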
*Performance Comparison*
I ran a test on a toy table; the full details are further below.
With LIMIT 100:
| Method | Shared Hit | Shared Read | Exec Time |
|------------|-----------:|------------:|----------:|
| Merge | 13 | 119 | 13 ms |
| Sequential | 15,308 | 525,310 | 3,409 ms |
With LIMIT 1,000,000:
| Method | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
|------------|-----------:|------------:|----------:|-------------:|----------:|
| Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms |
| Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms |
| Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms |
| IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms |
Sequential scans and bitmap scans significantly reduce the number of accessed
buffers in this case, because the table has only four integer columns and
these methods read all the rows on a page at once.
However, that comes at the cost of falling back to an on-disk sort.
For the query with LIMIT 100 there are no temp files, since a top-100
heapsort is used.
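As a side note, the rows-per-page point can be sanity-checked from the
catalog once the grid table below has been created and analysed; a minimal
sketch, not part of the patch:
```sql
-- Approximate number of heap rows per 8kB page of the narrow test table.
SELECT relpages,
       reltuples,
       round((reltuples / relpages)::numeric, 1) AS rows_per_page
FROM pg_class
WHERE relname = 'grid';
```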
make check passes
*Experiment details*
Consider a 100M-row table whose rows (a, b, c, d) form the grid 100 x 100 x 100 x 100:
```sql
CREATE TABLE grid AS (
SELECT a, b, c, d FROM
generate_series(1, 100) AS a,
generate_series(1, 100) AS b,
generate_series(1, 100) AS c,
generate_series(1, 100) AS d
);
CREATE INDEX ON grid (a, b, c);  -- default name grid_a_b_c_idx, as seen in the plans below
ANALYSE grid;
```
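A quick sanity check of the generated data (the row count should be exactly
100M; on-disk sizes will vary by platform):
```sql
SELECT count(*) FROM grid;                                   -- 100000000
SELECT pg_size_pretty(pg_table_size('grid'))              AS heap_size,
       pg_size_pretty(pg_relation_size('grid_a_b_c_idx')) AS index_size;
```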
Now let's say that we need to fetch a certain number of rows filtered by a and
ordered by b:
```sql
PREPARE grid_query(int) AS
SELECT sum(d) FROM (
SELECT * FROM grid
WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
ORDER BY b
LIMIT $1) t;
```
---
First, LIMIT 100 with the index merge scan enabled (notice the Index Prefixes
line in the plan).
```sql
SET enable_indexmergescan = on;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```
```text
   Buffers: shared hit=13 read=119
   ->  Limit  (cost=0.57..87.29 rows=100 width=16) (actual time=5.528..12.999 rows=100.00 loops=1)
         Buffers: shared hit=13 read=119
         ->  Index Scan using grid_a_b_c_idx on grid  (cost=0.57..93.36 rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
               Index Cond: (b >= 0)
               Index Prefixes: (a = ANY ('{2,3,5,8,13,21,34,55}'::integer[]))
               Index Searches: 8
               Buffers: shared hit=13 read=119
 Planning:
   Buffers: shared hit=59 read=23
 Planning Time: 4.619 ms
 Execution Time: 13.055 ms
```
```sql
SET enable_indexmergescan = off;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```
```text
 Aggregate  (cost=1603588.06..1603588.07 rows=1 width=8) (actual time=3406.624..3408.710 rows=1.00 loops=1)
   Buffers: shared hit=15308 read=525310
   ->  Limit  (cost=1603575.17..1603586.81 rows=100 width=16) (actual time=3406.601..3408.702 rows=100.00 loops=1)
         Buffers: shared hit=15308 read=525310
         ->  Gather Merge  (cost=1603575.17..2514342.92 rows=7819999 width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=15308 read=525310
               ->  Sort  (cost=1602575.14..1610720.98 rows=3258333 width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
                     Sort Key: grid.b
                     Sort Method: top-N heapsort  Memory: 32kB
                     Buffers: shared hit=15308 read=525310
                     Worker 0:  Sort Method: top-N heapsort  Memory: 32kB
                     Worker 1:  Sort Method: top-N heapsort  Memory: 32kB
                     ->  Parallel Seq Scan on grid  (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896 rows=2666666.67 loops=3)
                           Filter: ((b >= 0) AND (a = ANY ('{2,3,5,8,13,21,34,55}'::integer[])))
                           Rows Removed by Filter: 30666667
                           Buffers: shared hit=15234 read=525310
 Planning Time: 0.370 ms
 Execution Time: 3409.134 ms
```
Now the queries with LIMIT 1,000,000.
```sql
SET enable_indexmergescan = on;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
This is the query executed with the proposed access method; notice the Index
Prefixes and Index Cond lines in the plan.
```text
   Buffers: shared hit=980318 read=19721
   ->  Limit  (cost=0.57..867259.84 rows=1000000 width=16) (actual time=2.854..2103.438 rows=1000000.00 loops=1)
         Buffers: shared hit=980318 read=19721
         ->  Index Scan using grid_a_b_c_idx on grid  (cost=0.57..867265.91 rows=1000007 width=16) (actual time=2.852..2066.205 rows=1000000.00 loops=1)
               Index Cond: (b >= 0)
               Index Prefixes: (a = ANY ('{2,3,5,8,13,21,34,55}'::integer[]))
               Index Searches: 8
               Buffers: shared hit=980318 read=19721
 Planning Time: 0.328 ms
 Execution Time: 2127.811 ms
```
If we disable enable_indexmergescan, we naturally fall back to a sequential scan.
```sql
SET enable_indexmergescan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
   Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
   ->  Limit  (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual time=3467.012..3712.044 rows=1000000.00 loops=1)
         Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
         ->  Gather Merge  (cost=1942895.64..2853663.39 rows=7819999 width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
               ->  Sort  (cost=1941895.62..1950041.45 rows=3258333 width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
                     Sort Key: grid.b
                     Sort Method: external merge  Disk: 47016kB
                     Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
                     Worker 0:  Sort Method: external merge  Disk: 46976kB
                     Worker 1:  Sort Method: external merge  Disk: 47000kB
                     ->  Parallel Seq Scan on grid  (cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483 rows=2666666.67 loops=3)
                           Filter: ((b >= 0) AND (a = ANY ('{2,3,5,8,13,21,34,55}'::integer[])))
                           Rows Removed by Filter: 30666667
                           Buffers: shared hit=15134 read=525410
 Planning Time: 0.332 ms
 Execution Time: 3761.866 ms
```
If we also disable sequential scans, we get a bitmap scan:
```sql
SET enable_seqscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
   Buffers: shared hit=629 read=113759 written=2, temp read=20207 written=35385
   ->  Limit  (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual time=5170.456..5453.433 rows=1000000.00 loops=1)
         Buffers: shared hit=629 read=113759 written=2, temp read=20207 written=35385
         ->  Gather Merge  (cost=1998199.78..2908967.53 rows=7819999 width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=629 read=113759 written=2, temp read=20207 written=35385
               ->  Sort  (cost=1997199.75..2005345.59 rows=3258333 width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
                     Sort Key: grid.b
                     Sort Method: external merge  Disk: 47032kB
                     Buffers: shared hit=629 read=113759 written=2, temp read=20207 written=35385
                     Worker 0:  Sort Method: external merge  Disk: 47280kB
                     Worker 1:  Sort Method: external merge  Disk: 46680kB
                     ->  Parallel Bitmap Heap Scan on grid  (cost=107691.54..1533348.13 rows=3258333 width=16) (actual time=299.891..4489.787 rows=2666666.67 loops=3)
                           Recheck Cond: ((a = ANY ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
                           Rows Removed by Index Recheck: 2410242
                           Heap Blocks: exact=13100 lossy=22639
                           Buffers: shared hit=615 read=113759 written=2
                           Worker 0:  Heap Blocks: exact=13077 lossy=22755
                           Worker 1:  Heap Blocks: exact=13036 lossy=22421
                           ->  Bitmap Index Scan on grid_a_b_c_idx  (cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651 rows=8000000.00 loops=1)
                                 Index Cond: ((a = ANY ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
                                 Index Searches: 7
                                 Buffers: shared hit=13 read=7293 written=2
 Planning Time: 0.165 ms
 Execution Time: 5487.213 ms
```
If we disable bitmap scans as well, we finally get a plain index scan:
```sql
SET enable_bitmapscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
   Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
   ->  Limit  (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual time=4414.478..4674.400 rows=1000000.00 loops=1)
         Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
         ->  Gather Merge  (cost=7201203.08..8111970.83 rows=7819999 width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
               ->  Sort  (cost=7200203.05..7208348.88 rows=3258333 width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
                     Sort Key: grid.b
                     Sort Method: external merge  Disk: 47304kB
                     Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
                     Worker 0:  Sort Method: external merge  Disk: 47304kB
                     Worker 1:  Sort Method: external merge  Disk: 46384kB
                     ->  Parallel Index Scan using grid_a_b_c_idx on grid  (cost=0.57..6736351.43 rows=3258333 width=16) (actual time=46.925..3796.915 rows=2666666.67 loops=3)
                           Index Cond: ((a = ANY ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
                           Index Searches: 7
                           Buffers: shared hit=7883208 read=124110
 Planning Time: 0.385 ms
 Execution Time: 4713.325 ms
```
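For completeness, the planner settings toggled above can be restored in the
same session afterwards:
```sql
RESET enable_bitmapscan;
RESET enable_seqscan;
RESET enable_indexmergescan;  -- GUC added by the patch set
```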
On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
o(dot)alexandre(dot)felipe(at)gmail(dot)com> wrote:
> Thank you for looking into this.
>
> Now we can execute a still narrow family of queries!
>
> Maybe it helps to see this as a *social network feeds*. Imagine a social
> network, you have a few friends, or follow a few people, and you want to
> see their updates ordered by date. For each user we have a different
> combination of users that we have to display. But maybe, even having
> hundreds of users we will only show the first 10.
>
> There is a low-hanging fruit on the skip scan: if we need N rows, and one
> group already has M rows, we could stop there.
> If Nx is the number of friends and M is the number of posts to show, this
> runs with (Nx * M) row accesses, followed by an (Nx * M) sort,
> instead of (Nx * N) accesses followed by an (Nx * N) sort.
> Where M = 10 and N is 1000 this is a significant improvement.
> But if M ~ N, the merge scan runs with M + Nx row accesses and (M + Nx)
> heap operations.
> If everything is on the same page the skip scan would win.
>
> The cost estimation is probably far off.
> I am also not considering the filters applied after this operator, and I
> don't know if the planner infrastructure is able to adjust it by itself.
> This is where I would like reviewer's feedback. I think that the planner
> costs are something to be determined experimentally.
>
> Next I will make it slightly more general handling
> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
> ORDER BY b LIMIT N (ignoring s...)
> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
> * Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
> (a, b, c)
>
> ---
> Kind Regards,
> Alexandre
>
> On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <michal(at)kleczek(dot)org> wrote:
>
>>
>>
>> On 3 Feb 2026, at 22:42, Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> wrote:
>>
>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>>
>>
>> I have seen this pattern multiple times. My nickname for it is the
>> timeline view. Think of the social media timeline, showing posts from
>> all followed accounts in timestamp order, returned in reasonably sized
>> batches. The naive SQL query will have to scan all posts from all
>> followed accounts and pass them through a top-N sort. When the total
>> number of posts is much larger than the batch size this is much slower
>> than what is proposed here (assuming I understand it correctly) -
>> effectively equivalent to running N index scans through Merge Append.
>>
>>
>> The workarounds I have proposed to users have been either to rewrite the
>> query as a UNION ALL of single-value prefix queries, each wrapped in an
>> ORDER BY ... LIMIT, which gives exactly the needed merge append plan
>> shape; but repeating the query N times can get unwieldy when the
>> number of values grows, so the fallback is:
>>
>> SELECT * FROM unnest(:friends) id, LATERAL (
>>     SELECT * FROM posts
>>     WHERE user_id = id
>>     ORDER BY tstamp DESC LIMIT 100) p
>> ORDER BY tstamp DESC LIMIT 100;
>>
>> The downside of this formulation is that we still have to fetch a
>> batch worth of items from scans where we otherwise would have only had
>> to look at one index tuple.
>>
>>
>> GIST can be used to handle this kind of query as it supports multiple
>> sort orders.
>> The only problem is that GIST does not support ORDER BY column.
>> One possible workaround is [1] but as described there it does not play
>> well with partitioning.
>> I’ve started drafting support for ORDER BY column in GIST - see [2].
>> I think it would be easier to implement and maintain than a new IAM (but
>> I don’t have enough knowledge and experience to implement it myself)
>>
>> [1]
>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org
>> [2]
>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org
>>
>> —
>> Kind regards,
>> Michal
>>
>
| Attachment | Content-Type | Size |
|---|---|---|
| 0003-MERGE-SCAN-Planner-integration.patch | application/octet-stream | 27.6 KB |
| 0004-MERGE-SCAN-Multi-column.patch | application/octet-stream | 61.3 KB |
| 0001-MERGE-SCAN-Test-the-baseline.patch | application/octet-stream | 7.5 KB |
| 0002-MERGE-SCAN-Access-method.patch | application/octet-stream | 49.1 KB |