Introduce Index Aggregate - new GROUP BY strategy

From: Sergey Soloviev <sergey(dot)soloviev(at)tantorlabs(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Introduce Index Aggregate - new GROUP BY strategy
Date: 2025-12-08 15:36:58
Message-ID: 7d6d8cc4-d74a-4edc-bbeb-944916cf1093@tantorlabs.ru
Lists: pgsql-hackers

Hi, hackers!

I would like to introduce a new GROUP BY strategy called Index Aggregate.
In a nutshell, we build a B+tree index whose keys are the GROUP BY
attributes, and if the memory limit is reached we build an index for each
batch and spill it to disk as a sorted run, performing a final external merge.

It works (and is implemented) much like Hash Aggregate; most of the
differences are in the spill logic (a rough sketch of the control flow
follows the list):

1. As tuples arrive, build an in-memory B+tree index
2. If the memory limit is reached, switch to spill mode (almost like hashagg):
     - calculate the hash of the tuple
     - decide which batch it should be stored in
     - spill the tuple to that batch
3. When all tuples are processed and there was no disk spill, return all tuples
    from the in-memory index
4. Otherwise:
     1. Spill the current index to disk, creating the initial sorted run
     2. Re-read each batch, building an in-memory index (it may spill again)
     3. At the end of each batch, spill it to disk and create another sorted run
     4. Perform the final external merge sort
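
To make the control flow more concrete, here is a rough sketch in C. All type
and helper names below (TupleIndex, batch_write, spill_index_as_sorted_run and
so on) are simplified placeholders for illustration only; they are not the
identifiers used in the attached patches, and the real code of course operates
on PostgreSQL's executor structures.

```
/*
 * Rough sketch of the Index Aggregate control flow described above.
 * Every identifier here is a hypothetical placeholder for illustration.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct Tuple Tuple;
typedef struct TupleIndex TupleIndex;   /* in-memory B+tree keyed on GROUP BY columns */
typedef struct Batch Batch;             /* on-disk spill file for one hash partition */
typedef struct SortedRun SortedRun;     /* on-disk sorted run for the final merge */

/* Hypothetical helpers (declarations only). */
extern bool memory_limit_reached(void);
extern void tupleindex_insert_or_update(TupleIndex *idx, Tuple *tup);
extern void tupleindex_reset(TupleIndex *idx);
extern int  tuple_batch_no(Tuple *tup, int nbatches);         /* hash -> batch number */
extern void batch_write(Batch *batch, Tuple *tup);
extern void batch_rewind(Batch *batch);
extern Tuple *batch_read(Batch *batch);
extern SortedRun *spill_index_as_sorted_run(TupleIndex *idx); /* in-order dump of the index */
extern void emit_groups_in_order(TupleIndex *idx);
extern void merge_sorted_runs(SortedRun **runs, int nruns);   /* external merge sort */

static void
index_aggregate(Tuple *(*next_tuple)(void), int nbatches,
                TupleIndex *idx, Batch **batches)
{
    SortedRun *runs[64];        /* fixed size only for brevity of the sketch */
    int        nruns = 0;
    bool       spilled = false;
    Tuple     *tup;

    /* Steps 1-2: build the in-memory index; once the memory limit is hit,
     * hash-partition further tuples into batches on disk (simplified;
     * hashagg, for example, also keeps aggregating tuples whose groups
     * are already in memory). */
    while ((tup = next_tuple()) != NULL)
    {
        if (!spilled && memory_limit_reached())
            spilled = true;

        if (!spilled)
            tupleindex_insert_or_update(idx, tup);   /* early aggregation */
        else
            batch_write(batches[tuple_batch_no(tup, nbatches)], tup);
    }

    /* Step 3: no spill, so the index already holds all groups in key order. */
    if (!spilled)
    {
        emit_groups_in_order(idx);
        return;
    }

    /* Step 4.1: spill the current index as the initial sorted run. */
    runs[nruns++] = spill_index_as_sorted_run(idx);

    /* Steps 4.2-4.3: re-read each batch, rebuild the index, spill another run. */
    for (int i = 0; i < nbatches; i++)
    {
        tupleindex_reset(idx);
        batch_rewind(batches[i]);
        while ((tup = batch_read(batches[i])) != NULL)
            tupleindex_insert_or_update(idx, tup);
        runs[nruns++] = spill_index_as_sorted_run(idx);
    }

    /* Step 4.4: final external merge sort over all sorted runs. */
    merge_sorted_runs(runs, nruns);
}
```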

The main benefit of this strategy is that we perform grouping and sorting at
the same time, with early aggregation. Its cost is therefore calculated for
both grouping and comparison, but we can win thanks to early aggregation
(which a Sort + Group plan does not support): the index holds one entry per
group, so memory usage and the data that has to be sorted are bounded by the
number of groups rather than the number of input rows.
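
As a toy illustration of this point (this standalone program is mine, not
taken from the patches): each input row either bumps an existing group's
aggregate in an ordered in-memory structure or inserts a new entry, so an
in-order walk at the end already yields the groups in sorted order, with no
separate Sort on top.

```
/*
 * Toy illustration (not from the patches): GROUP BY with early aggregation
 * on an ordered in-memory structure.  Memory is bounded by the number of
 * distinct groups, and the in-order walk emits groups already sorted.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node
{
    int          key;        /* GROUP BY value */
    long         count;      /* count(*) aggregate state */
    struct Node *left, *right;
} Node;

/* Insert the key, or update the aggregate if the group already exists. */
static Node *
upsert(Node *n, int key)
{
    if (n == NULL)
    {
        n = malloc(sizeof(Node));
        n->key = key;
        n->count = 1;
        n->left = n->right = NULL;
    }
    else if (key < n->key)
        n->left = upsert(n->left, key);
    else if (key > n->key)
        n->right = upsert(n->right, key);
    else
        n->count++;             /* early aggregation: the row itself is not kept */
    return n;
}

/* In-order walk: groups come out in sorted order for free. */
static void
emit(Node *n)
{
    if (n == NULL)
        return;
    emit(n->left);
    printf("group %d: count=%ld\n", n->key, n->count);
    emit(n->right);
}

int
main(void)
{
    Node *root = NULL;

    /* 3000 rows but only 12 groups, mimicking "i % 12" from the example below. */
    for (int i = 0; i < 3000; i++)
        root = upsert(root, i % 12);

    emit(root);                 /* 12 rows, already ordered by the group key */
    return 0;
}
```

Compiled and run, this prints 12 ordered groups with count=250 each; the only
point is that grouping, aggregation and ordering come out of a single pass
over the input.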

When I was fixing the tests, most of the changes occurred in partition_aggregate.out.
The output changed in the following way:

```
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;

EXPLAIN (COSTS OFF)
SELECT count(*) FROM pagg_tab GROUP BY c ORDER BY c LIMIT 1;

-- Old
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Limit  (cost=80.18..80.18 rows=1 width=13)
   ->  Sort  (cost=80.18..80.21 rows=12 width=13)
         Sort Key: pagg_tab.c
         ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13)
               Group Key: pagg_tab.c
               ->  Append  (cost=0.00..65.00 rows=3000 width=5)
                     ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
                     ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
                     ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)

-- New
SET enable_hashagg to off;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=129.77..129.49 rows=1 width=13)
   ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13)
         Group Key: pagg_tab.c
         ->  Append  (cost=0.00..65.00 rows=3000 width=5)
               ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
               ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
               ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
(7 rows)

```

There is a cheat here - disabling hashagg - but if we actually run the query,
then (on my PC) we can see that Index Aggregate executes faster:

```
-- sort + hash
SET enable_hashagg TO on;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80.18..80.18 rows=1 width=13) (actual time=2.040..2.041 rows=1.00 loops=1)
   Buffers: shared hit=20
   ->  Sort  (cost=80.18..80.21 rows=12 width=13) (actual time=2.039..2.040 rows=1.00 loops=1)
         Sort Key: pagg_tab.c
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=20
         ->  HashAggregate  (cost=80.00..80.12 rows=12 width=13) (actual time=2.025..2.028 rows=12.00 loops=1)
               Group Key: pagg_tab.c
               Batches: 1  Memory Usage: 32kB
               Buffers: shared hit=20
               ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.017..0.888 rows=3000.00 loops=1)
                     Buffers: shared hit=20
                     ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.016..0.301 rows=1250.00 loops=1)
                           Buffers: shared hit=8
                     ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.225 rows=1000.00 loops=1)
                           Buffers: shared hit=7
                     ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.006..0.171 rows=750.00 loops=1)
                           Buffers: shared hit=5
 Planning Time: 0.119 ms
 Execution Time: 2.076 ms
(20 rows)

-- index agg
SET enable_hashagg TO off;
 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=129.77..129.49 rows=1 width=13) (actual time=1.789..1.790 rows=1.00 loops=1)
   Buffers: shared hit=20
   ->  IndexAggregate  (cost=129.77..126.39 rows=12 width=13) (actual time=1.788..1.789 rows=1.00 loops=1)
         Group Key: pagg_tab.c
         Buffers: shared hit=20
         ->  Append  (cost=0.00..65.00 rows=3000 width=5) (actual time=0.020..0.865 rows=3000.00 loops=1)
               Buffers: shared hit=20
               ->  Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.020..0.290 rows=1250.00 loops=1)
                     Buffers: shared hit=8
               ->  Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.229 rows=1000.00 loops=1)
                     Buffers: shared hit=7
               ->  Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.007..0.165 rows=750.00 loops=1)
                     Buffers: shared hit=5
 Planning Time: 0.105 ms
 Execution Time: 1.825 ms
(15 rows)
```

The mean IndexAgg time is about 1.8 ms versus about 2 ms for hash + sort, so the win is about 10%.

I have also run the TPC-H tests; two queries (4 and 5) used the Index Aggregate
node, which gave close to a 5% gain in execution time.

This research was inspired by Goetz Graefe's paper "Efficient sorting, duplicate
removal, grouping, and aggregation". However, some of the proposed ideas are hard
to implement in PostgreSQL, e.g. using partitioned B-trees that store their pages
in shared buffers, or making use of offset-value encoding.

More details of the implementation:

1. The in-memory index is implemented as a B+tree and stores pointers to tuples
2. The size of each B+tree node is set by a macro. It is currently 63, which allows
    some optimizations, e.g. distributing tuples uniformly during a page split
    (a hypothetical node layout is sketched after this list)
3. The in-memory index has the key abbreviation optimization
4. tuplesort.c is used to implement the external merge sort. This is done by setting
    up its state in such a way that we can simply call 'mergeruns'
5. When we store tuples on disk during a sorted-run spill, we perform the projection,
    so the stored tuples are ready to be returned after the merge. This is done mostly
    because we already have the returning TupleDesc and do not have to deal with the
    AggStatePerGroup state (it has complex logic with 2 boolean flags)
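
For illustration, a node layout along the following lines would match that
description; the names and field types here are my guesses, not the actual
definitions from the 0001 patch:

```
/*
 * Hypothetical sketch of an in-memory B+tree node with a fanout of 63,
 * storing pointers to tuples plus abbreviated keys.  All names and the
 * exact layout are illustrative guesses, not taken from the 0001 patch.
 */
#include <stdbool.h>
#include <stdint.h>

#define TUPLEINDEX_NODE_FANOUT 63       /* node size fixed by a macro, per the list above */

typedef struct TupleIndexNode TupleIndexNode;

struct TupleIndexNode
{
    uint16_t        nentries;           /* used slots, <= TUPLEINDEX_NODE_FANOUT */
    bool            is_leaf;
    /* Abbreviated keys make the first comparison on each slot cheap. */
    uint64_t        abbrev_keys[TUPLEINDEX_NODE_FANOUT];
    /* Leaves point to grouped tuples / per-group state; internal nodes
     * would hold child pointers in the same slots. */
    void           *entries[TUPLEINDEX_NODE_FANOUT];
    TupleIndexNode *next_leaf;          /* leaf chain for in-order scans and run spills */
};
```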

For now a bare minimum is implemented: the in-memory index, the disk spill logic,
and EXPLAIN ANALYZE support.

There are 4 patches attached:

1. 0001-add-in-memory-btree-tuple-index.patch - adds the in-memory index (TupleIndex)
2. 0002-introduce-AGG_INDEX-grouping-strategy-node.patch - implementation of the
    Index Aggregate grouping strategy
3. 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch - the planner adds
    Index Aggregate paths to the pathlist and EXPLAIN ANALYZE shows statistics for
    this node
4. 0004-fix-tests-for-IndexAggregate.patch - fixes test output and adds some extra
    tests for the new node

There are open questions and todos:

- No support for parallel execution. The main challenge here is to preserve the sort
  invariant and support partial aggregates.
- Use a more suitable in-memory index. For example, a T-Tree is the first candidate.
- No SGML documentation yet.
- Fix and adapt the tests. Not all tests are fixed by the 4th patch.
- Tune the planner estimate. In the example above, the cost of Index Aggregate was
  higher, but it was actually faster.

---

Sergey Soloviev

TantorLabs: https://tantorlabs.com

Attachment Content-Type Size
0001-add-in-memory-btree-tuple-index.patch text/x-patch 23.6 KB
0002-introduce-AGG_INDEX-grouping-strategy-node.patch text/x-patch 86.4 KB
0003-make-use-of-IndexAggregate-in-planner-and-explain.patch text/x-patch 18.9 KB
0004-fix-tests-for-IndexAggregate.patch text/x-patch 31.3 KB
