| From: | Sergey Soloviev <sergey(dot)soloviev(at)tantorlabs(dot)ru> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Introduce Index Aggregate - new GROUP BY strategy |
| Date: | 2025-12-08 15:36:58 |
| Message-ID: | 7d6d8cc4-d74a-4edc-bbeb-944916cf1093@tantorlabs.ru |
| Lists: | pgsql-hackers |
Hi, hackers!
I would like to introduce a new GROUP BY strategy, called Index Aggregate.
In a nutshell, we build a B+tree index where the GROUP BY attributes are the
index keys; if the memory limit is reached, we build an index for each batch
and spill it to disk as a sorted run, performing a final external merge.
It works (and is implemented) much like Hash Aggregate; most of the differences
are in the spill logic:
1. As tuples arrive, build an in-memory B+tree index
2. If the memory limit is reached, switch to spill mode (almost like hashagg):
   - calculate the hash of the tuple
   - decide which batch it should be stored in
   - spill the tuple to that batch
3. When all tuples are processed and there was no disk spill, return all tuples
   from the in-memory index
4. Otherwise:
   1. Spill the current index to disk, creating an initial sorted run
   2. Re-read each batch, building an in-memory index (it may spill again)
   3. At the end of each batch, spill it to disk, creating another sorted run
   4. Perform the final external merge sort
The main benefit of this strategy is that we perform grouping and sorting
at the same time, with early aggregation. The cost is therefore calculated for
both grouping and comparison, but we can win thanks to early aggregation (which
is not supported by a Sort + Group node pair).
When I was fixing the tests, most of the changes occurred in partition_aggregate.out.
The output changed as follows:
```
CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('0000', '0001', '0002', '0003', '0004');
CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM0000'), i % 30 FROM generate_series(0, 2999) i;
ANALYZE pagg_tab;
EXPLAIN (COSTS OFF)
SELECT count(*) FROM pagg_tab GROUP BY c ORDER BY c LIMIT 1;
-- Old
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (cost=80.18..80.18 rows=1 width=13)
-> Sort (cost=80.18..80.21 rows=12 width=13)
Sort Key: pagg_tab.c
-> HashAggregate (cost=80.00..80.12 rows=12 width=13)
Group Key: pagg_tab.c
-> Append (cost=0.00..65.00 rows=3000 width=5)
-> Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
-> Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
-> Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
-- New
SET enable_hashagg to off;
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit (cost=129.77..129.49 rows=1 width=13)
-> IndexAggregate (cost=129.77..126.39 rows=12 width=13)
Group Key: pagg_tab.c
-> Append (cost=0.00..65.00 rows=3000 width=5)
-> Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5)
-> Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5)
-> Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5)
(7 rows)
```
There is a cheat here - hashagg is disabled - but if we actually run this,
then (on my PC) the index aggregate executes faster:
```
-- sort + hash
SET enable_hashagg TO on;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=80.18..80.18 rows=1 width=13) (actual time=2.040..2.041 rows=1.00 loops=1)
Buffers: shared hit=20
-> Sort (cost=80.18..80.21 rows=12 width=13) (actual time=2.039..2.040 rows=1.00 loops=1)
Sort Key: pagg_tab.c
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=20
-> HashAggregate (cost=80.00..80.12 rows=12 width=13) (actual time=2.025..2.028 rows=12.00 loops=1)
Group Key: pagg_tab.c
Batches: 1 Memory Usage: 32kB
Buffers: shared hit=20
-> Append (cost=0.00..65.00 rows=3000 width=5) (actual time=0.017..0.888 rows=3000.00 loops=1)
Buffers: shared hit=20
-> Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.016..0.301 rows=1250.00 loops=1)
Buffers: shared hit=8
-> Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.225 rows=1000.00 loops=1)
Buffers: shared hit=7
-> Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.006..0.171 rows=750.00 loops=1)
Buffers: shared hit=5
Planning Time: 0.119 ms
Execution Time: 2.076 ms
(20 rows)
-- index agg
SET enable_hashagg TO off;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=129.77..129.49 rows=1 width=13) (actual time=1.789..1.790 rows=1.00 loops=1)
Buffers: shared hit=20
-> IndexAggregate (cost=129.77..126.39 rows=12 width=13) (actual time=1.788..1.789 rows=1.00 loops=1)
Group Key: pagg_tab.c
Buffers: shared hit=20
-> Append (cost=0.00..65.00 rows=3000 width=5) (actual time=0.020..0.865 rows=3000.00 loops=1)
Buffers: shared hit=20
-> Seq Scan on pagg_tab_p1 pagg_tab_1 (cost=0.00..20.50 rows=1250 width=5) (actual time=0.020..0.290 rows=1250.00 loops=1)
Buffers: shared hit=8
-> Seq Scan on pagg_tab_p2 pagg_tab_2 (cost=0.00..17.00 rows=1000 width=5) (actual time=0.007..0.229 rows=1000.00 loops=1)
Buffers: shared hit=7
-> Seq Scan on pagg_tab_p3 pagg_tab_3 (cost=0.00..12.50 rows=750 width=5) (actual time=0.007..0.165 rows=750.00 loops=1)
Buffers: shared hit=5
Planning Time: 0.105 ms
Execution Time: 1.825 ms
(15 rows)
```
The mean IndexAgg time is about 1.8 ms versus about 2 ms for hash + sort, so the
win is about 10%. Also, I have run the TPC-H tests: two queries (4 and 5) used
the Index Agg node, and this gave nearly a 5% gain in execution time.
This research was inspired by Goetz Graefe's paper "Efficient sorting, duplicate
removal, grouping, and aggregation". However, some of the proposed ideas are hard
to implement in PostgreSQL, e.g. storing the pages of partitioned B-trees in
shared buffers, or making use of offset-value encoding.
More details of the implementation:
1. The in-memory index is implemented as a B+tree and stores pointers to tuples
2. The size of each B+tree node is set via a macro. It is currently 63, which
   allows us to use some optimizations, e.g. distributing tuples uniformly
   during a page split
3. The in-memory index has the key abbreviation optimization
4. tuplesort.c is used to implement the external merge sort. This is done by
   setting up its state in such a way that we can just call 'mergeruns'
5. When we store tuples on disk during a sorted-run spill, we perform the
   projection, so the stored tuples are ready to be returned after the merge.
   This is done mostly because we already have the returning TupleDesc and do
   not have to deal with AggStatePerGroup state (it has complex logic with
   2 boolean flags)
For now, a bare minimum is implemented: the in-memory index, the disk spill
logic, and support in EXPLAIN ANALYZE.
There are 4 patches attached:
1. 0001-add-in-memory-btree-tuple-index.patch - adds the in-memory index, TupleIndex
2. 0002-introduce-AGG_INDEX-grouping-strategy-node.patch - implementation of
   the Index Aggregate grouping strategy
3. 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch - the planner
   adds Index Aggregate nodes to the pathlist, and EXPLAIN ANALYZE shows
   statistics for the node
4. 0004-fix-tests-for-IndexAggregate.patch - fixes test output and adds some
   extra tests for the new node
There are open questions and TODOs:
- No support for parallel execution. The main challenge here is to preserve the
  sort invariant and support partial aggregates.
- Use a more suitable in-memory index. For example, a T-Tree is the first
  candidate.
- No SGML documentation yet.
- Fix and adapt tests. Not all tests are fixed by the 4th patch.
- Tune the planner estimate. In the example, the cost of the index agg was
  higher, but it was actually faster.
---
Sergey Soloviev
TantorLabs: https://tantorlabs.com
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-add-in-memory-btree-tuple-index.patch | text/x-patch | 23.6 KB |
| 0002-introduce-AGG_INDEX-grouping-strategy-node.patch | text/x-patch | 86.4 KB |
| 0003-make-use-of-IndexAggregate-in-planner-and-explain.patch | text/x-patch | 18.9 KB |
| 0004-fix-tests-for-IndexAggregate.patch | text/x-patch | 31.3 KB |