Re: MergeAppend could consider sorting cheapest child path

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Alexander Pyhalov <a(dot)pyhalov(at)postgrespro(dot)ru>, Andy Fan <zhihuifan1213(at)163(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nikita Malakhov <HukuToc(at)gmail(dot)com>
Subject: Re: MergeAppend could consider sorting cheapest child path
Date: 2025-07-26 22:51:13
Message-ID: CAPpHfdvgAjOpLmZntLUSR_8GD=S0nCkFxC6UxQrJVOzLc7yi_Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>
> On 4/6/2025 00:41, Alexander Korotkov wrote:
> > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> >> For me, it seems like a continuation of the 7d8ac98 discussion. We may
> >> charge a small fee for MergeAppend to adjust the balance, of course.
> >> However, I think this small change requires a series of benchmarks to
> >> determine how it affects the overall cost balance. Without examples it
> >> is hard to say how important this issue is and its worthiness to
> >> commence such work.
> >
> > Yes, I think it's fair to charge the MergeAppend node. We currently
> > cost it similarly to the merge stage of Sort, but it's clearly more
> > expensive: it works at the executor level, dealing with slots etc.,
> > while the Sort node has a set of lower-level optimizations.
> After conducting additional research, I concluded that you are correct,
> and the current cost model doesn't allow the optimiser to detect the
> best option. A simple test with a full scan and sort of a partitioned
> table (see attachment) shows that the query plan prefers small
> per-partition sorts merged by the MergeAppend node. I got the following
> results for different numbers of tuples to be sorted (in the range from
> 10 tuples to 1E8):
>
> EXPLAIN SELECT * FROM test ORDER BY y;
>
> 1E1: Sort (cost=9.53..9.57 rows=17 width=8)
> 1E2: Sort (cost=20.82..21.07 rows=100)
> 1E3: Merge Append (cost=56.19..83.69 rows=1000)
> 1E4: Merge Append (cost=612.74..887.74 rows=10000)
> 1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
> 1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
> 1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
> 1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)
>
> That happens because both total costs lie within the fuzzy factor gap,
> and the optimiser chooses the path based on the startup cost, which is
> obviously better for the MergeAppend case.
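
For context, the tie-breaking described here happens in
compare_path_costs_fuzzily() in pathnode.c. A simplified sketch of that
logic (the real STD_FUZZ_FACTOR of 1.01, but with the pathkeys and
parameterization checks and the COSTS_DIFFERENT case omitted):

    typedef struct { double startup_cost; double total_cost; } PathCosts;
    typedef enum { COSTS_EQUAL, COSTS_BETTER1, COSTS_BETTER2 } CostCmp;

    #define STD_FUZZ_FACTOR 1.01

    static CostCmp
    fuzzy_cost_compare(const PathCosts *p1, const PathCosts *p2)
    {
        /* a path must win on total cost by more than 1% to count */
        if (p1->total_cost > p2->total_cost * STD_FUZZ_FACTOR)
            return COSTS_BETTER2;
        if (p2->total_cost > p1->total_cost * STD_FUZZ_FACTOR)
            return COSTS_BETTER1;

        /*
         * Total costs are fuzzily equal, so startup cost breaks the
         * tie -- exactly what favours MergeAppend in the plans above.
         */
        if (p1->startup_cost > p2->startup_cost * STD_FUZZ_FACTOR)
            return COSTS_BETTER2;
        if (p2->startup_cost > p1->startup_cost * STD_FUZZ_FACTOR)
            return COSTS_BETTER1;
        return COSTS_EQUAL;
    }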
>
> At the same time, execution, involving a single Sort node, dominates the
> MergeAppend case:
>
> 1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
> 1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
> 1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
> 1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
> 1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
> 1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms
>
> Looking inside the code, I found the only difference we can employ to
> rationalise the cost model change: the re-structuring of the heap. The
> siftDown routine employs two tuple comparisons per level to find the
> proper position for a tuple. So, we have grounds for changing the
> constant in the cost model of the merge operation:
>
> @@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
> logN = LOG2(N);
>
> /* Assumed cost per tuple comparison */
> - comparison_cost = 2.0 * cpu_operator_cost;
> + comparison_cost = 4.0 * cpu_operator_cost;
>
> /* Heap creation cost */
> startup_cost += comparison_cost * N * logN;
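
To illustrate the "two comparisons" point: each level a sifted tuple
descends costs one comparison to pick the smaller child plus one to
test the parent against it. A minimal sketch over plain ints (the real
routines in binaryheap.c and tuplesort.c operate on SortTuples):

    #include <stddef.h>

    /* Minimal min-heap sift-down: two comparisons per level descended. */
    static void
    sift_down(int *heap, size_t n, size_t i)
    {
        while (2 * i + 1 < n)
        {
            size_t  child = 2 * i + 1;      /* left child */

            /* comparison #1: pick the smaller of the two children */
            if (child + 1 < n && heap[child + 1] < heap[child])
                child++;

            /* comparison #2: stop if the parent is already in place */
            if (heap[i] <= heap[child])
                break;

            int     tmp = heap[i];
            heap[i] = heap[child];
            heap[child] = tmp;
            i = child;
        }
    }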
>
> The same change also needs to be made in the cost_gather_merge
> function, of course.
> At this moment, I'm not sure that it should be multiplied by 2 - it is a
> subject for further discussion. However, it fixes the current issue and
> adds a little additional cost to the merge operation, which, as I see
> it, is quite limited.
> Please see the new version of the patch attached.

I've checked the cost adjustment you've made. If you change the cost
of top-N sort, you must also change the preceding if-condition that
decides when it gets selected. We currently do the switch at
tuples = 2 * limit_tuples. If we apply the proposed change, we should
switch at tuples = (2 * limit_tuples)^2. But also, in order for this
cost to reflect reality, we must change tuplesort_puttuple_common() in
the same way. These inconsistencies lead to failures in the
contrib/postgres_fdw checks. But these changes don't appear to be a
win. See the examples below: top-N sort appears to be a win for
LIMITs up to 500000.
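
To spell out the arithmetic behind that switch point, assuming
cost_tuplesort()'s current formulas: quicksort is costed at roughly
2.0 * cpu_operator_cost * tuples * log2(tuples), while top-N heapsort
with the proposed constant becomes roughly
4.0 * cpu_operator_cost * tuples * log2(2 * limit_tuples). They break
even when log2(tuples) = 2 * log2(2 * limit_tuples), i.e. at
tuples = (2 * limit_tuples)^2. The executor-side heuristic in
tuplesort_puttuple_common(), which today abandons quicksort once
memtupcount exceeds bound * 2, would have to move in step.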

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..112182.85 rows=10000 width=8) (actual time=638.998..639.841 rows=10000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=638.996..639.340 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=126.582..205.610 rows=1000000.00 loops=1)
 Planning Time: 0.118 ms
 Execution Time: 653.283 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=646.522..688.573 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=646.520..663.562 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=129.028..208.936 rows=1000000.00 loops=1)
 Planning Time: 0.188 ms
 Execution Time: 713.738 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78938.56..78963.56 rows=10000 width=8) (actual time=412.633..413.459 rows=10000.00 loops=1)
   Buffers: shared hit=3, temp read=1709 written=1709
   ->  Sort  (cost=78938.56..81438.56 rows=1000000 width=8) (actual time=412.631..412.969 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 769kB
         Buffers: shared hit=3, temp read=1709 written=1709
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=185.892..333.233 rows=1000000.00 loops=1)
               Buffers: temp read=1709 written=1709
 Planning:
   Buffers: shared hit=7 read=8
 Planning Time: 2.058 ms
 Execution Time: 416.040 ms
(12 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=631.185..673.768 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=631.183..649.072 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=121.274..200.453 rows=1000000.00 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 698.841 ms
(7 rows)

I have another idea: cost_tuplesort() puts 2.0 under the logarithm to
prefer plain quicksort over top-N heapsort. I think we can adjust
cost_gather_merge() and cost_merge_append() to do the same; the 0001
patch implements that. I think the plan changes from 0001 might be
reasonable, since most of the affected cases deal with small rowsets.
One thing concerns me: 0002 still affects one of the postgres_fdw
checks. Could you please take a look?
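
For concreteness, the idea would look roughly like this in
cost_merge_append(), mirroring the LOG2(2.0 * output_tuples) term of
cost_tuplesort() (a sketch of the intent; the actual 0001 hunks may
differ):

 	/* Heap creation cost */
-	startup_cost += comparison_cost * N * logN;
+	startup_cost += comparison_cost * N * LOG2(2.0 * N);

 	/* Per-tuple heap maintenance cost */
-	run_cost += tuples * comparison_cost * logN;
+	run_cost += tuples * comparison_cost * LOG2(2.0 * N);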

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v8-0001-Add-some-penatly-to-gather-merge-and-merge-append.patch application/octet-stream 13.4 KB
v8-0002-Consider-an-explicit-sort-of-the-MergeAppend-subp.patch application/octet-stream 26.3 KB
