Benchmark: Width-Adjusted parallel_tuple_cost
=============================================

Timestamp: 2026-03-30T12:19:00Z
Patch: 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
Base commit: 01d58d7e3ff (PostgreSQL 19devel)

Hardware
--------
Architecture: aarch64 (ARM64)
CPUs: 6
RAM: 11 GB
Disk: SSD

PostgreSQL Configuration
------------------------
shared_buffers: 2GB
work_mem: 32MB
max_parallel_workers_per_gather: 4
max_parallel_workers: 8
parallel_tuple_cost: 0.1 (default)
io_method: sync
Build flags: -Dcassert=false -Db_ndebug=true -Dbuildtype=debugoptimized

Overview
--------
The parallel_tuple_cost GUC applies a flat per-tuple penalty to Gather and
Gather Merge nodes regardless of how wide the tuples are. For queries where
partial aggregate results pass through the tuple queue, these tuples are
typically very narrow (8-52 bytes), but they are charged the same 0.1/tuple
as wide rows. This overcharges narrow-tuple Gathers and can cause the
planner to reject parallel plans that are 2-3x faster.

The patch scales parallel_tuple_cost by tuple width relative to a 100-byte
reference, with a 10% fixed floor for irreducible queue synchronization
overhead:

    effective_cost = parallel_tuple_cost * (0.10 + 0.90 * max(width, 1) / 100)

    Width  12 bytes -> factor 0.208 -> effective cost 0.021/tuple
    Width  52 bytes -> factor 0.568 -> effective cost 0.057/tuple
    Width 100 bytes -> factor 1.000 -> effective cost 0.100/tuple (unchanged)
    Width 148 bytes -> factor 1.432 -> effective cost 0.143/tuple

Benchmark 1: Narrow-Output Aggregate (Plan Flip: Serial -> Parallel)
--------------------------------------------------------------------

Table setup:

    CREATE TABLE bench_wide AS
    SELECT i AS id,
           (i % 5000000) AS group_id,
           random() * 1000 AS val1,
           random() * 1000 AS val2,
           repeat('x', 200) AS padding
    FROM generate_series(1, 50000000) i;
    VACUUM ANALYZE bench_wide;

50M rows, 12 GB on disk.
5 columns: id int4 (4 bytes), group_id int4 (4 bytes), val1 float8 (8 bytes),
val2 float8 (8 bytes), padding text (avg 204 bytes). 5M distinct group_id
values (10 rows per group). Source rows are wide (avg ~228 bytes) but the
aggregate output is narrow: group_id + 3 aggregate accumulators = width 52
at the Gather Merge node.

Query:

    SELECT group_id, count(*), sum(val1), avg(val2)
    FROM bench_wide
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 5M groups, this produces ~22.5M partial aggregate rows
(width 52) through Gather Merge. The Gather Merge cost contribution is the
decisive factor:

    Unpatched: 0.1 * 22.5M         = 2,250,000
    Patched:   0.1 * 0.568 * 22.5M = 1,278,000  (43% less)

Results:

UNPATCHED — planner chooses serial despite parallel being available:

    Limit (cost=6734041..6734041 rows=10 width=28)
      -> Sort (cost=6734041..6748078 rows=5614666 width=28)
           -> HashAggregate (cost=5956600..6612710 rows=5614666 width=28)
                Group Key: group_id
                Planned Partitions: 32
                -> Seq Scan on bench_wide (cost=0..2112919 rows=49999100 width=20)

    Execution times: 30783ms, 27386ms, 24555ms, 24826ms
    Median: ~26s

PATCHED — planner now correctly chooses parallel:

    Limit (cost=5753518..5753518 rows=10 width=28)
      -> Sort (cost=5753518..5767555 rows=5614666 width=28)
           -> Finalize GroupAggregate (cost=3468705..5632187 rows=5614666 width=28)
                -> Gather Merge (cost=3468705..5337417 rows=22458664 width=52)
                     Workers Planned: 4
                     -> Partial GroupAggregate (cost=3467705..3680099 rows=5614666 width=52)
                          -> Sort (cost=3467705..3498955 rows=12499775 width=20)
                               -> Parallel Seq Scan on bench_wide (cost=0..1737926 rows=12499775 width=20)

    Execution times: 9197ms, 9675ms, 9056ms, 11078ms
    Median: ~9.4s

Verification — unpatched binary, parallel forced with
parallel_tuple_cost=0.001: same parallel plan structure as patched.
    Execution times: 9023ms, 9646ms, 9548ms, 11196ms
    Median: ~9.6s

Summary:

    Unpatched (serial, planner's choice):   ~26s
    Patched (parallel, planner's choice):   ~9.4s
    Speedup:                                2.7x

The parallel plan is genuinely faster. The unpatched planner refused to pick
it because the flat 0.1/tuple * 22.5M rows = 2.25M Gather cost made the
parallel total (5.75M) appear close to the serial total (6.73M), and the
serial plan avoided the Finalize GroupAggregate overhead. With width
adjustment, the Gather cost drops to 1.28M, making the parallel plan clearly
cheaper.

Benchmark 2: Wide-Output Aggregate (No Regression)
--------------------------------------------------

Table setup:

    CREATE TABLE bench_narrow AS
    SELECT i AS id,
           (i % 500000) AS group_id,
           (random() * 1000)::numeric(10,2) AS val1,
           (random() * 1000)::numeric(10,2) AS val2,
           (random() * 1000)::numeric(10,2) AS val3,
           (random() * 1000)::numeric(10,2) AS val4,
           (random() * 1000)::numeric(10,2) AS val5,
           (random() * 1000)::numeric(10,2) AS val6,
           (random() * 1000)::numeric(10,2) AS val7,
           (random() * 1000)::numeric(10,2) AS val8
    FROM generate_series(1, 20000000) i;
    VACUUM ANALYZE bench_narrow;

20M rows, 1776 MB on disk.
10 columns: id int4 (4 bytes), group_id int4 (4 bytes), val1..val8
numeric(10,2) (avg 6 bytes each). 500K distinct group_id values (40 rows
per group). Source rows are narrow (avg ~52 bytes) but the aggregate output
is wide: group_id + count + 4 sums + 4 avgs (each avg expands to sum + count
internally) = width 268 at Gather Merge.

Query:

    SELECT group_id, count(*),
           sum(val1), sum(val2), sum(val3), sum(val4),
           avg(val5), avg(val6), avg(val7), avg(val8)
    FROM bench_narrow
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 500K groups, this produces ~2M partial aggregate rows
(width 268) through Gather Merge.
The width-adjusted cost correctly charges MORE for these wide tuples:

    Unpatched Gather Merge: cost 1,372,038  (0.1 * 2.08M         = 208K contribution)
    Patched Gather Merge:   cost 1,702,787  (0.1 * 2.512 * 2.08M = 523K contribution)

Both patched and unpatched choose the same parallel plan.

Results:

UNPATCHED:

    Limit (cost=1492668..1492668 rows=10 width=268)
      -> Sort (cost=1492668..1493970 rows=520832 width=268)
           -> Finalize GroupAggregate (cost=1122591..1481413 rows=520832 width=268)
                -> Gather Merge (cost=1122591..1372038 rows=2083328 width=268)
                     Workers Planned: 4
                     -> Sort (cost=1121591..1122893 rows=520832 width=268)
                          -> Partial HashAggregate (cost=892982..1006267 rows=520832 width=268)
                               -> Parallel Seq Scan on bench_narrow (cost=0..277330 rows=5000216 width=52)

    Execution times: 6525ms, 6575ms, 6447ms
    Median: ~6.5s

PATCHED:

    Limit (cost=1823417..1823417 rows=10 width=268)
      -> Sort (cost=1823417..1824719 rows=520832 width=268)
           -> Finalize GroupAggregate (cost=1122591..1812162 rows=520832 width=268)
                -> Gather Merge (cost=1122591..1702787 rows=2083328 width=268)
                     Workers Planned: 4
                     -> Sort (cost=1121591..1122893 rows=520832 width=268)
                          -> Partial HashAggregate (cost=892982..1006267 rows=520832 width=268)
                               -> Parallel Seq Scan on bench_narrow (cost=0..277330 rows=5000216 width=52)

    Execution times: 6784ms, 6869ms, 7047ms
    Median: ~6.9s

Summary: Same plan on both. The patched estimated cost is 22% higher (1.82M
vs 1.49M) because it correctly charges 2.51x the base rate for width-268
tuples. Execution times are within noise — the higher cost estimate does
not cause a regression to serial.