Benchmark: Width-Adjusted parallel_tuple_cost
=============================================

Timestamp: 2026-03-30T12:19:00Z
Patch: 0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
Base commit: 01d58d7e3ff (PostgreSQL 19devel)

Hardware
--------
Architecture: aarch64 (ARM64)
CPUs: 6
RAM: 11 GB
Disk: SSD

PostgreSQL Configuration
------------------------
shared_buffers: 2GB
work_mem: 32MB
max_parallel_workers_per_gather: 4
max_parallel_workers: 8
parallel_tuple_cost: 0.1 (default)
io_method: sync
Build flags: -Dcassert=false -Db_ndebug=true -Dbuildtype=debugoptimized

Overview
--------
The parallel_tuple_cost GUC applies a flat per-tuple penalty to Gather and
Gather Merge nodes regardless of how wide the tuples are. For queries where
partial aggregate results pass through the tuple queue, these tuples are
typically very narrow (8-52 bytes), but they are charged the same 0.1/tuple
as wide rows. This overcharges narrow-tuple Gathers and can cause the
planner to reject parallel plans that are 2-3x faster.

The patch scales parallel_tuple_cost by tuple width relative to a 100-byte
reference, with a 10% fixed floor for irreducible queue synchronization
overhead:

    effective_cost = parallel_tuple_cost * (0.10 + 0.90 * max(width, 1) / 100)

    Width  12 bytes -> factor 0.208 -> effective cost 0.021/tuple
    Width  52 bytes -> factor 0.568 -> effective cost 0.057/tuple
    Width 100 bytes -> factor 1.000 -> effective cost 0.100/tuple (unchanged)
    Width 148 bytes -> factor 1.432 -> effective cost 0.143/tuple

Benchmark 1: Narrow-Output Aggregate (Plan Flip: Serial -> Parallel)
--------------------------------------------------------------------

Table setup:

    CREATE TABLE bench_wide AS
    SELECT i AS id,
           (i % 5000000) AS group_id,
           random() * 1000 AS val1,
           random() * 1000 AS val2,
           repeat('x', 200) AS padding
    FROM generate_series(1, 50000000) i;
    VACUUM ANALYZE bench_wide;

50M rows, 12 GB on disk.
5 columns: id int4 (4 bytes), group_id int4 (4 bytes), val1 float8 (8 bytes),
val2 float8 (8 bytes), padding text (avg 204 bytes). 5M distinct group_id
values (10 rows per group). Source rows are wide (avg ~228 bytes) but the
aggregate output is narrow: group_id + 3 aggregate accumulators = width 52
at the Gather Merge node.

Query:

    SELECT group_id, count(*), sum(val1), avg(val2)
    FROM bench_wide
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 5M groups, this produces ~22.5M partial aggregate rows
(width 52) through Gather Merge. The Gather Merge cost contribution is the
decisive factor:

    Unpatched: 0.1 * 22.5M         = 2,250,000
    Patched:   0.1 * 0.568 * 22.5M = 1,278,000  (43% less)

Results:

UNPATCHED — planner chooses serial despite parallel being available:

    Limit (cost=6734041..6734041 rows=10 width=28)
      -> Sort (cost=6734041..6748078 rows=5614666 width=28)
           -> HashAggregate (cost=5956600..6612710 rows=5614666 width=28)
                Group Key: group_id
                Planned Partitions: 32
                -> Seq Scan on bench_wide (cost=0..2112919 rows=49999100 width=20)

    Execution times: 30783ms, 27386ms, 24555ms, 24826ms
    Median: ~26s

PATCHED — planner now correctly chooses parallel:

    Limit (cost=5753518..5753518 rows=10 width=28)
      -> Sort (cost=5753518..5767555 rows=5614666 width=28)
           -> Finalize GroupAggregate (cost=3468705..5632187 rows=5614666 width=28)
                -> Gather Merge (cost=3468705..5337417 rows=22458664 width=52)
                     Workers Planned: 4
                     -> Partial GroupAggregate (cost=3467705..3680099 rows=5614666 width=52)
                          -> Sort (cost=3467705..3498955 rows=12499775 width=20)
                               -> Parallel Seq Scan on bench_wide (cost=0..1737926 rows=12499775 width=20)

    Execution times: 9197ms, 9675ms, 9056ms, 11078ms
    Median: ~9.4s

Verification — unpatched binary, parallel forced with
parallel_tuple_cost=0.001: same parallel plan structure as patched.
    Execution times: 9023ms, 9646ms, 9548ms, 11196ms
    Median: ~9.6s

Summary:

    Unpatched (serial, planner's choice):   ~26s
    Patched (parallel, planner's choice):   ~9.4s
    Speedup:                                2.7x

The parallel plan is genuinely faster. The unpatched planner refused to pick
it because the flat 0.1/tuple * 22.5M rows = 2.25M Gather cost made the
parallel total (5.75M) appear close to the serial total (6.73M), and the
serial plan avoided the Finalize GroupAggregate overhead. With width
adjustment, the Gather cost drops to 1.28M, making the parallel plan clearly
cheaper.

Benchmark 2: Wide-Output Aggregate (No Regression)
--------------------------------------------------

Table setup:

    CREATE TABLE bench_narrow AS
    SELECT i AS id,
           (i % 500000) AS group_id,
           (random() * 1000)::numeric(10,2) AS val1,
           (random() * 1000)::numeric(10,2) AS val2,
           (random() * 1000)::numeric(10,2) AS val3,
           (random() * 1000)::numeric(10,2) AS val4,
           (random() * 1000)::numeric(10,2) AS val5,
           (random() * 1000)::numeric(10,2) AS val6,
           (random() * 1000)::numeric(10,2) AS val7,
           (random() * 1000)::numeric(10,2) AS val8
    FROM generate_series(1, 20000000) i;
    VACUUM ANALYZE bench_narrow;

20M rows, 1776 MB on disk.
10 columns: id int4 (4 bytes), group_id int4 (4 bytes), val1..val8
numeric(10,2) (avg 6 bytes each). 500K distinct group_id values (40 rows
per group). Source rows are narrow (avg ~52 bytes) but the aggregate output
is wide: group_id + count + 4 sums + 4 avgs (each avg expands to sum + count
internally) = width 268 at Gather Merge.

Query:

    SELECT group_id, count(*),
           sum(val1), sum(val2), sum(val3), sum(val4),
           avg(val5), avg(val6), avg(val7), avg(val8)
    FROM bench_narrow
    GROUP BY group_id
    ORDER BY count(*) DESC
    LIMIT 10;

With 4 workers and 500K groups, this produces ~2M partial aggregate rows
(width 268) through Gather Merge.
The width-adjusted cost correctly charges MORE for these wide tuples:

    Unpatched Gather Merge: cost 1,372,038  (0.1 * 2.08M         = 208K contribution)
    Patched Gather Merge:   cost 1,702,787  (0.1 * 2.512 * 2.08M = 523K contribution)

Both patched and unpatched choose the same parallel plan.

Results:

UNPATCHED:

    Limit (cost=1492668..1492668 rows=10 width=268)
      -> Sort (cost=1492668..1493970 rows=520832 width=268)
           -> Finalize GroupAggregate (cost=1122591..1481413 rows=520832 width=268)
                -> Gather Merge (cost=1122591..1372038 rows=2083328 width=268)
                     Workers Planned: 4
                     -> Sort (cost=1121591..1122893 rows=520832 width=268)
                          -> Partial HashAggregate (cost=892982..1006267 rows=520832 width=268)
                               -> Parallel Seq Scan on bench_narrow (cost=0..277330 rows=5000216 width=52)

    Execution times: 6525ms, 6575ms, 6447ms
    Median: ~6.5s

PATCHED:

    Limit (cost=1823417..1823417 rows=10 width=268)
      -> Sort (cost=1823417..1824719 rows=520832 width=268)
           -> Finalize GroupAggregate (cost=1122591..1812162 rows=520832 width=268)
                -> Gather Merge (cost=1122591..1702787 rows=2083328 width=268)
                     Workers Planned: 4
                     -> Sort (cost=1121591..1122893 rows=520832 width=268)
                          -> Partial HashAggregate (cost=892982..1006267 rows=520832 width=268)
                               -> Parallel Seq Scan on bench_narrow (cost=0..277330 rows=5000216 width=52)

    Execution times: 6784ms, 6869ms, 7047ms
    Median: ~6.9s

Summary: Same plan on both. The patched estimated cost is 22% higher (1.82M
vs 1.49M) because it correctly charges 2.51x the base rate for width-268
tuples. Execution times are within noise — the higher cost estimate does
not cause a regression to serial.