| From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: scale parallel_tuple_cost by tuple width |
| Date: | 2026-04-01 11:15:40 |
| Message-ID: | 2d7c4e54-6a0b-4b1a-87a6-278c49fa52f0@dunslane.net |
| Lists: | pgsql-hackers |
On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
> On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>> While investigating a performance issue, I found that it was extremely
>>> difficult to get a parallel plan in some cases due to the fixed
>>> parallel_tuple_cost. But this cost is not really fixed - it's going to
>>> be larger for larger tuples. So this proposal adjusts the cost used
>>> according to how large we expect the results to be.
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky. (See get_expr_width and
>> particularly get_typavgwidth, and note that we only have good
>> statistics-based numbers for plain Vars not function outputs.)
>> I agree that it could be useful to have some kind of adjustment here,
>> but I fear that linear scaling is putting way too much faith in the
>> quality of the data.
> (I suspect you're saying this because of the "Benchmark 2" in the text
> file, which contains aggregates that return a varlena type, whose
> width we won't estimate very well...)
>
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that hasn't stopped us from using them
> before. So why is this case so much more critical? We now also have
> GROUP BY before join abilities in the planner, which I suspect must
> also be putting trust in the very same thing. Also, varlena-returning
> Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so
> why avoid making improvements in this area because we're not great at
> one of the things that could be in the targetlist?
>
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
> get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version. Of course, the test
> query would have to have perfect width estimates, but that could be
> easy enough to do by populating a text typed GROUP BY column and
> populating that with all the same width of text for each of the tests
> before increasing the width for the next test, using a fixed-width
> aggregate each time, e.g. count(*). The "#define
> PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
> number. It would be good to know how close this is to reality.
> Ideally, it would be good to see results from an Apple M<something>
> chip and recent x86. In my experience, these produce very different
> performance results, so it might be nice to find a value that is
> somewhere in the middle of what we get from those machines. I suspect
> having the GROUP BY column with text widths from 8 to 1024, increasing
> in powers of two would be enough data points.
I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.
Setup: 10M rows, 100K distinct text values per table, text column
padded to a fixed width with lpad(). Query: SELECT txt, count(*)
FROM ptc_bench_W GROUP BY txt. This produces Partial HashAggregate
in workers, then Gather passes ~240K partial-aggregate tuples whose
width is dominated by the text column. 2 workers, work_mem=256MB,
cassert=off, debugoptimized build, aarch64 Linux.
I tested widths from 8 to 1024 bytes (10 data points). For each
width, I ran 5 iterations of both parallel and serial execution,
and computed the Gather overhead as:
overhead = T_parallel - (T_serial / 3)
This isolates the IPC cost: the serial time captures pure scan +
aggregate work, and dividing by 3 gives the ideal parallel time
(2 workers + leader). The excess is Gather overhead.
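To make the arithmetic above concrete, here is a minimal sketch of the
per-tuple overhead computation. The function name and the example
timings are mine, not from the attached script; only the formula
(subtract the ideal 3-way-parallel time, divide by tuple count) is
from the description above.

```python
def gather_overhead_us_per_tuple(t_parallel_s, t_serial_s, n_tuples,
                                 n_procs=3):
    """Per-tuple Gather IPC overhead in microseconds.

    Assumes the serial run measures pure scan + aggregate work and
    that a perfect n_procs-way split (2 workers + leader) would take
    t_serial_s / n_procs; anything beyond that is Gather overhead.
    """
    ideal_parallel_s = t_serial_s / n_procs
    overhead_s = t_parallel_s - ideal_parallel_s
    return overhead_s * 1e6 / n_tuples

# Illustrative numbers (not from the attached script's output):
# serial 6.0 s, parallel 2.24 s, ~240K tuples through Gather
# gives 1.0 us of IPC overhead per tuple.
```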
Results (microseconds per tuple through Gather, median of 5 runs):
Width(B) us/tuple Implied ptc (if ptc=0.1 at w=100)
-------- -------- ----------------------------------
8 0.30 0.032
16 0.24 0.025
32 0.77 0.083
64 0.72 0.078
128 1.03 0.111
256 1.62 0.175
384 2.90 0.313
512 3.21 0.346
768 4.12 0.445
1024 5.56 0.600
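The "Implied ptc" column is the measured per-tuple cost rescaled so
that the default parallel_tuple_cost of 0.1 corresponds to the cost at
the 100-byte reference width. A sketch of that rescaling, assuming a
reference cost of ~0.93 us/tuple at w=100 (the fitted value reported
below); small last-digit differences from the table are rounding:

```python
REF_COST_US = 0.93   # assumed us/tuple at the 100-byte reference width
REF_PTC = 0.1        # default parallel_tuple_cost

def implied_ptc(us_per_tuple):
    """Scale the default ptc by the measured cost relative to w=100."""
    return REF_PTC * us_per_tuple / REF_COST_US

for width, cost in [(8, 0.30), (128, 1.03), (1024, 5.56)]:
    print(width, round(implied_ptc(cost), 3))
```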
The best-fit models:
Linear: cost(w) = 0.42 + 0.0051 * w R² = 0.983
Power law: cost(w) = 0.061 * w^0.63 R² = 0.966
Linear fits best.
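The linear fit can be reproduced from the table with ordinary least
squares; for 10 points the closed-form sums are enough, no numeric
library needed:

```python
# Data from the results table above.
widths = [8, 16, 32, 64, 128, 256, 384, 512, 768, 1024]
costs  = [0.30, 0.24, 0.77, 0.72, 1.03, 1.62, 2.90, 3.21, 4.12, 5.56]

n = len(widths)
sw, sy = sum(widths), sum(costs)
sww = sum(w * w for w in widths)
swy = sum(w * y for w, y in zip(widths, costs))

# Ordinary least-squares slope and intercept.
slope = (n * swy - sw * sy) / (n * sww - sw * sw)
intercept = (sy - slope * sw) / n

# Coefficient of determination R^2 = 1 - SS_res / SS_tot.
ss_res = sum((y - (intercept + slope * w)) ** 2
             for w, y in zip(widths, costs))
mean_y = sy / n
ss_tot = sum((y - mean_y) ** 2 for y in costs)
r2 = 1 - ss_res / ss_tot

print(f"cost(w) = {intercept:.2f} + {slope:.4f} * w, R^2 = {r2:.3f}")
```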
One notable finding: at the proposed reference width of 100 bytes,
the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of
which 0.42 is fixed.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.
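The damping arithmetic is easy to check against the fitted model
(coefficients taken from the fit above):

```python
def cost(w, fixed=0.42, per_byte=0.0051):
    """Fitted Gather cost model, us/tuple at width w bytes."""
    return fixed + per_byte * w

# A 2x width-estimate error at the 100-byte reference point inflates
# the total cost by only ~1.55x, because the 0.42 us fixed component
# does not scale with width.
ratio = cost(200) / cost(100)
```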
The script used to get the timings is attached.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
| Attachment | Content-Type | Size |
|---|---|---|
| ptc_calibrate.sh | application/x-shellscript | 5.2 KB |