| From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: scale parallel_tuple_cost by tuple width |
| Date: | 2026-04-01 11:15:40 |
| Message-ID: | 2d7c4e54-6a0b-4b1a-87a6-278c49fa52f0@dunslane.net |
| Lists: | pgsql-hackers |
On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
> On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>> While investigating a performance issue, I found that it was extremely
>>> difficult to get a parallel plan in some cases due to the fixed
>>> parallel_tuple_cost. But this cost is not really fixed - it's going to
>>> be larger for larger tuples. So this proposal adjusts the cost used
>>> according to how large we expect the results to be.
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky. (See get_expr_width and
>> particularly get_typavgwidth, and note that we only have good
>> statistics-based numbers for plain Vars not function outputs.)
>> I agree that it could be useful to have some kind of adjustment here,
>> but I fear that linear scaling is putting way too much faith in the
>> quality of the data.
> (I suspect you're saying this because of the "Benchmark 2" in the text
> file, which contains aggregates that return a varlena type, whose
> width we won't estimate very well...)
>
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that hasn't stopped us from using them
> before. So why is this case so much more critical? We now also have
> GROUP BY before join abilities in the planner, which I suspect must
> also be putting trust in the very same thing. Also, varlena-returning
> Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so
> why avoid making improvements in this area because we're not great at
> one of the things that could be in the targetlist?
>
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
> get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version. Of course, the test
> query would have to have perfect width estimates, but that could be
> easy enough to do by populating a text typed GROUP BY column and
> populating that with all the same width of text for each of the tests
> before increasing the width for the next test, using a fixed-width
> aggregate each time, e.g. count(*). The "#define
> PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
> number. It would be good to know how close this is to reality.
> Ideally, it would be good to see results from an Apple M<something>
> chip and recent x86. In my experience, these produce very different
> performance results, so it might be nice to find a value that is
> somewhere in the middle of what we get from those machines. I suspect
> having the GROUP BY column with text widths from 8 to 1024, increasing
> in powers of two would be enough data points.
I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.
Setup: 10M rows, 100K distinct text values per table, text column
padded to a fixed width with lpad(). Query: SELECT txt, count(*)
FROM ptc_bench_W GROUP BY txt. This produces Partial HashAggregate
in workers, then Gather passes ~240K partial-aggregate tuples whose
width is dominated by the text column. 2 workers, work_mem=256MB,
cassert=off, debugoptimized build, aarch64 Linux.
I tested widths from 8 to 1024 bytes (10 data points). For each
width, I ran 5 iterations of both parallel and serial execution,
and computed the Gather overhead as:
overhead = T_parallel - (T_serial / 3)
This isolates the IPC cost: the serial time captures pure scan +
aggregate work, and dividing by 3 gives the ideal parallel time
(2 workers + leader). The excess is Gather overhead.
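To make the arithmetic above concrete, here is a minimal sketch of the
per-tuple overhead computation. The function name and the example
timings are mine, not from the attached script; only the formula
(subtract the ideal 3-way-parallel time, divide by tuple count) is
from the description above.

```python
def gather_overhead_us_per_tuple(t_parallel_s, t_serial_s, n_tuples,
                                 n_procs=3):
    """Per-tuple Gather IPC overhead in microseconds.

    Assumes the serial run measures pure scan + aggregate work and
    that a perfect n_procs-way split (2 workers + leader) would take
    t_serial_s / n_procs; anything beyond that is Gather overhead.
    """
    ideal_parallel_s = t_serial_s / n_procs
    overhead_s = t_parallel_s - ideal_parallel_s
    return overhead_s * 1e6 / n_tuples

# Illustrative numbers (not from the attached script's output):
# serial 6.0 s, parallel 2.24 s, ~240K tuples through Gather
# gives 1.0 us of IPC overhead per tuple.
```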
Results (microseconds per tuple through Gather, median of 5 runs):
Width(B) us/tuple Implied ptc (if ptc=0.1 at w=100)
-------- -------- ----------------------------------
8 0.30 0.032
16 0.24 0.025
32 0.77 0.083
64 0.72 0.078
128 1.03 0.111
256 1.62 0.175
384 2.90 0.313
512 3.21 0.346
768 4.12 0.445
1024 5.56 0.600
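The "Implied ptc" column is the measured per-tuple cost rescaled so
that the default parallel_tuple_cost of 0.1 corresponds to the cost at
the 100-byte reference width. A sketch of that rescaling, assuming a
reference cost of ~0.93 us/tuple at w=100 (the fitted value reported
below); small last-digit differences from the table are rounding:

```python
REF_COST_US = 0.93   # assumed us/tuple at the 100-byte reference width
REF_PTC = 0.1        # default parallel_tuple_cost

def implied_ptc(us_per_tuple):
    """Scale the default ptc by the measured cost relative to w=100."""
    return REF_PTC * us_per_tuple / REF_COST_US

for width, cost in [(8, 0.30), (128, 1.03), (1024, 5.56)]:
    print(width, round(implied_ptc(cost), 3))
```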
The best-fit models:
Linear: cost(w) = 0.42 + 0.0051 * w R² = 0.983
Power law: cost(w) = 0.061 * w^0.63 R² = 0.966
Linear fits best.
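The linear fit can be reproduced from the table with ordinary least
squares; for 10 points the closed-form sums are enough, no numeric
library needed:

```python
# Data from the results table above.
widths = [8, 16, 32, 64, 128, 256, 384, 512, 768, 1024]
costs  = [0.30, 0.24, 0.77, 0.72, 1.03, 1.62, 2.90, 3.21, 4.12, 5.56]

n = len(widths)
sw, sy = sum(widths), sum(costs)
sww = sum(w * w for w in widths)
swy = sum(w * y for w, y in zip(widths, costs))

# Ordinary least-squares slope and intercept.
slope = (n * swy - sw * sy) / (n * sww - sw * sw)
intercept = (sy - slope * sw) / n

# Coefficient of determination R^2 = 1 - SS_res / SS_tot.
ss_res = sum((y - (intercept + slope * w)) ** 2
             for w, y in zip(widths, costs))
mean_y = sy / n
ss_tot = sum((y - mean_y) ** 2 for y in costs)
r2 = 1 - ss_res / ss_tot

print(f"cost(w) = {intercept:.2f} + {slope:.4f} * w, R^2 = {r2:.3f}")
```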
One notable finding: at the proposed reference width of 100 bytes,
the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of
which 0.42 is fixed.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.
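The damping arithmetic is easy to check against the fitted model
(coefficients taken from the fit above):

```python
def cost(w, fixed=0.42, per_byte=0.0051):
    """Fitted Gather cost model, us/tuple at width w bytes."""
    return fixed + per_byte * w

# A 2x width-estimate error at the 100-byte reference point inflates
# the total cost by only ~1.55x, because the 0.42 us fixed component
# does not scale with width.
ratio = cost(200) / cost(100)
```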
The script used to get the timings is attached.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
| Attachment | Content-Type | Size |
|---|---|---|
| ptc_calibrate.sh | application/x-shellscript | 5.2 KB |