Re: tuple radix sort

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: cca5507 <cca5507(at)qq(dot)com>
Cc: zengman <zengman(at)halodbtech(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: tuple radix sort
Date: 2026-03-30 05:57:52
Message-ID: CANWCAZZuQPMAPoCUQo7E2v2Zk5z9T=8bE=VGpC8_TYG3aomGyQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Next step : Test whether it's worth it for the common prefix detection
> to use the normalized datum.

This turned out to be a loser, but in the course of trying it a better
idea occurred to me. v8's prefix detection was really a special-case
optimization where the sort key is all non-negative integers (or all
negative, but that's not common). It's wasted work when the input is
mixed in sign, and for abbreviated keys. It's not much of a waste, but
we can do better.

v9 computes the common prefix during every recursion at the same time
we populate the SortTuple's current byte. That should be practically
free given a modest amount of instruction-level parallelism.

As a demo, I slightly modified the "p5" test (an adversarial case for
radix sort) from

https://www.postgresql.org/message-id/CANWCAZb0STJRAnn8fcVccQmektizNkgqBD_cg%2B0cs1%3Db4WWajQ%40mail.gmail.com

...so that all bytes differ only in the least significant byte or in
the sign bit.

select setseed(0.935349);
with r (rand, i) as (select random(-100, 100), i
from generate_series(1,1_000_000,1) i )
insert into tsort
select case
when r.rand < -95 or r.rand > 95
then r.rand
else 0 end
from r;

vacuum freeze tsort ;

cat bench.sql
select * from tsort order by i offset 100000000;

pgbench -n -t 500 -f bench.sql | grep latency
(on pre-warmed buffers with work_mem 64MB)

master:
latency average = 99.663 ms
v8:
latency average = 102.040 ms
v9:
latency average = 84.856 ms

Here we can see that common prefix detection is useless in v8. It adds
work, but it's not worse than noise in this benchmark, so could be
disregarded. In v9 we proceed directly to performing a radix sort pass
on the most significant byte, which effectively partitions into signed
and unsigned. The next recursion step detects that all bytes are the
same and simultaneously calculates how far to skip so that the next
recursion is productive.

For the case of all non-negative integers:

select setseed(0.935349);
insert into tsort
select random(0, 100)
from generate_series(1,1_000_000,1) i;
vacuum freeze tsort ;

...in v9 the first pass computes the current byte from the normalized
datum and stores it, but this pass is unproductive. There was a risk
that this is more expensive than v8's up-front common prefix detection
because of memory writes. I only see noise-level differences:

master:
latency average = 98.725 ms
v8:
latency average = 75.572 ms
v9:
latency average = 76.635 ms

I intend to commit v9-0001 this week.

I've decided not to commit v8-0003 (message changes) from my last
message since it doesn't really add any value IMO. That can always be
revisited. I'm on the fence about v8-0004 (assert the correct order
also for qsort), but it seems good for consistency. I'd like to at
least continue asserting this for radix sort since it's new this
cycle. To save CI cycles etc, maybe we can eventually hide the check
behind a debug symbol.

--
John Naylor
Amazon Web Services

Attachment Content-Type Size
v9-0001-Detect-common-prefixes-during-radix-sort.patch text/x-patch 3.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2026-03-30 05:58:15 Re: Skipping schema changes in publication
Previous Message Zhijie Hou (Fujitsu) 2026-03-30 05:51:23 RE: Skipping schema changes in publication