Quick Links

Re: tuple radix sort

From:	John Naylor <johncnaylorls(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Peter Geoghegan <pg(at)bowt(dot)ie>
Subject:	Re: tuple radix sort
Date:	2025-11-04 11:27:08
Message-ID:	CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I wrote:

> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum)

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements: 100 qsort: 34.8 radix: 38.3
number of elements: 200 qsort: 34.9 radix: 29.7
number of elements: 400 qsort: 40.8 radix: 29.2

cardinality: 16
number of elements: 100 qsort: 19.3 radix: 26.2
number of elements: 200 qsort: 18.9 radix: 18.2
number of elements: 400 qsort: 18.5 radix: 14.5

cardinality: 2
number of elements: 100 qsort: 09.3 radix: 21.6
number of elements: 200 qsort: 08.9 radix: 15.4
number of elements: 400 qsort: 10.3 radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turing
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms

on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms

on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling, reduces the threshold to 200.

[1] https://www.postgresql.org/message-id/flat/CAApHDvqkHZsT2gaAWFM7D%3D7qyQ%3DeKXQvvn%2BBkwCn4Rvj1w4EKQ%40mail.gmail.com#1c67321e09be15e031d3995d45a1b8fd

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services

Attachment	Content-Type	Size
bench_cartesiansort.sh	application/x-sh	1.2 KB
v3-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch	application/x-patch	16.6 KB

In response to

Re: tuple radix sort at 2025-11-03 13:24:35 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	jian he	2025-11-04 11:43:49	Re: COPY WHERE clause generated/system column reference
Previous Message	Jim Jones	2025-11-04 11:25:05	Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement