| From: | John Naylor <johncnaylorls(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | Peter Geoghegan <pg(at)bowt(dot)ie> |
| Subject: | Re: tuple radix sort |
| Date: | 2025-11-04 11:27:08 |
| Message-ID: | CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
I wrote:
> I've set the threshold to 400 for now, but I'm not claiming that's the
> end story. In addition to the underestimation mentioned above,
> unrolling the counting step is a factor. Unrolling makes smaller
> inputs worse (which we can reach by recursing from larger inputs), but
> unrolling seems important for large inputs with low cardinality (a few
> percent, but I haven't shared numbers yet). We've found that a large
> input with only 4-5 distinct values just barely wins with radix sort.
> I'll be curious to see if unrolling is actually needed to prevent
> regressions there.
Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum)
Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:
cardinality: 256
number of elements: 100 qsort: 34.8 radix: 38.3
number of elements: 200 qsort: 34.9 radix: 29.7
number of elements: 400 qsort: 40.8 radix: 29.2
cardinality: 16
number of elements: 100 qsort: 19.3 radix: 26.2
number of elements: 200 qsort: 18.9 radix: 18.2
number of elements: 400 qsort: 18.5 radix: 14.5
cardinality: 2
number of elements: 100 qsort: 09.3 radix: 21.6
number of elements: 200 qsort: 08.9 radix: 15.4
number of elements: 400 qsort: 10.3 radix: 14.0
To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turing
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:
V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms
on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms
V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms
on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms
So v3 gets rid of loop unrolling, reduces the threshold to 200.
TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup
--
John Naylor
Amazon Web Services
| Attachment | Content-Type | Size |
|---|---|---|
| bench_cartesiansort.sh | application/x-sh | 1.2 KB |
| v3-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch | application/x-patch | 16.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | jian he | 2025-11-04 11:43:49 | Re: COPY WHERE clause generated/system column reference |
| Previous Message | Jim Jones | 2025-11-04 11:25:05 | Re: [PATCH] Add pg_get_tablespace_ddl() function to reconstruct CREATE TABLESPACE statement |