Quick Links

Re: tuple radix sort

From:	John Naylor <johncnaylorls(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Peter Geoghegan <pg(at)bowt(dot)ie>
Subject:	Re: tuple radix sort
Date:	2025-11-12 13:57:49
Message-ID:	CANWCAZaN8kW6o7Ymc_jzPO_Z5SqekQTTNra3xTGYTH+cjrVp8g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Nov 3, 2025 at 8:24 PM I wrote:

> v1 was careful to restore isnull1 to false when diverting to quicksort
> for the tiebreak. v2 doesn't bother, since the only tiebreak in core
> that looks at isnull1 is comparetup_datum_tiebreak, which is not
> reachable by radix sort, requiring a pass-by-value datum that compares
> like an integer. This is a bit of a risk, since it's possible a third
> party fork could be doing something weird. Seems unlikely, but
> something to keep in mind.

I decided I wasn't quite comfortable with the full normalized datum
sharing space in SortTuple with isnull1. There's too much of a
cognitive burden involved in deciding when we do or don't need to
reset isnull1, and there's a non-zero risk of difficult-to-detect
bugs. For v4 I've instead used one byte of padding space in SortTuple
to store only the byte used for the current pass. That means we must
compute the normalized datum on every pass. That's actually better
than it sounds, since that one byte can now be used directly during
the "deal" step, rather than having to extract the byte from the
normalized datum by shifting and masking. That extraction step might
add significant cycles in cases where a pass requires multiple
iterations through the "deal" loop. It doesn't seem to make much
difference in practice, performance-wise, even with the following
pessimization:

I had to scrap the qsort specialization on the normalized datum for
small sorts, since it's no longer stored. It could still be worth it
to compute the "next byte of the normalized datum" and perform a qsort
on that (falling back to the comparator function in the usual way),
but I haven't felt the need to resort to that yet. For v4, I just
divert to qsort_tuple in non-assert builds, with a threshold of 40.

> - I don't quite like how the NULL partitioning step looks, and it
> could be costly when the distribution of NULL is not predictable, so
> I'm thinking of turning part of that into a branch-free cyclic
> permutation, similar to
> https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

This is done. Even though the inner loop is mysterious at first
glance, it's really quite elegant.

I made an attempt at clean-up, but it's still under-commented. The
common prefix detection has moved to a separate patch (v4-0004).

I've been forcing all eligible sorts to use radix sort in assert
builds, even when small enough that qsort would be faster. Since both
qsort and in-place radix sort are unstable, it's expected that some
regression tests need adjustment (v4-0002). One thing surprised me,
however: The pg_upgrade TAP test that runs regression tests on the old
cluster showed additional failures that I can't explain. I haven't
seen this before, but it's possible I never ran TAP tests when testing
new sort algorithms previously. This doesn't happen if you change the
current insertion sort threshold, so I haven't been able to reproduce
it aside from this patch. For that reason I can't completely rule out
an actual bug, although I actually have more confidence in the
verification of correct sort order in v4, since isnull1 now never
changes, just as in master. I found that changing some tests to have
additional sort keys seems to fix it (v4-0003). I did this in a rather
quick and haphazard fashion. There's probably a longer conversation to
be had about making test output more deterministic while still
covering the intended executor paths.

Aside from that, this seems like a good place to settle down, so I'm
going to create a CF entry for this. I'll start more rigorous
performance testing in the near future.

--
John Naylor
Amazon Web Services

Attachment	Content-Type	Size
v4-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch	text/x-patch	2.5 KB
v4-0002-WIP-Adjust-regression-tests.patch	text/x-patch	49.3 KB
v4-0003-WIP-make-some-regression-tests-sort-order-more-de.patch	text/x-patch	15.2 KB
v4-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch	text/x-patch	13.5 KB

In response to

Re: tuple radix sort at 2025-11-03 13:24:35 from John Naylor

Responses

Re: tuple radix sort at 2025-11-20 10:57:25 from John Naylor
Re: tuple radix sort at 2025-11-20 11:13:08 from Álvaro Herrera
Re: tuple radix sort at 2025-11-23 05:33:44 from Chengpeng Yan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2025-11-12 14:02:42	Re: alignas (C11)
Previous Message	Matheus Alcantara	2025-11-12 13:26:50	Re: Include extension path on pg_available_extensions