Re: tuple radix sort

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: tuple radix sort
Date: 2025-11-27 11:45:32
Message-ID: CANWCAZaKRzsYPn9YWRR8g5dkX-ep_jgyUb4r3KFekK5ZTNo7ew@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 27, 2025 at 2:49 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> I did an initial test before, but I didn’t read the code at the time. Today, I spent time reviewing 0001. Overall, I believe the implementation is solid. I just got a few comments/suggestions.

Thanks for looking!

> > On Nov 26, 2025, at 21:11, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:

> > <v5-0001-Use-radix-sort-when-SortTuple-contains-a-pass-by-.patch><v5-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch><v5-0003-WIP-make-some-regression-tests-sort-order-more-de.patch><v5-0002-WIP-Adjust-regression-tests.patch>

> We recompute normalize_datum(tup->datum1, ssup) for every tuple in every level, why don’t cache the result in SortTuple. As we have cached current_byte in SortTuple, it shouldn’t be a big deal to add one more field to it.

Actually it is a big deal, because the memtuples array counts against work_mem.

I saw two ways to store the full normalized without increasing the
size, and I've already rejected them:
- share space with isnull1/srctape -- v3 did this, and I already
explained the reason for changing when I shared v4.
- share space with datum1 -- that would require additional code to
restore the original datum and makes it more difficult to verify
correctness

There is also a proposal upthread to not store anything, and that's
still up in the air.

> 2 - 0001
> ```
> + while (num_remaining > 1)
> + {
> + /* start the count over */
> + num_remaining = num_partitions;
> ```
>
> The swap loop always start the count over, so that sorted partitions are re-scanned as well. I think we can do an optimization like:
>
> ```
> num_active = num_partitions;
> while (num_active > 1)
> {
> for (int i = 0; i < num_active; )
> {
> uint8 idx = remaining_partitions[i];
> // do the swaps for the partition …
>
> if (partitions[idx].offset == partitions[idx].next_offset)
> {
> remaining_partitions[i] = remaining_partitions[num_active - 1];
> num_active--;
> }
> else
> i++;
> }
> }
> ```
>
> This way we move out sorted partitions, so they will not be re-scanned.

I don't think that's going to work without additional bookkeeping, so
I'm not sure it's worth it. See the recursion step.

> 3 - 0001
>
> In sort_byvalue_datum, we can add a fast-path for all NULL and all NOT NULL cases, so that they won’t need to run the branchless cyclic permutation and the two “for” loops of assertions. Something like:

This is too clever and yet doesn't go far enough.

There is already one fast path, which happens to cover the common ASC
+ all NOT NULL case. The right way to skip the permutation step would
be to add a second loop that starts from the end and stops at the
first tuple that needs to swap. That should work not just for all NULL
and all NOT NULL, but also where there is a mix of the two and some
(or all) are already in place. All these cases can be treated the same
and they will continue to the exact same paths.

I haven't yet bothered to try harder, but it may be necessary in order
to avoid introducing regressions, so I'll look into it.

--
John Naylor
Amazon Web Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2025-11-27 12:07:17 Re: Remove unused function parameters, part 1: contrib
Previous Message Soumya S Murali 2025-11-27 11:39:49 Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer