Re: tuple radix sort

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: tuple radix sort
Date: 2025-11-27 07:48:58
Message-ID: F7935986-4E3D-477C-B45C-878420B45C0D@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

I did an initial test before, but I didn’t read the code at the time. Today, I spent time reviewing 0001. Overall, I believe the implementation is solid. I just got a few comments/suggestions.

> On Nov 26, 2025, at 21:11, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
>
> For v5 I've also added CHECK_FOR_INTERRUPTS and rewrote some comments.
>
>
> --
> John Naylor
> Amazon Web Services
> <v5-0001-Use-radix-sort-when-SortTuple-contains-a-pass-by-.patch><v5-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch><v5-0003-WIP-make-some-regression-tests-sort-order-more-de.patch><v5-0002-WIP-Adjust-regression-tests.patch>

1 - 0001
```
+ /* extract the byte for this level from the normalized datum */
+ current_byte = extract_byte(normalize_datum(tup->datum1, ssup),
+ level);
+
+ /* save it for the permutation step */
+ tup->current_byte = current_byte;
```

We recompute normalize_datum(tup->datum1, ssup) for every tuple in every level, why don’t cache the result in SortTuple. As we have cached current_byte in SortTuple, it shouldn’t be a big deal to add one more field to it.

2 - 0001
```
+ while (num_remaining > 1)
+ {
+ /* start the count over */
+ num_remaining = num_partitions;
```

The swap loop always start the count over, so that sorted partitions are re-scanned as well. I think we can do an optimization like:

```
num_active = num_partitions;
while (num_active > 1)
{
for (int i = 0; i < num_active; )
{
uint8 idx = remaining_partitions[i];
// do the swaps for the partition …

if (partitions[idx].offset == partitions[idx].next_offset)
{
remaining_partitions[i] = remaining_partitions[num_active - 1];
num_active--;
}
else
i++;
}
}
```

This way we move out sorted partitions, so they will not be re-scanned.

3 - 0001

In sort_byvalue_datum, we can add a fast-path for all NULL and all NOT NULL cases, so that they won’t need to run the branchless cyclic permutation and the two “for” loops of assertions. Something like:

```
while (d1 < state->memtupcount && data[d1].isnull1 == nulls_first)
d1++;

null_count = d1;
not_null_count = state->memtupcount - d1;

/* fast paths: all on one side */
if (null_count == 0 || not_null_count == 0)
{
if (nulls_first)
{
null_start = data;
not_null_start = data + null_count;
}
else
{
not_null_start = data;
null_start = data + not_null_count;
}

/* only one partition is non-empty; sort it and return */
if (not_null_count > 1)
{
if (not_null_count < QSORT_THRESHOLD)
qsort_tuple(not_null_start, not_null_count, state->base.comparetup, state);
else
radix_sort_tuple(not_null_start, not_null_count, 0, state);
}
else if (null_count > 1 && state->base.onlyKey == NULL)
{
qsort_tuple(null_start, null_count, state->base.comparetup_tiebreak, state);
}
return;
}

/* ... existing branchless cyclic permutation ... */

```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bryan Green 2025-11-27 07:52:33 Re: [Patch] Windows relation extension failure at 2GB and 4GB
Previous Message Michael Paquier 2025-11-27 07:38:16 Re: Move WAL/RMGR sequence code into its own file and header