Proposal to introduce a shuffle function to intarray extension

From: Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Proposal to introduce a shuffle function to intarray extension
Date: 2022-07-15 08:36:06
Message-ID: 9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Dear list,

i am dealing with an application that processes fairly large arrays of
integers. It makes heavy use of the intarray extension, which works
great in most cases. However, there are two requirements that cannot be
addressed by the extension and are rather slow with plain SQL. Both can
be met with shuffling:

- Taking n random members from an integer array
- Splitting an array into n chunks, where each member is assigned to a
random chunk

Shuffling is currently implemented by unnesting the array, ordering the
members by random() and aggregating them again.

create table numbers (arr int[]);

insert into numbers (arr)
select array_agg(i)
from generate_series(1, 4000000) i;

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select array_agg(n order by random()) arr
from (
select unnest(arr) n from numbers
) plain
) shuffled;

---------------------------------------------------------
{2717290,3093757,2426384} ... {3011871,1402540,1613647}

Time: 2348.961 ms (00:02.349)

I wrote a small extension (see source code below) to see how much we can
gain, when the shuffling is implemented in C and the results speak for
themselves:

select arr[1:3]::text || ' ... ' || arr[3999998:4000000]::text
from (
select shuffle(arr) arr from numbers
) shuffled;

----------------------------------------------------
{1313971,3593627,86630} ... {50764,430410,3901128}

Time: 132.151 ms

I would like to see a function like this inside the intarray extension.
Is there any way to get to this point? How is the process to deal with
such proposals?

Best regards,
Martin Kalcher

Source code of extension mentioned above:

#include "postgres.h"
#include "port.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(shuffle);

void _shuffle(int32 *a, int len);

Datum
shuffle(PG_FUNCTION_ARGS)
{
ArrayType *a = PG_GETARG_ARRAYTYPE_P_COPY(0);

int len;

if (array_contains_nulls(a))
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("array must not contain nulls")));

len = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));

if (len > 1)
_shuffle((int32 *) ARR_DATA_PTR(a), len);

PG_RETURN_POINTER(a);
}

void
_shuffle(int32 *a, int len) {
int i, j;
int32 tmp;

for (i = len - 1; i > 0; i--) {
j = random() % (i + 1);
tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
}

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Stephen Frost 2022-07-15 17:33:20 Re: About revoking large number of privileges; And the PUBLIC role.
Previous Message Aleš Zelený 2022-07-14 21:31:19 Re: PostgreSQL 14.4 ERROR: out of memory issues

Browse pgsql-hackers by date

  From Date Subject
Next Message Wenjing Zeng 2022-07-15 08:37:16 Re: [Commitfest 2022-07] Begins Now
Previous Message Alvaro Herrera 2022-07-15 08:12:12 Re: Allowing REINDEX to have an optional name