Re: [PATCH] Introduce array_shuffle() and array_sample()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Andres Freund <andres(at)anarazel(dot)de>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()
Date: 2022-09-29 19:33:26
Message-ID: 344667.1664480006@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net> writes:
> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today. I find this behavior a bit
surprising:

+SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims
+-------------
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+ Returns <parameter>n</parameter> randomly chosen elements from <parameter>array</parameter> in selection order.

What's "selection order"? And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs. Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+ /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped. You could make
it so, and that might be more readable:

j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1);
jelms = elms + j * nelm;
jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+ ArrayType *array = PG_GETARG_ARRAYTYPE_P(0);
+ int n = PG_GETARG_INT32(1);
+ ArrayType *result;
+ ... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls. Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'. If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem? In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Paquier 2022-09-30 00:18:32 Re: Streaming wal from primiry terminating
Previous Message Lahnov, Igor 2022-09-29 13:04:08 RE: Streaming wal from primiry terminating

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-09-29 21:58:59 Re: problems with making relfilenodes 56-bits
Previous Message Tom Lane 2022-09-29 18:39:44 Re: making relfilenodes 56 bits