Re: [PATCH] Introduce array_shuffle() and array_sample()

From: Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()
Date: 2022-07-24 19:29:39
Message-ID: f5af7698-d1eb-bdf4-9969-82b18ab4f8cf@aboutsource.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Am 24.07.22 um 10:15 schrieb Fabien COELHO:
>
> My 0,02€ about this patch:

Thank you for your feedback. I attached a patch, that addresses most of
your points.

> Use (void) when declaring no parameters in headers or in functions.

Done.

> Should the exchange be skipped when i == k?

The additional branch is actually slower (on my machine, tested with an
10M element int array) for 1D-arrays, which i assume is the most common
case. Even with a 2D-array with a subarray size of 1M there is no
difference in execution time. i == k should be relatively rare for large
arrays, given a good prng.

> I do not see the point of having *only* inline functions in a c file.
> The external functions should not be inlined?

Done.

> The random and array shuffling functions share a common state. I'm
> wondering whether it should ot should not be so. It seems ok, but then
> ISTM that the documentation suggests implicitely that setseed applies to
> random() only, which is not the case anymore after the patch.

I really like the idea of a prng state owned by the user, that is used
by all user facing random functions, but see that their might be
concerns about this change. I would update the setseed() documentation,
if this proposal is accepted. Do you think my patch should already
include the documentation update?

> If more samples are required than the number of elements, it does not
> error out. I'm wondering whether it should.

The second argument to array_sample() is treated like a LIMIT, rather
than the actual number of elements. I updated the documentation. My
use-case is: take max random items from an array of unknown size.

> Also, the sampling should not return its result in order when the number
> of elements required is the full array, ISTM that it should be shuffled
> there as well.

You are the second person asking for this. It's done. I am thinking
about ditching array_sample() and replace it with a second signature for
array_shuffle():

array_shuffle(array anyarray) -> anyarray
array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

> I must say that without significant knowledge of the array internal
> implementation, the swap code looks pretty strange. ISTM that the code
> would be clearer if pointers and array references style were not
> intermixed.

Done. Went with pointers.

> Maybe you could add a test with a 3D array? Some sample with NULLs?

Done.

> Unrelated: I notice again that (postgre)SQL does not provide a way to
> generate random integers. I do not see why not. Should we provide one?

Maybe. I think it is outside the scope of this patch.

Thank you, Martin

Attachment Content-Type Size
0001-Introduce-array_shuffle-and-array_sample.patch text/x-patch 16.7 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Fabien COELHO 2022-07-24 19:42:48 Re: [PATCH] Introduce array_shuffle() and array_sample()
Previous Message xavier 2022-07-24 16:50:02 Re: Queries in another user's tables (SOLVED)

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2022-07-24 19:42:48 Re: [PATCH] Introduce array_shuffle() and array_sample()
Previous Message Noah Misch 2022-07-24 19:10:17 Re: fairywren hung in pg_basebackup tests