Re: [PATCH] Introduce array_shuffle() and array_sample()

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Martin Kalcher <martin(dot)kalcher(at)aboutsource(dot)net>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Introduce array_shuffle() and array_sample()
Date: 2022-07-24 19:42:48
Message-ID: alpine.DEB.2.22.394.2207242132460.1359119@pseudo
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers


Hello,

> Thank you for your feedback. I attached a patch, that addresses most of your
> points.

I'll look into it. It would help if the patch could include a version
number at the end.

>> Should the exchange be skipped when i == k?
>
> The additional branch is actually slower (on my machine, tested with an 10M
> element int array) for 1D-arrays, which i assume is the most common case.
> Even with a 2D-array with a subarray size of 1M there is no difference in
> execution time. i == k should be relatively rare for large arrays, given a
> good prng.

Ok, slower is bad.

>> The random and array shuffling functions share a common state. I'm
>> wondering whether it should ot should not be so. It seems ok, but then ISTM
>> that the documentation suggests implicitely that setseed applies to
>> random() only, which is not the case anymore after the patch.
>
> I really like the idea of a prng state owned by the user, that is used by all
> user facing random functions, but see that their might be concerns about this
> change. I would update the setseed() documentation, if this proposal is
> accepted. Do you think my patch should already include the documentation
> update?

Duno. I'm still wondering what it should do. I'm pretty sure that the
documentation should be clear about a shared seed, if any. I do not think
that departing from the standard is a good thing, either.

>> If more samples are required than the number of elements, it does not error
>> out. I'm wondering whether it should.
>
> The second argument to array_sample() is treated like a LIMIT, rather than
> the actual number of elements. I updated the documentation. My use-case is:
> take max random items from an array of unknown size.

Hmmm. ISTM that the use case of wanting "this many" stuff would make more
sense because it is strictier so less likely to result in unforseen
consequences. On principle I do not like not knowing the output size.

If someone wants a limit, they can easily "LEAST(#1 dim size, other
limit)" to get it, it is easy enough with a strict function.

>> Also, the sampling should not return its result in order when the number of
>> elements required is the full array, ISTM that it should be shuffled there
>> as well.
>
> You are the second person asking for this. It's done. I am thinking about
> ditching array_sample() and replace it with a second signature for
> array_shuffle():
>
> array_shuffle(array anyarray) -> anyarray
> array_shuffle(array anyarray, limit int) -> anyarray
>
> What do you think?

I think that shuffle means shuffle, not partial shuffle, so a different
name seems better to me.

--
Fabien.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Taka Taka 2022-07-24 23:56:30 Was my question inappropriate for postgres?
Previous Message Martin Kalcher 2022-07-24 19:29:39 Re: [PATCH] Introduce array_shuffle() and array_sample()

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2022-07-24 20:23:39 Re: psql - factor out echo code
Previous Message Martin Kalcher 2022-07-24 19:29:39 Re: [PATCH] Introduce array_shuffle() and array_sample()