From: | Nico Williams <nico(at)cryptonector(dot)com> |
---|---|
To: | Aleksander Alekseev <aleksander(at)timescale(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Should we optimize the `ORDER BY random() LIMIT x` case? |
Date: | 2025-05-16 22:06:49 |
Message-ID: | aCe2+fMDvTzZ0zTI@ubby |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, May 15, 2025 at 02:41:15AM +0300, Aleksander Alekseev wrote:
> SELECT t.ts, t.city, t.temperature, h.humidity
> FROM temperature AS t
> LEFT JOIN LATERAL
> ( SELECT * FROM humidity
> WHERE city = t.city AND ts <= t.ts
> ORDER BY ts DESC LIMIT 1
> ) AS h ON TRUE
> WHERE t.ts < '2022-01-05';
> ```
>
> One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
> this produces an inefficient plan. Alternatively one could create
I assume it gathers the whole result and _then_ orders by random(), yes?
I think the obvious thing to do is to have the optimizer recognize
`ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
underlying query produces rows. The engine needs to keep track of how
many rows it's seen so far so that with each new row it can use
`random() < 1/n` to decide whether to keep the row.
> temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
> TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
> suboptimal even if we supported global temporary tables.
Speaking of which, what would it take to have global temp tables?
> 1. Do you think there might be value in addressing this issue?
Yes! I think `ORDER BY random() LIMIT x` is succinct syntax, and
semantically it should yield as uniformly distributed a result set as
possible even when the size of the underlying query is not knowable a
priori. And it should be as optimal as possible considering that the
underlying query's result set is in principle -and many useful real use
cases- unbounded.
And reservoir sampling is a suitable existing technique to get an
optimal and uniform random finite sampling of an indefinite size row
set.
> 2. If yes, how would you suggest addressing it from the UI point of
> view - by adding a special syntax, some sort of aggregate function, or
> ...?
`ORDER BY random() LIMIT x` is a) in the standard, b) semantically what
you're asking for, c) succint syntax, d) implementable optimally,
therefore IMO `ORDER BY random() LIMIT x` is the correct syntax.
Nico
--
From | Date | Subject | |
---|---|---|---|
Next Message | Regina Obe | 2025-05-16 22:18:52 | pg_upgrade ability to create extension from scripts |
Previous Message | Paul A Jungwirth | 2025-05-16 21:57:07 | Foreign key isolation tests |