From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Nico Williams <nico(at)cryptonector(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, Vik Fearing <vik(at)postgresfriends(dot)org>
Subject: Re: Should we optimize the `ORDER BY random() LIMIT x` case?
Date: 2025-05-19 10:25:00
Message-ID: CAJ7c6TMaNgQeYVS0G7fPCQLOSs4xCiHuZ+sGXbRJihTzCcvvaw@mail.gmail.com
Lists: pgsql-hackers
Tom, Nico, Vik,
> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
> of the tablesample methods) to allow it to work on a subquery.
Currently our BERNOULLI and SYSTEM sampling methods don't allow
specifying the maximum number of rows that should be returned, only
the percentage. So it's going to do the same as:
```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() < 0.99
) AS r ON TRUE
```
Just to clarify, do you propose to add a RESERVOIR sampling method as well?
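For illustration, I would imagine the usage to look something like this
(RESERVOIR is a hypothetical method name here, and its argument would be
a row count rather than a percentage):
```
-- Hypothetical syntax, no such sampling method exists today:
-- return at most 10 rows chosen uniformly from the whole table.
SELECT * FROM mytable TABLESAMPLE RESERVOIR (10);
```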
> I assume it gathers the whole result and _then_ orders by random(), yes?
Sure.
> I think the obvious thing to do is to have the optimizer recognize
> `ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
> underlying query produces rows. The engine needs to keep track of how
> many rows it's seen so far so that with each new row it can use
> `random() < 1/n` to decide whether to keep the row.
I agree this would be most convenient for the user. Unfortunately this
would require us to check every SELECT query: "isn't it by any chance
ORDER BY random() LIMIT x?". I don't think we can afford such a
performance degradation, even a small one, for an arguably rare case.
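For reference, the algorithm itself is simple enough to sketch in
PL/pgSQL (a toy version of Algorithm R; `mytable`, the `id` column and
the function name are made up for the example, and a real implementation
would of course live in the executor, not in a stored procedure):
```
-- Toy reservoir sampling (Algorithm R) over a query's rows:
-- keeps a uniform sample of k rows while scanning the input once.
CREATE FUNCTION reservoir_sample_ids(k int) RETURNS int[] AS $$
DECLARE
    reservoir int[] := '{}';
    n bigint := 0;
    rec record;
    j int;
BEGIN
    FOR rec IN SELECT id FROM mytable LOOP
        n := n + 1;
        IF n <= k THEN
            -- fill the reservoir with the first k rows
            reservoir := reservoir || rec.id;
        ELSE
            -- keep the new row with probability k/n,
            -- replacing a uniformly chosen slot
            j := 1 + floor(random() * n)::int;
            IF j <= k THEN
                reservoir[j] := rec.id;
            END IF;
        END IF;
    END LOOP;
    RETURN reservoir;
END;
$$ LANGUAGE plpgsql;
```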
> > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
> > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
> > suboptimal even if we supported global temporary tables.
>
> Speaking of which, what would it take to have global temp tables?
Considering that several people (me included [1]) have tried to address
this issue in several ways for at least the past decade without success,
I would say this is a fairly difficult problem. I remember seeing a
global temp tables patch on the commitfests maybe a year or two ago, but
it's not there anymore. You can check the past commitfests to find out
how it ended up.
> TABLESAMPLE is hitched to a <table primary> which can be basically
> anything resembling a relation. So it appears the standard already
> allows this and we just need to implement it.
Vik, many thanks for sharing this. I don't have a strong opinion on
`FETCH SAMPLE FIRST 10 ROWS ONLY`, but since TABLESAMPLE should / may
work for subqueries anyway, we could focus on that for now.
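Just to make it concrete, under that reading of the standard something
like this would become legal. PostgreSQL currently accepts TABLESAMPLE
only after a plain table or materialized view, so this is merely a
sketch of the target syntax:
```
-- Not valid in PostgreSQL today: TABLESAMPLE applied to a derived
-- table, which the standard's <table primary> appears to allow.
SELECT *
FROM (SELECT a, b FROM mytable WHERE a > 0) AS sub
     TABLESAMPLE BERNOULLI (20);
```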
***
As a separate observation, I noticed that we could do something like this:
```
-- imagine replacing inefficient array_sample(array_agg(t), 10)
-- with more efficient array_sample_reservoir(t, 10)
SELECT (unnest(agg)).* AS k FROM
  ( SELECT array_sample(array_agg(t), 10) AS agg FROM (
        ... here goes the subquery ...
    ) AS t
  );
```
... if only we supported such a column expansion for unregistered
record types. Currently such a query fails with:
```
ERROR: record type has not been registered
```
It has all the needed records, it just fails to display them as
separate columns. Note that in this case I don't care if the column
names end up being something like '???' or 'unnamed1'. I can imagine how
this could be convenient in more cases than just this one.
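For comparison, if I'm not mistaken, essentially the same shape already
works when the aggregated value has a registered composite type, i.e.
when it is aggregated directly from a table rather than from a subquery
(`mytable` is just a placeholder here):
```
-- Works, because "t" has the registered row type of mytable; the error
-- above appears only when "t" is an anonymous record from a subquery.
SELECT (unnest(agg)).* FROM
  ( SELECT array_sample(array_agg(t), 10) AS agg FROM mytable AS t ) AS s;
```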
Any thoughts on whether we should support this?
[1]: https://www.postgresql.org/message-id/flat/20160729141552.4062e19b%40fujitsu
--
Best regards,
Aleksander Alekseev