Re: postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: James Finnerty <jfinnert(at)amazon(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2023-01-06 14:40:36
Message-ID: 4ee816f5-6146-7745-a6b4-f38a1b24105a@enterprisedb.com
Lists: pgsql-hackers

I've pushed the comment/assert cleanup.

Here's a cleaned up version of the relkind check. This is almost
identical to the patch from yesterday (plus the renames and updates of
comments for modified functions).

The one difference is that I realized the relkind check does not
actually say we can't do sampling - it just means we can't use
TABLESAMPLE to do it. We could still use "random()" ...

Furthermore, I don't think we should silently disable sampling when the
user explicitly requests TABLESAMPLE by specifying bernoulli/system for
the table - IMHO it's less surprising to just fail in that case.

So we now do this:

    if (!can_tablesample && (method == ANALYZE_SAMPLE_AUTO))
        method = ANALYZE_SAMPLE_RANDOM;

Yes, we may still disable sampling when reltuples is -1, 0 or something
like that. But that's a condition that is expected for new relations and
likely to fix itself, which is not the case for relkind.

Of course, all relkinds that don't support TABLESAMPLE currently have a
reltuples value that will disable sampling anyway (e.g. views have -1).
So we won't actually fall back to random() anyway, because we can't
calculate the sample fraction.

That's a bit annoying for foreign tables pointing at a view, which is a
more likely use case than a foreign table pointing at a sequence. And
likely more of an issue, because views may return many rows (while
sequences have only a single row).

But I realized we could actually still do "random()" sampling:

SELECT * FROM t ORDER BY random() LIMIT $X;

where $X is the target number of sample rows for the table. This would
result in plans like the following (given sufficient work_mem values):

                           QUERY PLAN
 -------------------------------------------------------------------
  Limit (actual rows=30000 loops=1)
    ->  Sort (actual rows=30000 loops=1)
          Sort Key: (random())
          Sort Method: top-N heapsort  Memory: 3916kB
          ->  Append (actual rows=1000000 loops=1)

Even with lower work_mem values this would likely be a win, due to
saving on network transfers.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Check-relkind-before-using-TABLESAMPLE-in-postgres_f.patch text/x-patch 7.2 KB
