Re: postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: James Finnerty <jfinnert(at)amazon(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2023-01-06 22:41:48
Message-ID: 4daee204-341f-d413-fa93-29f603f61da8@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/6/23 17:58, Tom Lane wrote:
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
>> The one difference is that I realized the relkind check does not
>> actually say we can't do sampling - it just means we can't use
>> TABLESAMPLE to do it. We could still use "random()" ...
>
>> Furthermore, I don't think we should silently disable sampling when the
>> user explicitly requests TABLESAMPLE by specifying bernoulli/system for
>> the table - IMHO it's less surprising to just fail in that case.
>
> Agreed on both points. This patch looks good to me.
>

Good, I'll get this committed.The "ORDER BY random()" idea is a possible
improvement, can be discussed on it's own.

>> Of course, all relkinds that don't support TABLESAMPLE currently have
>> reltuples value that will disable sampling anyway (e.g. views have -1).
>> So we won't actually fallback to random() anyway, because we can't
>> calculate the sample fraction.
>> That's a bit annoying for foreign tables pointing at a view, which is a
>> more likely use case than table pointing at a sequence.
>
> Right, that's a case worth being concerned about.
>
>> But I realized we could actually still do "random()" sampling:
>> SELECT * FROM t ORDER BY random() LIMIT $X;
>
> Hmm, interesting idea, but it would totally bollix our correlation
> estimates. Not sure that those are worth anything for remote views,
> but still...

But isn't that an issue that we already have? I mean, if we do ANALYZE
on a foreign table pointing to a view, we fetch all the results. But if
the view does not have a well-defined ORDER BY, a trivial plan change
may change the order of results and thus the correlation.

Actually, how is a correlation even defined for a view?

It's true this "ORDER BY random()" thing would make it less stable, as
it would change the correlation on every run. Although - the calculated
correlation is actually quite stable, because it's guaranteed to be
pretty close to 0 because we make the order random.

Maybe that's actually better than the current state where it depends on
the plan? Why not to look at the relkind and just set correlation to 0.0
in such cases?

But if we want to restore that current behavior (where it depends on the
actual query plan), we could do something like this:

SELECT * FROM the_remote_view ORDER BY row_number() over ();

But yeah, this makes the remote sampling more expensive. Probably still
a win because of network costs, but not great.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-01-06 23:07:04 Re: New strategies for freezing, advancing relfrozenxid early
Previous Message Tom Lane 2023-01-06 22:31:26 Re: wake up logical workers after ALTER SUBSCRIPTION