Re: postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: James Finnerty <jfinnert(at)amazon(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2023-01-07 14:07:58
Message-ID: 560c80fc-823e-418d-280e-e0c7c0dad59b@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/6/23 23:41, Tomas Vondra wrote:
> On 1/6/23 17:58, Tom Lane wrote:
>> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
>>> The one difference is that I realized the relkind check does not
>>> actually say we can't do sampling - it just means we can't use
>>> TABLESAMPLE to do it. We could still use "random()" ...
>>
>>> Furthermore, I don't think we should silently disable sampling when the
>>> user explicitly requests TABLESAMPLE by specifying bernoulli/system for
>>> the table - IMHO it's less surprising to just fail in that case.
>>
>> Agreed on both points. This patch looks good to me.
>>
>
> Good, I'll get this committed.The "ORDER BY random()" idea is a possible
> improvement, can be discussed on it's own.
>

Pushed.

>>> Of course, all relkinds that don't support TABLESAMPLE currently have
>>> reltuples value that will disable sampling anyway (e.g. views have -1).
>>> So we won't actually fallback to random() anyway, because we can't
>>> calculate the sample fraction.
>>> That's a bit annoying for foreign tables pointing at a view, which is a
>>> more likely use case than table pointing at a sequence.
>>
>> Right, that's a case worth being concerned about.
>>
>>> But I realized we could actually still do "random()" sampling:
>>> SELECT * FROM t ORDER BY random() LIMIT $X;
>>
>> Hmm, interesting idea, but it would totally bollix our correlation
>> estimates. Not sure that those are worth anything for remote views,
>> but still...
>
> But isn't that an issue that we already have? I mean, if we do ANALYZE
> on a foreign table pointing to a view, we fetch all the results. But if
> the view does not have a well-defined ORDER BY, a trivial plan change
> may change the order of results and thus the correlation.
>
> Actually, how is a correlation even defined for a view?
>
> It's true this "ORDER BY random()" thing would make it less stable, as
> it would change the correlation on every run. Although - the calculated
> correlation is actually quite stable, because it's guaranteed to be
> pretty close to 0 because we make the order random.
>
> Maybe that's actually better than the current state where it depends on
> the plan? Why not to look at the relkind and just set correlation to 0.0
> in such cases?
>
> But if we want to restore that current behavior (where it depends on the
> actual query plan), we could do something like this:
>
> SELECT * FROM the_remote_view ORDER BY row_number() over ();
>
> But yeah, this makes the remote sampling more expensive. Probably still
> a win because of network costs, but not great.
>

I've been thinking about this a bit more, and I'll consider submitting a
patch for the next CF. IMHO it's probably better to accept correlation
being 0 in these cases - it's more representative of what we know about
the view / plan output (or rather lack of knowledge).

However, maybe views are not the best / most common example to think
about. I'd imagine it's much more common to reference a regular table,
but the table gets truncated / populated quickly, and/or the autovacuum
workers are busy so it takes time to update reltuples. But in this case
it's also quite simple to fix the correlation by just ordering by ctid
(which I guess we might do based on the relkind).

There's a minor issue with partitioned tables, with foreign partitions
pointing to remote views. This is kinda broken, because the sample size
for individual relations is determined based on relpages. But that's 0
for views, so these partitions get ignored when building the sample. But
that's a pre-existing issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-01-07 15:20:10 Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Previous Message Dean Rasheed 2023-01-07 12:54:50 Re: MERGE ... WHEN NOT MATCHED BY SOURCE