Re: postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2022-07-16 21:57:44
Message-ID: 3998908.1658008664@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-02-23 00:51:24 +0100, Tomas Vondra wrote:
>> And here's the slightly simplified patch, without the part adding the
>> unnecessary GetServerVersion() function.

> Doesn't apply anymore: http://cfbot.cputube.org/patch_37_3535.log
> Marked as waiting-on-author.

Here's a rebased version that should at least pass regression tests.

I've not reviewed it in any detail, but:

* I'm not really on board with defaulting to SYSTEM sample method,
and definitely not on board with not allowing any other choice.
We don't know enough about the situation in a remote table to be
confident that potentially-nonrandom sampling is OK. So personally
I'd default to BERNOULLI, which is more reliable and seems plenty fast
enough given your upthread results. It could be an idea to extend the
sample option to be like "sample [ = methodname ]", if you want more
flexibility, but I'd be happy leaving that for another time.

* The code depending on reltuples is broken in recent server versions,
and might give useless results in older ones too (if reltuples =
relpages = 0). Ideally we'd retrieve all of reltuples, relpages, and
pg_relation_size(rel), and do the same calculation the planner does.
Not sure if pg_relation_size() exists far enough back though.

* Copying-and-pasting all of deparseAnalyzeSql (twice!) seems pretty
bletcherous. Why not call that and then add a WHERE clause to its
result, or just add some parameters to it so it can do that itself?

* More attention to updating relevant comments would be appropriate,
eg here you've not bothered to fix the adjacent falsified comment:

/* We've retrieved all living tuples from foreign server. */
- *totalrows = astate.samplerows;
+ if (do_sample)
+ *totalrows = reltuples;
+ else
+ *totalrows = astate.samplerows;

* Needs docs obviously. I'm not sure if the existing regression
testing covers the new code adequately or if we need more cases.

Having said that much, I'm going to leave it in Waiting on Author
state.

regards, tom lane

Attachment Content-Type Size
0001-postgres_fdw-sample-data-on-remote-node-for-20220716.patch text/x-diff 13.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2022-07-16 23:50:28 Re: Move Section 9.27.7 (Data Object Management Functions) to System Information Chapter
Previous Message Thomas Munro 2022-07-16 21:56:27 Re: Proposal to introduce a shuffle function to intarray extension