Re: postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2022-07-18 18:45:10
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
> Thanks. I'll switch this to "needs review" now.

OK, I looked through this, and attach some review suggestions in the
form of a delta patch. (0001 below is your two patches merged, 0002
is my delta.) A lot of the delta is comment-smithing, but not all.

After reflection I think that what you've got, ie use reltuples but
don't try to sample if reltuples <= 0, is just fine. The remote
would only have reltuples <= 0 in a never-analyzed table, which
shouldn't be a situation that persists for long unless the table
is tiny. Also, if reltuples is in error, the way to bet is that
it's too small thanks to the table having been enlarged. But
an error in that direction doesn't hurt us: we'll overestimate
the required sample_frac and pull back more data than we need,
but we'll still end up with a valid sample of the right size.
So I doubt it's worth the complication to try to correct based
on relpages etc. (Note that any such correction would almost
certainly end in increasing our estimate of reltuples. But
it's safer to have an underestimate than an overestimate.)

I messed around with the sample_frac choosing logic slightly,
to make it skip pointless calculations if we decide right off
the bat to disable sampling. That way we don't need to worry
about avoiding zero divide, nor do we have to wonder if any
of the later calculations could misbehave.

I left your logic about "disable if saving fewer than 100 rows"
alone, but I have to wonder if using an absolute threshold rather
than a relative one is well-conceived. Sampling at a rate of
99.9 percent seems pretty pointless, but this code is perfectly
capable of accepting that if reltuples is big enough. So
personally I'd do that more like

if (sample_frac > 0.95)

which is simpler and would also eliminate the need for the previous
range-clamp step. I'm not sure what the right cutoff is, but
your "100 tuples" constant is just as arbitrary.

I rearranged the docs patch too. Where you had it, analyze_sampling
was between fdw_startup_cost/fdw_tuple_cost and the following para
discussing them, which didn't seem to me to flow well at all. I ended
up putting analyze_sampling in its own separate list. You could almost
make a case for giving it its own <sect3>, but I concluded that was
probably overkill.

One thing I'm not happy about, but did not touch here, is the expense
of the test cases you added. On my machine, that adds a full 10% to
the already excessively long runtime of postgres_fdw.sql --- and I
do not think it's buying us anything. It is not this module's job
to test whether bernoulli sampling works on partitioned tables.
I think you should just add enough to make sure we exercise the
relevant code paths in postgres_fdw itself.

With these issues addressed, I think this'd be committable.

regards, tom lane

Attachment Content-Type Size
0001-postgres_fdw-sample-data-tomas.patch text/x-diff 22.1 KB
0002-toms-review-delta.patch text/x-diff 11.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2022-07-18 18:48:56 Re: proposal: possibility to read dumped table's name from file
Previous Message Nathan Bossart 2022-07-18 18:30:28 Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages