postgres_fdw: using TABLESAMPLE to collect remote sample

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: postgres_fdw: using TABLESAMPLE to collect remote sample
Date: 2022-02-11 17:02:37
Message-ID: 151e835f-55d6-ddbc-b5b4-07ee606aba44@enterprisedb.com
Lists: pgsql-hackers

Hi,

here's a small patch modifying postgres_fdw to use TABLESAMPLE to
collect a sample when running ANALYZE on a foreign table. Currently the
sampling happens locally, i.e. we fetch all data from the remote server
and then pick rows. But that's hugely expensive for large relations
and/or with limited network bandwidth, of course.

Alexander Lepikhov mentioned this in [1], but it was actually proposed
by Stephen back in 2020 [2], though no patch ever materialized.

So here we go. The patch does a very simple thing - it uses TABLESAMPLE
to collect/transfer just a small sample from the remote node, saving
both CPU and network.

And indeed, that helps quite a bit:

---------------------------------------------------------------------
create table t (a int);
Time: 10.223 ms

insert into t select i from generate_series(1,10000000) s(i);
Time: 552.207 ms

analyze t;
Time: 310.445 ms

CREATE FOREIGN TABLE ft (a int)
SERVER foreign_server
OPTIONS (schema_name 'public', table_name 't');
Time: 4.838 ms

analyze ft;
WARNING: SQL: DECLARE c1 CURSOR FOR SELECT a FROM public.t TABLESAMPLE
SYSTEM(0.375001)
Time: 44.632 ms

alter foreign table ft options (sample 'false');
Time: 4.821 ms

analyze ft;
WARNING: SQL: DECLARE c1 CURSOR FOR SELECT a FROM public.t
Time: 6690.425 ms (00:06.690)
---------------------------------------------------------------------

6690ms without sampling, and 44ms with sampling - quite an improvement.
Of course, the difference may be much smaller/larger in other cases.

Now, there's a couple of issues ...

Firstly, the FDW API forces a somewhat strange division of work between
the different methods, duplicating some of it (this is already mentioned
in postgresAnalyzeForeignTable). But that's minor and can be fixed.

The other issue is which sampling method to use - we have SYSTEM and
BERNOULLI built in, and tsm_system_rows as an extension (and
tsm_system_time, but that's not very useful here). I guess we'd use one
of the built-in ones, because that'll work on more systems out of the box.
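
For illustration, the remote queries for the candidate methods would
look roughly like this (the percentage here is made up, and SYSTEM_ROWS
requires the tsm_system_rows extension on the remote server):

  SELECT a FROM public.t TABLESAMPLE SYSTEM (0.375);       -- page-level
  SELECT a FROM public.t TABLESAMPLE BERNOULLI (0.375);    -- row-level
  SELECT a FROM public.t TABLESAMPLE SYSTEM_ROWS (30000);  -- exact count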

But that leads to the main issue - determining the fraction of rows to
sample. We know how many rows we want to sample, but we have no idea how
many rows there are in total. We can look at reltuples, but we can't be
sure how accurate / up-to-date that value is.
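
As a rough sketch (not the exact code in the patch), the sampling
percentage could be derived from reltuples like this, assuming the
usual target of 30000 rows for the default statistics target of 100:

  SELECT 100.0 * 30000 / reltuples AS sample_percent
    FROM pg_class
   WHERE relname = 't' AND reltuples > 0;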

The patch just trusts it unless it's obviously bogus (-1, 0, etc.) and
applies some simple sanity checks, but I wonder if we need to do more
(e.g. look at the current relation size and scale reltuples by the
current number of pages relative to relpages).
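
That refinement might look something like this (again, just a sketch of
the idea, not what the patch currently does):

  SELECT reltuples
         * (pg_relation_size(oid) / current_setting('block_size')::int)
         / nullif(relpages, 0) AS estimated_rows
    FROM pg_class
   WHERE relname = 't';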

FWIW this gets a bit more convoluted when analyzing a partitioned table
with foreign partitions, because we only ever look at relation sizes to
determine how many rows to sample from each partition.

regards

[1]
https://www.postgresql.org/message-id/bdb0bea2-a0da-1f1d-5c92-96ff90c198eb%40postgrespro.ru

[2]
https://www.postgresql.org/message-id/20200829162231.GE29590%40tamriel.snowman.net

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
postgres-fdw-analyze-sample-20220211.patch text/x-patch 9.6 KB
