Re: pgsql_fdw, FDW for PostgreSQL server

From: Shigeru HANADA <shigeru(dot)hanada(at)gmail(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Subject: Re: pgsql_fdw, FDW for PostgreSQL server
Date: 2012-03-28 12:25:45
Message-ID: 4F730349.6040000@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(2012/03/28 16:18), Albe Laurenz wrote:
> I wrote:
>>> How about getting # of rows estimate by executing EXPLAIN for
>>> fully-fledged remote query (IOW, contains pushed-down WHERE clause),
> and
>>> estimate selectivity of local filter on the basis of the statistics
>>> which are generated by FDW via do_analyze_rel() and FDW-specific
>>> sampling function? In this design, we would be able to use quite
>>> correct rows estimate because we can consider filtering stuffs done
> on
>>> each side separately, though it requires expensive remote EXPLAIN for
>>> each possible path.
>>
>> That sounds nice.
>
> ... but it still suffers from the problems of local statistics
> for remote tables I pointed out.

I guess that you mean about these issues you wrote in earlier post, so
I'd like to comment on them.

> - You have an additional maintenance task if you want to keep
> statistics for remote tables accurate. I understand that this may
> get better in a future release.

I'm not sure that's what you meant, but we need to execute remote
ANALYZE before calling pgsql_fdw_analyze() to keep local statistics
accurate. IMO DBAs are responsible to execute remote ANALYZE at
appropriate timing, so pgsql_fdw_analyze (or handler function for
ANALYZE) should just collect statistics from remote side.

> - You depend on the structure of pg_statistic, which means a potential
> incompatibility between server versions. You can add cases to
> pgsql_fdw_analyze to cater for changes, but that is cumbersome and
> will
> only help for later PostgreSQL versions connecting to earlier ones.

Indeed. Like pg_dump, pgsql_fdw should aware of different server
version if we choose copying statistics. Difference of catalog
structure is very easy to track and cope with, but if meanings of values
or the way to calculate statistics are changed, pgsql_fdw would need
very complex codes to convert values from different version.
I don't know such example, but IMO we should assume that statistics are
valid for only same version (at least major version). After all, I'd
prefer collecting sample data by pgsql_fdw and leaving statistics
generation to local backend.

> - Planning and execution will change (improve, of course) between server
> versions. The local planner may choose an inferior plan based on a
> wrong assumption of how a certain query can be handled on the remote.

Hm, I don't worry about detail of remote planning so much, because
remote server would do its best for a query given by pgsql_fdw. Also
local planner would do its best for given estimation (rows, width and
costs). One concern is that remote cost factors might be different from
local's, so FDW option which represents cost conversion coefficient (1.0
means that remote cost has same weight as local) might be useful.

> - You have no statistics if the foreign table points to a view on the
> remote system.

ISTM that this would be enough reason to give up copying remote stats to
local. We don't provide SELECT push-down nor GROUP BY push-down at
present, so users would want to create views which contain function call
in SELECT clauses.

> I think that these shortcomings are not justified by the gain of
> one client-server round trip less during planning. I'd prefer
> if pgsql_fdw were not dependent on remote statistics stored in the
> local database.

I too prefer if pgsql_fdw doesn't fully depend on statistics of foreign
data, but IMO having statistics of foreign data which were calculated in
the way same as local data seems still useful for estimation about local
filtering. Even if we have no statistics of foreign data on local side,
still we would be able to create plans on the basis of default
selectivity for each expression, as same as regular tables.

Regards,
--
Shigeru HANADA

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-28 12:40:15 Re: PATCH: pg_basebackup (missing exit on error)
Previous Message Robert Haas 2012-03-28 12:19:37 Re: [COMMITTERS] pgsql: pg_test_timing utility, to measure clock monotonicity and timing