Re: Import Statistics in postgres_fdw before resorting to sampling.

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Import Statistics in postgres_fdw before resorting to sampling.
Date: 2025-11-21 18:35:17
Message-ID: CAPmGK15Lk0ynP_b46sgfV1pmByzWUKVQoZBSV8fbN0rKb_6AnA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Corey,

This is an important feature. Thanks for working on it.

On Mon, Nov 3, 2025 at 2:26 PM Corey Huinker <corey(dot)huinker(at)gmail(dot)com> wrote:

>> My point is: rather than trying to implement a second solution with a
>> new callback, shouldn't we try to rework the existing callback so as
>> it would fit better with what you are trying to do here: feed data
>> that ANALYZE would then be in charge of inserting?

> To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.

>> Relying on
>> pg_restore_relation_stats() for the job feels incorrect to me knowing
>> that ANALYZE is the code path in charge of updating the stats of a
>> relation.

> That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.
>
> As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had inplace updates because that's what ANALYZE did.

To me it seems like a good idea to rely on pg_restore_relation_stats
and pg_restore_attribute_stats, rather than doing some rework on
analyze.c; IMO I don't think it's a good idea to do such a thing for
something rather special like this.

>> The new callback is a shortcut that bypasses what ANALYZE
>> does, so the problem, at least it seems to me, is that we want to
>> retrieve all the data in a single step, like your new callback, not in
>> two steps, something that only the existing callback allows. Hence,
>> wouldn't it be more natural to retrieve the total number of pages and
>> reltuples from one callback, meaning that for local relations we
>> should delay RelationGetNumberOfBlocks() inside the existing
>> "acquire_sample_rows" callback (renaming it would make sense)? This
>> requires some redesign of AnalyzeForeignTable and the internals of
>> analyze.c, but it would let FDW extensions know about the potential
>> efficiency gain.

> I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.

+1 for the minimal disruption.

Other initial comments:

The commit message says:

This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.

If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.

If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.

I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.

On the other hand:

This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.

I'm not sure the waste effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.

That's it for now.

My apologies for the delayed response.

Best regards,
Etsuro Fujita

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2025-11-21 18:49:25 Re: Adding basic NUMA awareness
Previous Message Jacob Champion 2025-11-21 18:21:41 Re: RFC 9266: Channel Bindings for TLS 1.3 support