Re: Import Statistics in postgres_fdw before resorting to sampling.

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Import Statistics in postgres_fdw before resorting to sampling.
Date: 2025-11-03 05:26:42
Message-ID: CADkLM=dSJk8h+14Wi7dGGcO1168R_nCGO2EHRTcC6b2Zu0rhpg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> I am a bit uncomfortable regarding the design you are using here,
> where the ImportStatistics callback, if defined, takes priority over
> the existing AnalyzeForeignTable, especially regarding the fact that
> both callbacks attempt to retrieve the same data, except that the
> existing callback has a different idea of the timing to use to
> retrieve reltuples and relpages. The original callback
> AnalyzeForeignTable retrieves immediately the total number of pages
> via SQL, to feed ANALYZE. reltuples is then fed to ANALYZE from a
> callback function that that's defined in AnalyzeForeignTable().
>

They don't try to retrieve the same data. AnalyzeForeignTable tries to
retrieve a table sample which it feeds into the normal ANALYZE process. If
that sample is going to be any good, it has to be a lot of rows, that
that's a lot of network traffic.

ImportStatistics just grabs the results that ANALYZE computed for the
remote table, using a far better sample than we'd want to pull across the
wire.

> My point is: rather than trying to implement a second solution with a
> new callback, shouldn't we try to rework the existing callback so as
> it would fit better with what you are trying to do here: feed data
> that ANALYZE would then be in charge of inserting?

To do that, we would have to somehow generate fake data locally from the
pg_stats data that we did pull over the wire, just to have ANALYZE compute
it back down to the data we already had.

> Relying on
> pg_restore_relation_stats() for the job feels incorrect to me knowing
> that ANALYZE is the code path in charge of updating the stats of a
> relation.

That sounds like an argument for expanding ANALYZE to have syntax that will
digest pre-computed rows, essentially taking over what
pg_restore_relation_stats and pg_restore_attribute_stats already do, and
that idea was dismissed fairly early on in development, though I wasn't
against it at the time.

As it is, those two functions were developed to match what ANALYZE does.
pg_restore_relation_stats even briefly had inplace updates because that's
what ANALYZE did.

> The new callback is a shortcut that bypasses what ANALYZE
> does, so the problem, at least it seems to me, is that we want to
> retrieve all the data in a single step, like your new callback, not in
> two steps, something that only the existing callback allows. Hence,
> wouldn't it be more natural to retrieve the total number of pages and
> reltuples from one callback, meaning that for local relations we
> should delay RelationGetNumberOfBlocks() inside the existing
> "acquire_sample_rows" callback (renaming it would make sense)? This
> requires some redesign of AnalyzeForeignTable and the internals of
> analyze.c, but it would let FDW extensions know about the potential
> efficiency gain.
>

I wanted to make minimal disruption to the existing callbacks, but that may
have been misguided.

One problem I do see, however, is that the queries for fetching relation
(pg_class) stats should never fail and always return one row, even if the
values returned are meaningless. It's the query against pg_stats that lets
us know if 1) the database on the other side is a real-enough postgres and
2) the remote table is itself analyzed. Only once we're happy that we have
good attribute stats should we bother with the relation stats. All of this
logic is specific to postgres, so it was confined to postgres_fdw. I don't
know that other FDWs would be that much different, but minimizing the
generic impact was a goal.

I'll look at this again, but I'm not sure I'm going to come up with much
different.

> There is also a performance concern to be made with the patch, but as
> it's an ANALYZE path that may be acceptable: if we fail the first
> callback, then we may call the second callback.
>

I think the big win is the network traffic savings.

>
> Fujita-san, what do you think?

Very interested to know as well.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tristan Partin 2025-11-03 05:43:58 Re: meson's in-tree libpq header search order vs -Dextra_include_dirs
Previous Message Shubham Khanna 2025-11-03 04:40:38 Re: Add support for specifying tables in pg_createsubscriber.