| From: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)postgresql(dot)org, jkatz(at)postgresql(dot)org, nathandbossart(at)gmail(dot)com |
| Subject: | Re: Import Statistics in postgres_fdw before resorting to sampling. |
| Date: | 2026-01-15 17:57:38 |
| Message-ID: | CADkLM=d1HH1JnhCmpD+3v+A=JZ4EeQRa86Ygfaucb1zKpS1vWw@mail.gmail.com |
| Lists: | pgsql-hackers |
>
> Instead of fetch_stats, I would prefer has_stats. But if majority is
> on fetch_stats, I am ok with it.
>
I don't think there are any strong feelings about the names of the
parameters, only their meanings and their defaults.
> I have no strong opinion about whether fetch_stats should be ON or off
> by default. In my simple view, the remote relation that the foreign
> table points to is usually a regular table or a partitioned table. So
> default ON makes sense to me. But there's a possibility that in an
> OLAP system most of the foreign objects are views exposing only some
> data and not all. I think defaulting to off is backward compatible and
> requires cautious opt-in.
If the remote object is a postgres view, we're never going to get stats, so
default of fetch_stats = on coupled with automatic failover to regular
sampling would result in some trivial network traffic and some
NOTICE/WARNING log messages until the owner sets fetch_stats = off on the
local table.
> Or else suddenly all ANALYZE on foreign
> server will be costly since they are running a couple queries before
> falling back to sampling.
>
In all scenarios, the remote_analyze setting defaults to false, and the
remote database should continue to have stats going forward from that
point. I can see value in having a more insistent NOTICE/WARNING message
for situations where a remote ANALYZE fails or the table continues to have
inadequate stats after the remote ANALYZE.
> Why do we need remote_analyze as a flag? I might have missed some
> conversation which led to including this flag. We fetch relkind, so we
> know whether the object on the foreign server supports statistics or
> not.
Well, sorta. We know the database on the other end talks libpq and has the
table pg_class. There are more than a few forks, Redshift and Vertica come
to mind, where this check will succeed, but the design diverges after that
and doesn't have pg_statistic or similar. We have to do deeper checks of
the remote system (matching pg_statistic rows to all the va_cols we expect,
etc.) to verify that we _probably_ have good stats, but we don't truly
know whether the stats will be there until we fetch and import them, and
at that point we're done. And that is part of my discomfort with
StatisticsAreImportable in that it can identify some of the cases where we
_can't_ import stats, but to identify whether we _can_ import the stats we
have just moved all of the functionality of ImportStatistics into
StatisticsAreImportable.
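To illustrate the asymmetry described above, here is a rough Python sketch (every name in it is a hypothetical stand-in, not an identifier from the patch): a StatisticsAreImportable-style pre-check can prove stats are *absent*, but a complete-looking set of pg_statistic rows is still only a probabilistic signal that the import will succeed.

```python
# Hypothetical sketch of a StatisticsAreImportable-style pre-check.
# It can definitively rule imports out, but a passing result only
# means the import will *probably* work; a remote fork (Redshift,
# Vertica, ...) may answer the pg_class probe yet diverge before
# pg_statistic, so the actual fetch can still fail.

def statistics_look_importable(expected_cols, remote_statted_cols):
    """Return False only when we *know* stats can't be imported:
    some column we need has no pg_statistic row on the remote side."""
    missing = set(expected_cols) - set(remote_statted_cols)
    return not missing
```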
> If it supports statistics, we run ANALYZE automatically. I mean
> the user wants us to ANALYZE the foreign table and has also indicated
> that the remote object is capable of storing stats. We should document
> the fact that we analyze if we do not find existing statistics. Why
> would users want to disable running ANALYZE? One possibility is they
> don't want to disturb query plans because of suddenly changing
> statistics. But then they will need to switch auto-analysis off. So
> maybe there is a way to know when it's not acceptable to run ANALYZE
> on the remote server. I feel this flag will put another burden on the
> user who already has to set so many flags in postgres_fdw. That is
> especially true when the flag is useful only the first time, when the
> remote object doesn't have stats - usually there will be stats. If at
> all we have to introduce remote_analyze, let it default to the value
> of fetch_stats.
>
I may have answered this question in the above responses, but
remote_analyze does not default to ON, nor do I envision making it the
default. I think the primary purpose for the flag is situations where the
remote database is often "swapped out" with one that has a fresher set of
the data (via DNS swapping, port swapping, etc) and the new database was
put online before vacuum got around to those tables. I would agree that
running the remote analyze is pulling a big lever, hence only trying it if
we failed to find sufficient existing stats.
> If both - importing statistics and fallback analysis - fail, I think
> we should resort to sampling - since the user has asked for an ANALYZE
> table and sampling will achieve that. As you said, giving warnings at
> every failure will nudge the user to set their options correctly.
>
Sorry if my description was confusing; I should have been clearer that
"fallback analysis" is the regular row sampling. In scenario #4 (the one I
advocate and Nathan advocate[d?]), the decision tree is this:
1. If fetch_stats = off OR StatisticsAreImportable = false, THEN
resort_to_regular_row_sampling and exit normally.
2. attempt_to_fetch_remote_stats. If we fetched and imported remote stats,
THEN exit normally.
3. If remote_analyze = false, THEN resort_to_regular_row_sampling and exit
normally.
4. send_remote_analyze_command. If that fails, THEN WARN and
resort_to_regular_row_sampling and exit normally.
5. (step #2, again, essentially) attempt_to_fetch_remote_stats. If we
fetched and imported remote stats, THEN exit normally.
6. WARN the user of a potentially intractable problem,
resort_to_regular_row_sampling, and exit normally.
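The six steps above can be sketched roughly like this. All of the names are stand-ins for illustration only; the real logic would live in postgres_fdw's C code, and none of these identifiers come from the patch:

```python
# Rough sketch of scenario #4's decision tree, with all remote
# interactions injected as callables so the flow is testable.

def analyze_foreign_table(fetch_stats, remote_analyze,
                          stats_importable, fetch_remote_stats,
                          run_remote_analyze, sample_rows, warn):
    # 1. Opted out, or we can prove the import is impossible: sample.
    if not fetch_stats or not stats_importable():
        return sample_rows()
    # 2. Try to fetch and import the remote stats.
    if fetch_remote_stats():
        return "imported"
    # 3. Not allowed to run ANALYZE on the remote side: sample.
    if not remote_analyze:
        return sample_rows()
    # 4. Ask the remote server to ANALYZE; on failure, warn and sample.
    if not run_remote_analyze():
        warn("remote ANALYZE failed")
        return sample_rows()
    # 5. Retry the fetch now that the remote side should have stats.
    if fetch_remote_stats():
        return "imported"
    # 6. Stats still unusable after a remote ANALYZE: warn and sample.
    warn("stats still missing after remote ANALYZE")
    return sample_rows()
```

Note that regular row sampling is the terminal fallback on every path, so ANALYZE always completes; the flags only control how much remote work is attempted first.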