| From: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
|---|---|
| To: | Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> |
| Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, jkatz(at)postgresql(dot)org, nathandbossart(at)gmail(dot)com |
| Subject: | Re: Import Statistics in postgres_fdw before resorting to sampling. |
| Date: | 2026-01-07 06:04:34 |
| Message-ID: | CADkLM=ezV0_oUCxK=A2TmaAuXvWs0+i-H8fSNU3BRY=dBuDhQQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
>
> > I agree, if there is no fallback, then the default should be false. When
> I was initially brainstorming this patch, Nathan Bossart had suggested
> making it the default because 1) that would be an automatic benefit to
> users and 2) the cost for attempting to import stats was small in
> comparison to a table stample, so it was worth the attempt. I still want
> users to get that automatic benefit, but if there is no fallback to
> sampling then the default only makes sense as false.
>
> I think that the FDW API that I proposed could actually allow us to
> fall back to sampling, by modifying StatisticsAreImportable so that it
> also checks if 1) there are statistics on the remote server and 2) the
> data is fresh enough, and if so, returns true; otherwise, returns
> false; in the latter case we could fall back to sampling. And if we
> modified it as such, I think we could change the default to true.
> (Checking #2 is necessary to avoid importing stale data, which would
> degrade plan quality.)
I've given this some more thought.
First, we'd have to add the va_cols param to StatisticsAreImportable, which
isn't itself terrible.
Then, we'd have to determine that there are stats available for every
mapped column (filtered by va_cols, if any). But the only way to do that is
to query the pg_stats view on the remote end, and if we have done that,
then we've already fetched the stats. Yes, we could avoid selecting the
actual statistical values, and that would save some network bandwidth at
the cost of having to do the query again with stats. So I don't really see
the savings.
Also, the pg_stats view is our security-barrier black box into statistics,
and it gives no insight into how recently those stats were acquired. We
could poke pg_stat_all_tables, assuming we can even query it, and then make
a value judgement on the value of (CURRENT_TIMESTAMP -
GREATEST(last_analyze, last_autoanalyze), but that value is highly
subjective.
I suppose we could move all of the statistics fetching
into StatisticsAreImportable, And carry those values forward if they are
satisfactory. That would leave ImportStatistics() with little to do other
than form up the calls to pg_restore_*_stats(), but those could still fail,
and at that point we'd have no way to fall back to sampling and analysis.
I really want to make sampling fallback possible.
Anyway, here's v8, incorporating the documentation feedback and Matheus's
notes.
| Attachment | Content-Type | Size |
|---|---|---|
| v8-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch | text/x-patch | 42.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Dewei Dai | 2026-01-07 06:09:57 | Re: Is this a memory leak in libpqrcv_processTuples()? |
| Previous Message | sunil s | 2026-01-07 06:04:20 | Re: Avoid corrupting DefElem nodes when parsing publication_names and publish options |