Re: Optional skipping of unchanged relations during ANALYZE?

From: Robert Treat <rob(at)xzilla(dot)net>
To: Sami Imseih <samimseih(at)gmail(dot)com>
Cc: VASUKI M <vasukianand0119(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-01-24 21:36:38
Message-ID: CAJSLCQ3Z9cM2eZNa4aOnLmLyiZmSDrZH2xQm1RfT4PdKWo0ZLg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 23, 2026 at 9:31 PM Sami Imseih <samimseih(at)gmail(dot)com> wrote:
>
> Thanks for the detailed summary!
>
> It is important to point out that this feature is trying to do 2 distinct
> things in 1 command. run analyze under when either one of these conditions
> is true:
>
> 1/ Table has not been analyzed yet.
> 2/ Table has been modified.
>

Maybe this is all an aside, but I don't think that was the vision for
what the OP was trying to do with his patch, in that sense he was
approaching it from a different angle, and I've been reading this
thread trying to decide if people are just talking past each other.
But after thinking about it some more, I think the above might be the
more useful mental model for the discussion.

> > Thanks a lot for the detailed feedback — this has been very helpful.Answering to all mails in one.
> >
> > A few clarifications on intent and scope, and how this relates to the points raised:
> >
> > Autovacuum overlap
> > I agree there is some conceptual overlap with autovacuum’s analyze decision logic.
> > The intent here is not to replace or duplicate autovacuum heuristics, but to reduce
>
> Yes, I agree with this.
>
> > I agree that n_mod_since_analyze == 0 is a very simple condition
> > and not “smart” in the general sense. That is intentional for now.
> > This option is not trying to answer when statistics should be refreshed optimally,
> > but only to skip relations that are known to be unchanged since the last analyze.
> > If even a single tuple is modified, SMART ANALYZE will still re-run, preserving
> > conservative behavior.
>
> Yes, this is my concern. Why would I want to analyze if 1 row or a negligible
> amount of rows are modified? I understand that this feature is trying to
> keep the decision making very simple, but I think it's too simple to actually
> be helpful in addressing the wasted effort of an ANALYZE command.
>
> > Tables never analyzed
> > As Christoph and Ilia pointed out earlier, skipping tables that were never analyzed would be incorrect.
> > The current logic explicitly avoids that by requiring last_analyze or last_autoanalyze to be present
> > before skipping. Tables without prior statistics are always analyzed.
>
> I agree with this, but I think it's more than just tables that have
> not been analyzed.
> What if a new column is added after the last (auto)analyze. Would we not want to
> trigger an analyze in that case?
>

Well, I don't know that we are "triggering" anything, but this is
definitely a case where we have "missing stats".

> > Relation to vacuumdb --missing-stats-only
> > I agree this is related but slightly different in intent. --missing-stats-only
> > answers “does this table have any statistics at all?”, while SMART ANALYZE
> > answers “has this table changed since the last statistics collection?”. Both seem
> > useful, but they target different use cases. I see SMART ANALYZE primarily
> > as a performance optimization for repeated manual ANALYZE runs on mostly-static schemas.
>
> SMART ANALYZE is trying to answer 2 questions "which table does not
> have any statistics at all"
> and "has this table changed since the last statistics collection?”, right?
>
> So, maybe they need to be 2 separate options.
>
> > Although as sami said this SMART is not smart enough as it should be ,
> > I will change name accordingly in the further patches
>
> Yup, I am not too fond of SMART in the name. Also, then name itself
> is vague. SKIP_LOCKED and BUFFER_USAGE_LIMIT on the other
> hand tell you exactly what they[re used for.
>

So, tossing out a new proposal here, which is to offer ANALYZE with 2
new options... MISSING_STATS and MODIFIED_STATS.

When MISSING_STATS is passed, we attempt to analyze only tables that
have missing stats, essentially implementing a version of
--missing-stats-only but for the ANALYZE command. In successive runs,
this should reduce towards a no-op, although we need to decide what to
do about system tables, which, iirc --missing-stats-only always
assumes to be true, but this version probably doesn't want to assume
that.

When MODIFIED_STATS is passed, we would instead only analyze tables
where some threshold of rows has been modified. I feel like the most
obvious choice for this calculation would be based on a formula like
"analyze threshold = analyze base threshold + analyze scale factor *
number of tuples". Astute observers will note that this is the same
threshold used by autoanalyze, which means if you had the same
defaults you are just doing the work manually that autoanalyze would
eventually get around to doing (which seems potentially useful on its
own). But also if these were based on gucs, the OP could modify those
gucs to achieve their desired behavior, ie.
set analyze_base_threshold=1; set analyze_scale_factor=0; analyze
(modified_stats); // this should analyze anything with 1 modified row
Granted, I don't like that it is both more wordy than the original
idea, and that we would need to add new gucs, but this would be pretty
flexible.

Robert Treat
https://xzilla.net

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Aditya Gollamudi 2026-01-24 22:27:04 Re: [PATCH] Refactor *_abbrev_convert() functions
Previous Message David E. Wheeler 2026-01-24 21:29:00 Re: ABI Compliance Checker GSoC Project