Re: Optional skipping of unchanged relations during ANALYZE?

From: VASUKI M <vasukianand0119(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: Sami Imseih <samimseih(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Andreas Karlsson <andreas(at)proxel(dot)se>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>
Subject: Re: Optional skipping of unchanged relations during ANALYZE?
Date: 2026-03-02 12:48:05
Message-ID: CAE2r8H7hYGYi4QM85Q7bxs4RbT0Vn63c9ONFbwhAjuSGDzah_A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Robert, Sami, Ilia, and everyone,

Thank you all for the detailed review and thoughtful feedback. I have
carefully gone through the comments and revised the patch accordingly. I
truly appreciate the guidance — it helped clarify both the design and the
implementation details.

Regarding Robert’s point about the two distinct use cases, I agree that
MISSING_STATS_ONLY and MODIFIED_STATS represent separate concerns.
MISSING_STATS_ONLY is catalog-driven and persistent in nature, ensuring
that newly created tables, newly added columns, and newly defined extended
statistics are not left without statistics. MODIFIED_STATS, on the other
hand, is more closely aligned with modification thresholds and
autoanalyze-like behavior, which is transient and threshold-based.

Keeping these concerns separate makes the semantics clearer and easier to
reason about. In particular, as Robert mentioned, MISSING_STATS_ONLY must
remain a standalone flag for scenarios such as integration into vacuumdb,
where ensuring that missing statistics are generated is the primary goal.

To align with the CommitFest process, I have created a separate entry for
this feature here:[1]

I will treat MODIFIED_STATS as a separate patch and discussion thread.

Addressing Sami’s technical comments:

1. Duplicate examine_attribute() calls

You were absolutely correct that the earlier version of the patch caused
examine_attribute() to be invoked twice for each attribute: once in
relation_has_missing_column_stats() and again during normal attribute
processing. That approach was inefficient and awkward.
In the revised version, I have removed the early relation-level column
scanning logic. Instead, the missing-statistics check is now integrated
directly inside examine_attribute(). The function now takes a boolean
missing_stats_only parameter. When this option is specified,
examine_attribute() performs the pg_statistic lookup for that specific
attribute. If statistics already exist, the function simply returns NULL,
causing the column to be skipped naturally.

This ensures:
-Each attribute is examined only once.
-The missing-statistics logic fits naturally into the existing flow.
-We avoid redundant catalog lookups.

Inside do_analyze_rel(), after collecting vacattrstats, we now check
whether:
-attr_cnt == 0
-no expression index attributes remain
-and there are no missing extended statistics
If all of these are true under MISSING_STATS_ONLY, the relation is skipped
cleanly.This follows the structure you suggested and simplifies the overall
design.

2. Test coverage

I have expanded the regression tests to cover the scenarios you
mentioned.This ensures that MISSING_STATS_ONLY behaves correctly across
more complex schema configurations.

3. Logging behavior

Thank you for the guidance on logging.I have removed the DEBUG-level elog()
calls introduced in earlier versions. Logging is now consistent with
existing ANALYZE behavior and uses ereport().
When VERBOSE mode is enabled and a relation is skipped due to
MISSING_STATS_ONLY, the output now includes a single INFO-level message of
the form:

INFO: Skipping analyzing "database.namespace.relation"

This matches the style used elsewhere in do_analyze_rel() and avoids
unnecessary additional details when no statistics are collected.There is no
additional logging when a relation is processed normally, keeping behavior
aligned with existing ANALYZE semantics.

Regarding autoanalyze and MODIFIED_STATS
As Ilia clarified, autoanalyze decisions are purely threshold-driven and do
not include missing-statistics logic. That reinforces the decision to keep
MISSING_STATS_ONLY separate and explicit.[2]

If there are further suggestions regarding semantics, naming, or additional
edge cases that should be tested, I would be very happy to incorporate them.
Thank you again for your time and detailed review.

Regards,
Vasuki M
C-DAC,Chennai

[1] https://commitfest.postgresql.org/patch/6516/
[2]
https://www.postgresql.org/message-id/flat/aZSm77WEh8pxQYtf%40nathan#1ab5c06a7d2247d90e71fb995fa21a39

>

Attachment Content-Type Size
v6-0001-ANALYZE-add-MISSING_STATS_ONLY-option.patch text/x-patch 16.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Geier 2026-03-02 12:50:37 Re: Convert NOT IN sublinks to anti-joins when safe
Previous Message Tatsuya Kawata 2026-03-02 12:47:23 Re: [Patch]Add tab completion for DELETE ... USING