Re: Autovacuum on partitioned table (autoanalyze)

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, yuzuko <yuzukohosoya(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Amit Langote <amitlangote09(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2021-08-11 22:33:07
Message-ID: 202108112233.xpz74i5pkgkr@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After thinking about the described issues for a while, my proposal is to
completely revamp the way this feature works. See below.

Now, the proposal seems awfully invasive, but it's *the* way I see to
avoid the pgstat traffic. For pg14, maybe we can live with it, and just
use the smaller patches that Horiguchi-san and I have posted, which
solve the other issues; also, Euler Taveira suggested that we could add
a reloption to turn the feature off completely for some tables (maybe
make it off by default and have a reloption to turn it on for specific
partition hierarchies), so that it doesn't cause unduly pain for people
with large partitioning hierarchies.

* PgStat_StatTabEntry gets a new "Oid reportAncestorOid" member. This is
the OID of a single partitioned ancestor, to which the changed-tuple
counts are propagated up.
Normally this is the topmost ancestor; but if the user wishes some
intermediate ancestor to receive the counts they can use
ALTER TABLE the_intermediate_ancestor SET (autovacuum_enabled=on).

* Corollary 1: for the normal case of single-level partitioning, the
parent partitioned table behaves as currently.

* Corollary 2: for multi-level partitioning with no especially
configured intermediate ancestors, only the leaf partitions and the
top-level partitioned table will be analyzed. Intermediate ancestors
are ignored by autovacuum.

* Corollary 3: for multi-level partitioning with some intermediate
ancestor(s) marked as autovacuum_enabled=on, that ancestor will
receive all the counts from all of its partitions, so it will get
analyzed itself; and it'll also forward those counts up to its

we send a message to collector with the analyze-ancestor OID.

* Backends running manual ANALYZE as well as autovacuum will examine
each table's "relispartition" flag and its pgstat table entry; if it
is a partition and doesn't have reportAncestorOid set, determine which
ancestor should analyze counts be reported to; include this OID in the
regular PgStat_MsgAnalyze. This fixes the situation after a crash or
other stats reset. Also, it's not unduly expensive to do, because
it's only in the rare case that the value sent by ATTACH was lost.

* Possible race condition in the previous step may cause multiple
backends to send the same info. Not a serious problem so we don't try
to handle it.

* When tuple change counts for a partition are received by
pgstat_recv_tabstat, they are propagated up to the indicated parent
table in addition to being saved in the table itself.
(Bonus points: when a table is attached or detached as a partition,
the live tuples count is propagated to the newly acquired parent.)

What do people think of this?

Álvaro Herrera 39°49'30"S 73°17'W —

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2021-08-11 22:45:47 Re: Use extended statistics to estimate (Var op Var) clauses
Previous Message Thomas Munro 2021-08-11 22:24:09 Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?