Re: Autovacuum on partitioned table (autoanalyze)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Geoghegan <pg(at)bowt(dot)ie>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: yuzuko <yuzukohosoya(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Amit Langote <amitlangote09(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2021-07-30 01:03:55
Message-ID: 20210730010355.6yodvn2ag3arfihi@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

CCing RMT because I think we need to do something about this for v14.

On 2021-07-27 19:23:42 -0700, Andres Freund wrote:
> On 2021-07-22 13:54:58 -0700, Andres Freund wrote:
> > On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> > > On 2021-Apr-07, Alvaro Herrera wrote:
> > >
> > > > OK, I bit the bullet and re-did the logic in the way I had proposed
> > > > earlier in the thread: do the propagation on the collector's side, by
> > > > sending only the list of ancestors: the collector can read the tuple
> > > > change count by itself, to add it to each ancestor. This seems less
> > > > wasteful. Attached is v16 which does it that way and seems to work
> > > > nicely under my testing.
> > >
> > > Pushed with this approach. Thanks for persisting with this.
> >
> > I'm looking at this in the context of rebasing & polishing the shared
> > memory stats patch.
> >
> > I have a few questions / concerns:
>
> Another one, and I think this might warrant thinking about for v14:
>
> Isn't this going to create a *lot* of redundant sampling? Especially if you
> have any sort of nested partition tree. In the most absurd case a partition
> with n parents will get sampled n times, solely due to changes to itself.
>
> Look at the following example:
>
> BEGIN;
> DROP TABLE if exists p;
> CREATE TABLE p (i int) partition by range(i);
> CREATE TABLE p_0 PARTITION OF p FOR VALUES FROM ( 0) to (5000) partition by range(i);
> CREATE TABLE p_0_0 PARTITION OF p_0 FOR VALUES FROM ( 0) to (1000);
> CREATE TABLE p_0_1 PARTITION OF p_0 FOR VALUES FROM (1000) to (2000);
> CREATE TABLE p_0_2 PARTITION OF p_0 FOR VALUES FROM (2000) to (3000);
> CREATE TABLE p_0_3 PARTITION OF p_0 FOR VALUES FROM (3000) to (4000);
> CREATE TABLE p_0_4 PARTITION OF p_0 FOR VALUES FROM (4000) to (5000);
> -- create some initial data
> INSERT INTO p select generate_series(0, 5000 - 1) data FROM generate_series(1, 100) reps;
> COMMIT;
>
> UPDATE p_0_4 SET i = i;
>
>
> Whenever the update is executed, all partitions will be sampled at least twice
> (once for p and once for p_0), with p_0_4 sampled three times.
>
> Of course, this is an extreme example, but it's not hard to imagine cases
> where v14 will cause the number of auto-analyzes increase sufficiently to bog
> down autovacuum to a problematic degree.
>
>
> Additionally, while analyzing all child partitions for a partitioned tables
> are AccessShareLock'ed at once. If a partition hierarchy has more than one
> level, it actually is likely that multiple autovacuum workers will end up
> processing the ancestors separately. This seems like it might contribute to
> lock exhaustion issues with larger partition hierarchies?

I started to write a patch rejiggering autovacuum.c portion of this
change. While testing it I hit the case of manual ANALYZEs leaving
changes_since_analyze for partitioned tables in a bogus state - without a
minimally invasive way to fix that. After a bit of confused staring I realized
that the current code has a very similar problem:

Using the same setup as above:

INSERT INTO p VALUES (0,0); /* repeat as many times as desired */
ANALYZE p_0_0;

At this point the system will have lost track of the changes to p_0_0, unless
an autovacuum worker was launched between the INSERTs and the ANALYZE (which
would cause pgstat_report_anl_ancestors() to report the change count upwards).

There appears to be code trying to address that, but I don't see how it
ever does anything meaningful?

/*
* Now report ANALYZE to the stats collector. For regular tables, we do
* it only if not doing inherited stats. For partitioned tables, we only
* do it for inherited stats. (We're never called for not-inherited stats
* on partitioned tables anyway.)
*
* Reset the changes_since_analyze counter only if we analyzed all
* columns; otherwise, there is still work for auto-analyze to do.
*/
if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
pgstat_report_analyze(onerel, totalrows, totaldeadrows,
(va_cols == NIL));

/*
* If this is a manual analyze of all columns of a permanent leaf
* partition, and not doing inherited stats, also let the collector know
* about the ancestor tables of this partition. Autovacuum does the
* equivalent of this at the start of its run, so there's no reason to do
* it there.
*/
if (!inh && !IsAutoVacuumWorkerProcess() &&
(va_cols == NIL) &&
onerel->rd_rel->relispartition &&
onerel->rd_rel->relkind == RELKIND_RELATION &&
onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
{
pgstat_report_anl_ancestors(RelationGetRelid(onerel));
}

The pgstat_report_analyze() triggers pgstat_recv_analyze() to reset the
counter that pgstat_recv_anl_ancestors() would use to report changes
upwards:

/*
* If commanded, reset changes_since_analyze to zero. This forgets any
* changes that were committed while the ANALYZE was in progress, but we
* have no good way to estimate how many of those there were.
*/
if (msg->m_resetcounter)
{
tabentry->changes_since_analyze = 0;
tabentry->changes_since_analyze_reported = 0;
}

And if one instead inverts the order of pgstat_report_analyze() and
pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
ANALYZE of the partition root results in the partition root having a non-zero
changes_since_analyze afterwards. expand_vacuum() causes child partitions to be
added to the list of relations, which *first* causes the partition root to be
analyzed, and *then* partitions. The partitions then report their
changes_since_analyze upwards.

I don't think the code as is is fit for v14. It looks like it was rewritten
with a new approach just before the freeze ([1]), and as far as I can tell the
concerns I quoted above weren't even discussed in the whole thread. Alvaro,
any comments?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-07-30 01:12:33 Re: speed up verifying UTF-8
Previous Message Michael Paquier 2021-07-30 00:48:47 Re: Replace l337sp34k in comments.