Re: Autovacuum on partitioned table (autoanalyze)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: David Steele <david(at)pgmasters(dot)net>, yuzuko <yuzukohosoya(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, Amit Langote <amitlangote09(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2021-03-30 02:09:44
Message-ID: 840e6c56-4e05-8cdd-5478-9f8ad8130c63@enterprisedb.com
Lists: pgsql-hackers

Hi,

I took a look at this patch. It does not apply because of 5f8727f5a67,
so a rebase is needed. But I want to talk about the general approach, so
that does not matter much.

The thread is fairly long, both in terms of number of messages and time
(started in 2019), so let me restate my understanding of the problem and
what the patch aims to do.

The problem is that autovacuum never analyzes non-leaf relations in
partition hierarchies, because they never get modified and so the value
of changes_since_analyze remains 0. This applies both to partitioning
based on inheritance and the new fancy declarative partitioning. The
consequence is that we never have accurate statistics (MCV, histograms
and so on) for the parent, which may lead to poor query plans in cases
when we don't use the child statistics for some reason.
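
For context, the autoanalyze decision looks roughly like this
(paraphrasing relation_needs_vacanalyze; variable names are approximate),
so a relation whose changes_since_analyze never moves off 0 never crosses
the threshold:

/* paraphrased from relation_needs_vacanalyze() */
anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
anltuples = tabentry->changes_since_analyze;   /* stays 0 for non-leaf rels */

*doanalyze = (anltuples > anlthresh);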

The patch aims to fix that by propagating changes_since_analyze to the
parent relations, so that autovacuum properly considers whether those
non-leaf relations need an analyze.

I think the goal is right, and propagating changes_since_analyze is the
right solution, but as coded the patch has a couple of issues that may
cause trouble in practice.

Firstly, the patch propagates the changes_since_analyze values from
do_analyze_rel, i.e. from the worker after it analyzes the relation.
That may easily lead to unnecessary analyzes - consider a partitioned
table p with 4 child relations:

p1 [reltuples=1M, changes_since_analyze=400k]
p2 [reltuples=1M, changes_since_analyze=90k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

With the default analyze threshold (10%) this means autoanalyze of p1,
and then (in the next round) analyze of the whole partitioned table,
because 400k is 10% of 4M. So far so good - we're now in this state:

p1 [reltuples=1M, changes_since_analyze=0]
p2 [reltuples=1M, changes_since_analyze=90k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

Let's do ~310k more modifications to p2:

p1 [reltuples=1M, changes_since_analyze=0]
p2 [reltuples=1M, changes_since_analyze=400k]
p3 [reltuples=1M, changes_since_analyze=90k]
p4 [reltuples=1M, changes_since_analyze=90k]

Now p2 gets analyzed, and the value gets propagated to p, triggering
another analyze. But that's bogus - we've already seen 90k of those rows
in the last analyze, so the "actual" changes_since_analyze is just 310k,
which should not have triggered the analyze.
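
Just to spell out the arithmetic (default 10% scale factor, ignoring the
small base threshold):

parent_anlthresh = 0.1 * 4000000;    /* 400k - 10% of the parent's 4M rows */
propagated       = 400000;           /* p2's whole counter - triggers      */
actually_new     = 400000 - 90000;   /* 310k - below the parent threshold  */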

I could invent much more extreme examples with more partitions, and/or
with multiple autovacuum workers processing the leaf rels concurrently.

This seems like quite a serious issue, because analyzes on partitioned
tables sample all the partitions, which is rather expensive. That is not
an issue introduced by this patch, of course, but it's good to keep it
in mind and not make matters worse.

Note: I do have some ideas about how to improve that, I've started a
separate thread about it [1].

IMHO the primary issue is that the patch is trying to report the counts
from the workers, and it does so incrementally, after the fact. It tries
to prevent the issue by not propagating the counts to parents analyzed in
the same round, but that doesn't seem sufficient:

- There's no guarantee how long ago the parent was analyzed. Maybe it
was 1 second ago, or maybe it was 24h ago and there have been many new
modifications since then?

- The hash table is per worker, so who knows what the other autovacuum
workers did?

So not really a good solution, I'm afraid.

I propose a different approach - instead of propagating the counts in
do_analyze_rel for individual leaf tables, let's do that in bulk in
relation_needs_vacanalyze. Before the (existing) first pass over
pg_class, we can add a new one, propagating counts from leaf tables to
parents. I'd imagine something like this:

while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
{
    Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);

    /* ... find all ancestors for classForm ... */

    pgstat_propagate_changes(classForm, ancestors);
}
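
For declarative partitioning, the "find all ancestors" step could
presumably use get_partition_ancestors() (inheritance-only setups would
need different handling); pgstat_propagate_changes() is of course just
the imagined function from above:

if (classForm->relispartition)
{
    /* list of ancestor OIDs, nearest parent first */
    List   *ancestors = get_partition_ancestors(classForm->oid);

    pgstat_propagate_changes(classForm, ancestors);
    list_free(ancestors);
}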

The pgstat_propagate_changes() simply informs the pgstat collector that
classForm has certain ancestors, and the collector propagates the value
to all of them. There's a problem, though - we can't reset the value for
the leaf table, because we still need it to check whether the leaf needs
analyze, but we also don't want to send it again next time. But we can
add another counter, tracking the part of changes_since_analyze we have
already propagated, and propagate only the difference. That is, we now
have this:

PgStat_Counter changes_since_analyze;
PgStat_Counter changes_since_analyze_reported;

So for example we start with

changes_since_analyze = 10000;
changes_since_analyze_reported = 0;

and we propagate 10k to parents:

changes_since_analyze = 10000;
changes_since_analyze_reported = 10000;

but we don't analyze anything, and we accumulate 5k more changes:

changes_since_analyze = 15000;
changes_since_analyze_reported = 10000;

so now we propagate only the 5k delta. And so on. It's not exactly an
atomic change (we still do this per relation), but it's "bulk" in the
sense that we force the propagation and don't wait until the leaf
happens to be analyzed.
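
To make that concrete, a minimal sketch of the collector-side bookkeeping
(the function name and the way ancestors are passed in are invented here
for illustration):

static void
pgstat_recv_propagate_changes(PgStat_StatTabEntry *leaf,
                              PgStat_StatTabEntry **ancestors,
                              int nancestors)
{
    PgStat_Counter delta;

    /* only the part we haven't reported to the parents yet */
    delta = leaf->changes_since_analyze - leaf->changes_since_analyze_reported;
    if (delta <= 0)
        return;

    for (int i = 0; i < nancestors; i++)
        ancestors[i]->changes_since_analyze += delta;

    /* remember we already reported these changes */
    leaf->changes_since_analyze_reported += delta;
}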

It might need to reread the stats file, I think, to get the incremented
values, but that seems acceptable.

We may need to "sync" the counts for individual relations in a couple of
places (e.g. after the worker is done with the leaf, it should propagate
the remaining delta before resetting the values to 0), as sketched below.
Maybe multi-level partitioning needs some additional handling, not sure.
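
A rough idea of that sync, again just a sketch reusing the hypothetical
pgstat_propagate_changes() from above - push whatever has not been
propagated yet, then reset both counters:

/* hypothetical: when the worker is done analyzing a leaf relation */
delta = tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported;
if (delta > 0)
    pgstat_propagate_changes(classForm, ancestors);

tabentry->changes_since_analyze = 0;
tabentry->changes_since_analyze_reported = 0;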

regards

[1] https://commitfest.postgresql.org/33/3052/

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
