Re: Autovacuum on partitioned table (autoanalyze)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: yuzuko <yuzukohosoya(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Amit Langote <amitlangote09(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2021-07-22 20:54:58
Message-ID: 20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> On 2021-Apr-07, Alvaro Herrera wrote:
>
> > OK, I bit the bullet and re-did the logic in the way I had proposed
> > earlier in the thread: do the propagation on the collector's side, by
> > sending only the list of ancestors: the collector can read the tuple
> > change count by itself, to add it to each ancestor. This seems less
> > wasteful. Attached is v16 which does it that way and seems to work
> > nicely under my testing.
>
> Pushed with this approach. Thanks for persisting with this.

I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.

I have a few questions / concerns:

1) Somehow it seems like a layering violation to do stuff like
get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
it feels a bit off. Would likely not be too hard to address, e.g. by just
putting some of pgstat_report_anl_ancestors() in partition.c instead.
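
To make that concrete, one possible shape (just a sketch; both function names
below are made up, not existing functions) would be to let partition.c drive
the ancestor walk and hand pgstat.c only the individual (partition, ancestor)
pairs:

    /* sketch for src/backend/catalog/partition.c; names are hypothetical */
    #include "postgres.h"

    #include "catalog/partition.h"
    #include "nodes/pg_list.h"
    #include "pgstat.h"

    void
    partition_report_analyze_ancestors(Oid relid)
    {
        /* the pg_inherits walk stays here, out of pgstat.c */
        List       *ancestors = get_partition_ancestors(relid);
        ListCell   *lc;

        foreach(lc, ancestors)
        {
            /* hypothetical narrow pgstat entry point, one ancestor at a time */
            pgstat_report_anl_ancestor(relid, lfirst_oid(lc));
        }

        list_free(ancestors);
    }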

2) Why does it make sense that autovacuum sends a stats message for every
partition in the system that had any changes since the last autovacuum
cycle? On a database with a good number of objects / a short naptime we'll
often end up sending messages for the same set of tables from separate
workers, because they don't yet see the concurrent updates to
tabentry->changes_since_analyze_reported.

3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point, not to
speak of actually having written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be written
out, and we'll just re-read the same state we read before.

It's pretty expensive to re-read the stats file in some workloads, so I'm a
bit concerned that we end up significantly increasing the amount of stats
updates/reads, without actually gaining anything reliable?

4) In the shared mem stats patch I went to a fair bit of trouble to try to get
rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
systems). For that to work, pending stats can only be "staged" while holding
a lock on a relation that prevents the relation from being concurrently
dropped (pending stats increment a refcount for the shared stats object,
which ensures that we don't lose track of the fact that a stats object has
been dropped, even when stats only get submitted later).

I'm not yet clear on how to make this work for
pgstat_report_anl_ancestors() - but I can probably find a way. But it does
feel a bit off to issue stats messages for tables we're not sure still exist.

I'll go and read through the thread, but my first thought is that having a
hashtable in do_autovacuum() that accumulates stats for partitioned tables
would be a good bit more efficient than the current approach? We already
build a hashtable with an entry for each toast table; compared to that, a
hashtable with an entry for each partitioned table doesn't seem like it'd be
a problem?
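
Roughly, such a hash might look like the sketch below (the struct and
function names are made up for illustration; only hash_create()/hash_search()
and get_partition_ancestors() are existing APIs), built once per worker in
do_autovacuum() and filled while walking the leaf partitions:

    /* sketch for src/backend/postmaster/autovacuum.c; names are hypothetical */
    #include "postgres.h"

    #include "catalog/partition.h"
    #include "nodes/pg_list.h"
    #include "pgstat.h"
    #include "utils/hsearch.h"

    /* one entry per partitioned table, keyed by its OID */
    typedef struct av_partstat_entry
    {
        Oid         relid;          /* hash key: partitioned table OID */
        PgStat_Counter changes_since_analyze;   /* summed from leaf partitions */
    } av_partstat_entry;

    static HTAB *
    av_create_partstat_hash(void)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(av_partstat_entry);

        return hash_create("autovacuum partitioned-table stats", 32, &ctl,
                           HASH_ELEM | HASH_BLOBS);
    }

    /*
     * Fold a leaf partition's changes_since_analyze into the entry of each
     * of its ancestors, so no extra stats messages need to be sent at all.
     */
    static void
    av_accumulate_partition_changes(HTAB *partstats, Oid leaf_relid,
                                    PgStat_Counter changes)
    {
        List       *ancestors = get_partition_ancestors(leaf_relid);
        ListCell   *lc;

        foreach(lc, ancestors)
        {
            Oid         ancestor = lfirst_oid(lc);
            bool        found;
            av_partstat_entry *entry;

            entry = hash_search(partstats, &ancestor, HASH_ENTER, &found);
            if (!found)
                entry->changes_since_analyze = 0;
            entry->changes_since_analyze += changes;
        }
        list_free(ancestors);
    }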

With a small bit of extra work that could even avoid the need for the
additional pass through pg_class. Do the partitioned table data-gathering as
part of the "collect main tables to vacuum" pass, and then do one of

a) make the perform-analyze decision purely off the contents of that
partitioned-table hash
b) fetch the RELOID syscache entry by OID and then decide based on that
c) handle partitioned tables as part of the "check TOAST tables" pass - it's
not like we gain a meaningful amount of efficiency by using a ScanKey to
filter for RELKIND_TOASTVALUE, given that there's no index, and that an
index wouldn't commonly be useful given the percentage of toast tables in
pg_class
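
For option a) the decision could then reuse the usual analyze-threshold
formula (base threshold + scale factor * reltuples), just fed from the hash.
A sketch, assuming reltuples for the partitioned table is available from its
pg_class entry or summed from its partitions, and reusing the hypothetical
av_partstat_entry from above:

    /* sketch; same threshold shape as in relation_needs_vacanalyze() */
    static bool
    av_partitioned_needs_analyze(av_partstat_entry *entry, float4 reltuples,
                                 int anl_base_thresh, float4 anl_scale_factor)
    {
        float4      anlthresh = (float4) anl_base_thresh +
            anl_scale_factor * reltuples;

        return entry->changes_since_analyze > anlthresh;
    }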

Partitioning makes it a bigger issue that do_autovacuum() does multiple passes
through pg_class (since it makes scenarios in which pg_class is large more
common), so it's not great that partitioning also increases the number of such
passes.

Greetings,

Andres Freund
