Re: Autovacuum on partitioned table (autoanalyze)

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: yuzukohosoya(at)gmail(dot)com, pryzby(at)telsasoft(dot)com, daniel(at)yesql(dot)se, amitlangote09(at)gmail(dot)com, masahiko(dot)sawada(at)2ndquadrant(dot)com, laurenz(dot)albe(at)cybertec(dot)at, pgsql-hackers(at)lists(dot)postgresql(dot)org, stark(at)mit(dot)edu
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2020-11-30 16:07:12
Message-ID: 20201130160712.GA20727@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I looked at both Yuzuko Hosoya's patch and Kyotaro Horiguchi's, and
think we're doing things in a quite complicated manner, which perhaps
could be done more easily.

Hosoya's patch has pgstat_report_analyze call pgstat_get_tab_entry() for
the table being vacuumed, then obtains the list of ancestors, and then
sends for each ancestor a new message containing the partition's
changes_since_analyze for that ancestor. When stat collector receives
that message, it adds the number to the ancestor's m_changed_tuples.

Horiguchi's doing a similar thing, only differently: it is do_analyze_rel
that reads the count from the collector (this time by calling SQL
function pg_stat_get_mod_since_analyze) and then sends number back to
the collector for each ancestor.

I suggest that a better way to do this, is to forget about the new
"partchanges" message completely. Instead, let's add an array of
ancestors to the analyze message (borrowing from PgStat_MsgFuncstat).
Something like this:

#define PGSTAT_NUM_ANCESTORENTRIES \
((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(Oid) - sizeof(bool) - \
sizeof(bool) - sizeof(TimestampTz) - sizeof(PgStat_Counter) - \
sizeof(PgStat_Counter) - sizeof(int)) / sizeof(Oid))
typedef struct PgStat_MsgAnalyze
{
PgStat_MsgHdr m_hdr;
Oid m_databaseid;
Oid m_tableoid;
bool m_autovacuum;
bool m_resetcounter;
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
int m_nancestors;
Oid m_ancestors[PGSTAT_NUM_ANCESTORENTRIES];
} PgStat_MsgAnalyze;

For non-partitions, m_nancestors would be 0, so the message would be
handled as today. For partitions, the array carries the OID of all
ancestors. When the collector receives this message, first it looks up
the pgstat entries for each ancestors in the array, and it adds the
partition's current changes_since_analyze to the ancestor's
changes_since_analyze. Then it does things as currently, including
reset the changes_since_analyze counter for the partition.

Key point in this is that we don't need to read the number from
collector into the backend executing analyze. We just *send* the data
about ancestors, and the collector knows what to do with it.

One possible complaint is: what if there are more ancestors that fit in
the message? I propose that this problem can be ignored, since in order
to hit this, you'd need to have (1000-8-4-4-1-1-8-8-8-4)/4 = 238
ancestors (if my math is right). I doubt we'll hit the need to use 238
levels of partitionings before a stat collector rewrite occurs ...

(It is possible to remove that restriction by doing more complicated
things such as sending the list of ancestor in a new type of message
that can be sent several times, prior to the analyze message itself, but
I don't think this is worth the trouble.)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Lepikhov 2020-11-30 16:19:11 Cost overestimation of foreign JOIN
Previous Message Fujii Masao 2020-11-30 16:04:18 Re: Feature improvement for pg_stat_statements