Re: New GUC autovacuum_max_threshold ?

From: Frédéric Yhuel <frederic(dot)yhuel(at)dalibo(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Subject: Re: New GUC autovacuum_max_threshold ?
Date: 2024-04-26 08:10:20
Message-ID: ba3a146c-1bdb-45f4-a7fa-02713783a811@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Le 25/04/2024 à 22:21, Robert Haas a écrit :
> The analyze case, I feel, is really murky.
> autovacuum_analyze_scale_factor stands for the proposition that as the
> table becomes larger, analyze doesn't need to be done as often. If
> what you're concerned about is the frequency estimates, that's true:
> an injection of a million new rows can shift frequencies dramatically
> in a small table, but the effect is blunted in a large one. But a lot
> of the cases I've seen have involved the histogram boundaries. If
> you're inserting data into a table in increasing order, every new
> million rows shifts the boundary of the last histogram bucket by the
> same amount. You either need those rows included in the histogram to
> get good query plans, or you don't. If you do, the frequency with
> which you need to analyze does not change as the table grows. If you
> don't, then it probably does. But the answer doesn't really depend on
> how big the table is already, but on your workload. So it's unclear to
> me that the proposed parameter is the right idea here at all. It's
> also unclear to me that the existing system is the right idea. 🙂

This is very interesting. And what about ndistinct? I believe it could
be problematic, too, in some (admittedly rare or pathological) cases.

For example, suppose that the actual number of distinct values grows
from 1000 to 200000 after a batch of insertions, for a particular
column. OK, in such a case, the default analyze sampling isn't large
enough to compute a ndistinct close enough to reality anyway. But
without any analyze at all, it can lead to very bad planning - think of
a Nested Loop with a parallel seq scan for the outer table instead of a
simple efficient index scan, because the index scan of the inner table
is overestimated (each index scan cost and number or rows returned).

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2024-04-26 08:18:00 Re: New GUC autovacuum_max_threshold ?
Previous Message Michael Banck 2024-04-26 08:08:33 Re: New GUC autovacuum_max_threshold ?