Re: autovac issue with large number of tables

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kasahara Tatsuhito <kasahara(dot)tatsuhito(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <nasbyj(at)amazon(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovac issue with large number of tables
Date: 2020-11-25 05:17:01
Message-ID: CAD21AoBQdJ2Ro-7mD6=ZXt_2hox=9WpPX5s7ZrkMRaE+DXbULw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
<kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
> > > I wonder if we could have table_recheck_autovac do two probes of the stats
> > > data. First probe the existing stats data, and if it shows the table to
> > > be already vacuumed, return immediately. If not, *then* force a stats
> > > re-read, and check a second time.
> > Does the above mean that the second and subsequent table_recheck_autovac()
> > will be improved to first check using the previous refreshed statistics?
> > I think that certainly works.
> >
> > If that's correct, I'll try to create a patch for the PoC
>
> I still don't know how to reproduce Jim's troubles, but I was able to reproduce
> what was probably a very similar problem.
>
> This problem seems to be more likely to occur in cases where you have
> a large number of tables,
> i.e., a large amount of stats, and many small tables need VACUUM at
> the same time.
>
> So I followed Tom's advice and created a patch for the PoC.
> This patch will enable a flag in the table_recheck_autovac function to use
> the existing stats next time if VACUUM (or ANALYZE) has already been done
> by another worker on the check after the stats have been updated.
> If the tables continue to require VACUUM after the refresh, then a refresh
> will be required instead of using the existing statistics.
>
> I did simple test with HEAD and HEAD + this PoC patch.
> The tests were conducted in two cases.
> (I changed few configurations. see attached scripts)
>
> 1. Normal VACUUM case
> - SET autovacuum = off
> - CREATE tables with 100 rows
> - DELETE 90 rows for each tables
> - SET autovacuum = on and restart PostgreSQL
> - Measure the time it takes for all tables to be VACUUMed
>
> 2. Anti wrap round VACUUM case
> - CREATE brank tables
> - SELECT all of these tables (for generate stats)
> - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> - Consumes a lot of XIDs by using txid_curent()
> - Measure the time it takes for all tables to be VACUUMed
>
> For each test case, the following results were obtained by changing
> autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> Also changing num of tables to 1000, 5000, 10000 and 20000.
>
> Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
> but I think it's enough to ask for a trend.
>
> ===========================================================================
> [1.Normal VACUUM case]
> tables:1000
> autovacuum_max_workers 1: (HEAD) 20 sec VS (with patch) 20 sec
> autovacuum_max_workers 2: (HEAD) 18 sec VS (with patch) 16 sec
> autovacuum_max_workers 3: (HEAD) 18 sec VS (with patch) 16 sec
> autovacuum_max_workers 5: (HEAD) 19 sec VS (with patch) 17 sec
> autovacuum_max_workers 10: (HEAD) 19 sec VS (with patch) 17 sec
>
> tables:5000
> autovacuum_max_workers 1: (HEAD) 77 sec VS (with patch) 78 sec
> autovacuum_max_workers 2: (HEAD) 61 sec VS (with patch) 43 sec
> autovacuum_max_workers 3: (HEAD) 38 sec VS (with patch) 38 sec
> autovacuum_max_workers 5: (HEAD) 45 sec VS (with patch) 37 sec
> autovacuum_max_workers 10: (HEAD) 43 sec VS (with patch) 35 sec
>
> tables:10000
> autovacuum_max_workers 1: (HEAD) 152 sec VS (with patch) 153 sec
> autovacuum_max_workers 2: (HEAD) 119 sec VS (with patch) 98 sec
> autovacuum_max_workers 3: (HEAD) 87 sec VS (with patch) 78 sec
> autovacuum_max_workers 5: (HEAD) 100 sec VS (with patch) 66 sec
> autovacuum_max_workers 10: (HEAD) 97 sec VS (with patch) 56 sec
>
> tables:20000
> autovacuum_max_workers 1: (HEAD) 338 sec VS (with patch) 339 sec
> autovacuum_max_workers 2: (HEAD) 231 sec VS (with patch) 229 sec
> autovacuum_max_workers 3: (HEAD) 220 sec VS (with patch) 191 sec
> autovacuum_max_workers 5: (HEAD) 234 sec VS (with patch) 147 sec
> autovacuum_max_workers 10: (HEAD) 320 sec VS (with patch) 113 sec
>
> [2.Anti wrap round VACUUM case]
> tables:1000
> autovacuum_max_workers 1: (HEAD) 19 sec VS (with patch) 18 sec
> autovacuum_max_workers 2: (HEAD) 14 sec VS (with patch) 15 sec
> autovacuum_max_workers 3: (HEAD) 14 sec VS (with patch) 14 sec
> autovacuum_max_workers 5: (HEAD) 14 sec VS (with patch) 16 sec
> autovacuum_max_workers 10: (HEAD) 16 sec VS (with patch) 14 sec
>
> tables:5000
> autovacuum_max_workers 1: (HEAD) 69 sec VS (with patch) 69 sec
> autovacuum_max_workers 2: (HEAD) 66 sec VS (with patch) 47 sec
> autovacuum_max_workers 3: (HEAD) 59 sec VS (with patch) 37 sec
> autovacuum_max_workers 5: (HEAD) 39 sec VS (with patch) 28 sec
> autovacuum_max_workers 10: (HEAD) 39 sec VS (with patch) 29 sec
>
> tables:10000
> autovacuum_max_workers 1: (HEAD) 139 sec VS (with patch) 138 sec
> autovacuum_max_workers 2: (HEAD) 130 sec VS (with patch) 86 sec
> autovacuum_max_workers 3: (HEAD) 120 sec VS (with patch) 68 sec
> autovacuum_max_workers 5: (HEAD) 96 sec VS (with patch) 41 sec
> autovacuum_max_workers 10: (HEAD) 90 sec VS (with patch) 39 sec
>
> tables:20000
> autovacuum_max_workers 1: (HEAD) 313 sec VS (with patch) 331 sec
> autovacuum_max_workers 2: (HEAD) 209 sec VS (with patch) 201 sec
> autovacuum_max_workers 3: (HEAD) 227 sec VS (with patch) 141 sec
> autovacuum_max_workers 5: (HEAD) 236 sec VS (with patch) 88 sec
> autovacuum_max_workers 10: (HEAD) 309 sec VS (with patch) 74 sec
> ===========================================================================
>
> The cases without patch, the scalability of the worker has decreased
> as the number of tables has increased.
> In fact, the more workers there are, the longer it takes to complete
> VACUUM to all tables.
> The cases with patch, it shows good scalability with respect to the
> number of workers.

It seems a good performance improvement even without the patch of
shared memory based stats collector.

>
> Note that perf top results showed that hash_search_with_hash_value,
> hash_seq_search and
> pgstat_read_statsfiles are dominant during VACUUM in all patterns,
> with or without the patch.
>
> Therefore, there is still a need to find ways to optimize the reading
> of large amounts of stats.
> However, this patch is effective in its own right, and since there are
> only a few parts to modify,
> I think it should be able to be applied to current (preferably
> pre-v13) PostgreSQL.

+1

+
+ /* We might be better to refresh stats */
+ use_existing_stats = false;
}
+ else
+ {

- heap_freetuple(classTup);
+ heap_freetuple(classTup);
+ /* The relid has already vacuumed, so we might be better to
use exiting stats */
+ use_existing_stats = true;
+ }

With that patch, the autovacuum process refreshes the stats in the
next check if it finds out that the table still needs to be vacuumed.
But I guess it's not necessarily true because the next table might be
vacuumed already. So I think we might want to always use the existing
for the first check. What do you think?

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2020-11-25 05:24:14 Re: [doc] plan invalidation when statistics are update
Previous Message Fujii Masao 2020-11-25 05:13:17 Re: [doc] plan invalidation when statistics are update