Re: autovac issue with large number of tables

From: Kasahara Tatsuhito <kasahara(dot)tatsuhito(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jim Nasby <nasbyj(at)amazon(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: autovac issue with large number of tables
Date: 2020-09-10 09:29:09
Message-ID: CAP0=ZVJX1QTjbauM4nCT9Eccvog1Ab+WzJ5_jjymvgaeU8uEtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Therefore, we expect this patch [1] to be committed for its original
purpose, as well as to improve autovacuum from v14 onwards.Hi,

On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
<kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
> > > I wonder if we could have table_recheck_autovac do two probes of the stats
> > > data. First probe the existing stats data, and if it shows the table to
> > > be already vacuumed, return immediately. If not, *then* force a stats
> > > re-read, and check a second time.
> > Does the above mean that the second and subsequent table_recheck_autovac()
> > will be improved to first check using the previous refreshed statistics?
> > I think that certainly works.
> >
> > If that's correct, I'll try to create a patch for the PoC
>
> I still don't know how to reproduce Jim's troubles, but I was able to reproduce
> what was probably a very similar problem.
>
> This problem seems to be more likely to occur in cases where you have
> a large number of tables,
> i.e., a large amount of stats, and many small tables need VACUUM at
> the same time.
>
> So I followed Tom's advice and created a patch for the PoC.
> This patch will enable a flag in the table_recheck_autovac function to use
> the existing stats next time if VACUUM (or ANALYZE) has already been done
> by another worker on the check after the stats have been updated.
> If the tables continue to require VACUUM after the refresh, then a refresh
> will be required instead of using the existing statistics.
>
> I did simple test with HEAD and HEAD + this PoC patch.
> The tests were conducted in two cases.
> (I changed few configurations. see attached scripts)
>
> 1. Normal VACUUM case
> - SET autovacuum = off
> - CREATE tables with 100 rows
> - DELETE 90 rows for each tables
> - SET autovacuum = on and restart PostgreSQL
> - Measure the time it takes for all tables to be VACUUMed
>
> 2. Anti wrap round VACUUM case
> - CREATE brank tables
> - SELECT all of these tables (for generate stats)
> - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> - Consumes a lot of XIDs by using txid_curent()
> - Measure the time it takes for all tables to be VACUUMed
>
> For each test case, the following results were obtained by changing
> autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> Also changing num of tables to 1000, 5000, 10000 and 20000.
>
> Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
> but I think it's enough to ask for a trend.
>
> ===========================================================================
> [1.Normal VACUUM case]
> tables:1000
> autovacuum_max_workers 1: (HEAD) 20 sec VS (with patch) 20 sec
> autovacuum_max_workers 2: (HEAD) 18 sec VS (with patch) 16 sec
> autovacuum_max_workers 3: (HEAD) 18 sec VS (with patch) 16 sec
> autovacuum_max_workers 5: (HEAD) 19 sec VS (with patch) 17 sec
> autovacuum_max_workers 10: (HEAD) 19 sec VS (with patch) 17 sec
>
> tables:5000
> autovacuum_max_workers 1: (HEAD) 77 sec VS (with patch) 78 sec
> autovacuum_max_workers 2: (HEAD) 61 sec VS (with patch) 43 sec
> autovacuum_max_workers 3: (HEAD) 38 sec VS (with patch) 38 sec
> autovacuum_max_workers 5: (HEAD) 45 sec VS (with patch) 37 sec
> autovacuum_max_workers 10: (HEAD) 43 sec VS (with patch) 35 sec
>
> tables:10000
> autovacuum_max_workers 1: (HEAD) 152 sec VS (with patch) 153 sec
> autovacuum_max_workers 2: (HEAD) 119 sec VS (with patch) 98 sec
> autovacuum_max_workers 3: (HEAD) 87 sec VS (with patch) 78 sec
> autovacuum_max_workers 5: (HEAD) 100 sec VS (with patch) 66 sec
> autovacuum_max_workers 10: (HEAD) 97 sec VS (with patch) 56 sec
>
> tables:20000
> autovacuum_max_workers 1: (HEAD) 338 sec VS (with patch) 339 sec
> autovacuum_max_workers 2: (HEAD) 231 sec VS (with patch) 229 sec
> autovacuum_max_workers 3: (HEAD) 220 sec VS (with patch) 191 sec
> autovacuum_max_workers 5: (HEAD) 234 sec VS (with patch) 147 sec
> autovacuum_max_workers 10: (HEAD) 320 sec VS (with patch) 113 sec
>
> [2.Anti wrap round VACUUM case]
> tables:1000
> autovacuum_max_workers 1: (HEAD) 19 sec VS (with patch) 18 sec
> autovacuum_max_workers 2: (HEAD) 14 sec VS (with patch) 15 sec
> autovacuum_max_workers 3: (HEAD) 14 sec VS (with patch) 14 sec
> autovacuum_max_workers 5: (HEAD) 14 sec VS (with patch) 16 sec
> autovacuum_max_workers 10: (HEAD) 16 sec VS (with patch) 14 sec
>
> tables:5000
> autovacuum_max_workers 1: (HEAD) 69 sec VS (with patch) 69 sec
> autovacuum_max_workers 2: (HEAD) 66 sec VS (with patch) 47 sec
> autovacuum_max_workers 3: (HEAD) 59 sec VS (with patch) 37 sec
> autovacuum_max_workers 5: (HEAD) 39 sec VS (with patch) 28 sec
> autovacuum_max_workers 10: (HEAD) 39 sec VS (with patch) 29 sec
>
> tables:10000
> autovacuum_max_workers 1: (HEAD) 139 sec VS (with patch) 138 sec
> autovacuum_max_workers 2: (HEAD) 130 sec VS (with patch) 86 sec
> autovacuum_max_workers 3: (HEAD) 120 sec VS (with patch) 68 sec
> autovacuum_max_workers 5: (HEAD) 96 sec VS (with patch) 41 sec
> autovacuum_max_workers 10: (HEAD) 90 sec VS (with patch) 39 sec
>
> tables:20000
> autovacuum_max_workers 1: (HEAD) 313 sec VS (with patch) 331 sec
> autovacuum_max_workers 2: (HEAD) 209 sec VS (with patch) 201 sec
> autovacuum_max_workers 3: (HEAD) 227 sec VS (with patch) 141 sec
> autovacuum_max_workers 5: (HEAD) 236 sec VS (with patch) 88 sec
> autovacuum_max_workers 10: (HEAD) 309 sec VS (with patch) 74 sec
> ===========================================================================
>
> The cases without patch, the scalability of the worker has decreased
> as the number of tables has increased.
> In fact, the more workers there are, the longer it takes to complete
> VACUUM to all tables.
> The cases with patch, it shows good scalability with respect to the
> number of workers.
>
> Note that perf top results showed that hash_search_with_hash_value,
> hash_seq_search and
> pgstat_read_statsfiles are dominant during VACUUM in all patterns,
> with or without the patch.
>
> Therefore, there is still a need to find ways to optimize the reading
> of large amounts of stats.
> However, this patch is effective in its own right, and since there are
> only a few parts to modify,
> I think it should be able to be applied to current (preferably
> pre-v13) PostgreSQL.
>
> The patch and reproduce scripts were attached.
>
> Thoughts ?

Hi.

I ran the same test with a patch[1] that manages the statistics on
shared memory.
This patch is expected to reduce the burden of refreshing large
amounts of stats.

And the following results were obtained.
(The results for HEAD are the same as in my last post.)

========================================================================================
[1.Normal VACUUM case]
tables:1000
autovacuum_max_workers 1: (HEAD) 20 sec VS (with shared_base_stast
patch) 8 sec
autovacuum_max_workers 2: (HEAD) 18 sec VS (with shared_base_stast
patch) 8 sec
autovacuum_max_workers 3: (HEAD) 18 sec VS (with shared_base_stast
patch) 8 sec
autovacuum_max_workers 5: (HEAD) 19 sec VS (with shared_base_stast
patch) 9 sec
autovacuum_max_workers 10: (HEAD) 19 sec VS (with shared_base_stast
patch) 9 sec

tables:5000
autovacuum_max_workers 1: (HEAD) 77 sec VS (with shared_base_stast
patch) 13 sec
autovacuum_max_workers 2: (HEAD) 61 sec VS (with shared_base_stast
patch) 12 sec
autovacuum_max_workers 3: (HEAD) 38 sec VS (with shared_base_stast
patch) 13 sec
autovacuum_max_workers 5: (HEAD) 45 sec VS (with shared_base_stast
patch) 12 sec
autovacuum_max_workers 10: (HEAD) 43 sec VS (with shared_base_stast
patch) 12 sec

tables:10000
autovacuum_max_workers 1: (HEAD) 152 sec VS (with
shared_base_stast patch) 18 sec
autovacuum_max_workers 2: (HEAD) 119 sec VS (with
shared_base_stast patch) 25 sec
autovacuum_max_workers 3: (HEAD) 87 sec VS (with
shared_base_stast patch) 28 sec
autovacuum_max_workers 5: (HEAD) 100 sec VS (with
shared_base_stast patch) 28 sec
autovacuum_max_workers 10: (HEAD) 97 sec VS (with
shared_base_stast patch) 29 sec

tables:20000
autovacuum_max_workers 1: (HEAD) 338 sec VS (with
shared_base_stast patch) 27 sec
autovacuum_max_workers 2: (HEAD) 231 sec VS (with
shared_base_stast patch) 54 sec
autovacuum_max_workers 3: (HEAD) 220 sec VS (with
shared_base_stast patch) 67 sec
autovacuum_max_workers 5: (HEAD) 234 sec VS (with
shared_base_stast patch) 75 sec
autovacuum_max_workers 10: (HEAD) 320 sec VS (with
shared_base_stast patch) 83 sec

[2.Anti wrap round VACUUM case]
tables:1000
autovacuum_max_workers 1: (HEAD) 19 sec VS (with shared_base_stats
patch) 6 sec
autovacuum_max_workers 2: (HEAD) 14 sec VS (with shared_base_stats
patch) 7 sec
autovacuum_max_workers 3: (HEAD) 14 sec VS (with shared_base_stats
patch) 6 sec
autovacuum_max_workers 5: (HEAD) 14 sec VS (with shared_base_stats
patch) 6 sec
autovacuum_max_workers 10: (HEAD) 16 sec VS (with shared_base_stats
patch) 7 sec

tables:5000
autovacuum_max_workers 1: (HEAD) 69 sec VS (with shared_base_stats
patch) 8 sec
autovacuum_max_workers 2: (HEAD) 66 sec VS (with shared_base_stats
patch) 8 sec
autovacuum_max_workers 3: (HEAD) 59 sec VS (with shared_base_stats
patch) 8 sec
autovacuum_max_workers 5: (HEAD) 39 sec VS (with shared_base_stats
patch) 9 sec
autovacuum_max_workers 10: (HEAD) 39 sec VS (with shared_base_stats
patch) 8 sec

tables:10000
autovacuum_max_workers 1: (HEAD) 139 sec VS (with
shared_base_stats patch) 9 sec
autovacuum_max_workers 2: (HEAD) 130 sec VS (with
shared_base_stats patch) 9 sec
autovacuum_max_workers 3: (HEAD) 120 sec VS (with
shared_base_stats patch) 9 sec
autovacuum_max_workers 5: (HEAD) 96 sec VS (with
shared_base_stats patch) 8 sec
autovacuum_max_workers 10: (HEAD) 90 sec VS (with
shared_base_stats patch) 9 sec

tables:20000
autovacuum_max_workers 1: (HEAD) 313 sec VS (with
shared_base_stats patch) 12 sec
autovacuum_max_workers 2: (HEAD) 209 sec VS (with
shared_base_stats patch) 12 sec
autovacuum_max_workers 3: (HEAD) 227 sec VS (with
shared_base_stats patch) 12 sec
autovacuum_max_workers 5: (HEAD) 236 sec VS (with
shared_base_stats patch) 11 sec
autovacuum_max_workers 10: (HEAD) 309 sec VS (with
shared_base_stats patch) 12 sec
========================================================================================

This patch provided a very nice speedup in both cases.
However, in case 1, when the number of tables is large, there is an
increase in the time required
as the number of workers increases.
Whether this is due to CPU and IO conflicts or patch characteristics
is not yet known.
Nevertheless, at least the problems associated with
table_recheck_autovac() appear to have been resolved.

So, I hope that this patch [1] to be committed for its original purpose,
as well as to improve autovacuum of v14 and later.

The other patch I submitted (v1_mod_table_recheck_autovac.patch) is
useful for slight
improving autovacuum of PostgreSQL 13 and before.
Is it worth backporting this patch to current PostgreSQL 13 and earlier?

Best regards,

[1] https://www.postgresql.org/message-id/20200908.175557.617150409868541587.horikyota.ntt%40gmail.com

--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-09-10 09:37:09 Re: Strange behavior with polygon and NaN
Previous Message Amit Kapila 2020-09-10 09:17:44 Re: Bug in logical decoding of in-progress transactions