Quick Links

Re: autovac issue with large number of tables

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Kasahara Tatsuhito <kasahara(dot)tatsuhito(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <nasbyj(at)amazon(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: autovac issue with large number of tables
Date:	2020-12-03 12:09:32
Message-ID:	4425d4ef-4ae8-95ce-3fa6-81206ee12e02@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2020/12/03 11:46, Kasahara Tatsuhito wrote:
> On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>
>> On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>>
>>>
>>> On 2020/12/02 12:53, Masahiko Sawada wrote:
>>>> On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>>
>>>>> On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/12/01 16:23, Masahiko Sawada wrote:
>>>>>>> On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
>>>>>>> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2020/11/30 10:43, Masahiko Sawada wrote:
>>>>>>>>>> On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
>>>>>>>>>> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, Thanks for you comments.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
>>>>>>>>>>>>>>>> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
>>>>>>>>>>>>>>>>>> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
>>>>>>>>>>>>>>>>>>> <kasahara(dot)tatsuhito(at)gmail(dot)com> wrote:
>>>>>>>>>>>>>>>>>>>>> I wonder if we could have table_recheck_autovac do two probes of the stats
>>>>>>>>>>>>>>>>>>>>> data. First probe the existing stats data, and if it shows the table to
>>>>>>>>>>>>>>>>>>>>> be already vacuumed, return immediately. If not, *then* force a stats
>>>>>>>>>>>>>>>>>>>>> re-read, and check a second time.
>>>>>>>>>>>>>>>>>>>> Does the above mean that the second and subsequent table_recheck_autovac()
>>>>>>>>>>>>>>>>>>>> will be improved to first check using the previous refreshed statistics?
>>>>>>>>>>>>>>>>>>>> I think that certainly works.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If that's correct, I'll try to create a patch for the PoC
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I still don't know how to reproduce Jim's troubles, but I was able to reproduce
>>>>>>>>>>>>>>>>>>> what was probably a very similar problem.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This problem seems to be more likely to occur in cases where you have
>>>>>>>>>>>>>>>>>>> a large number of tables,
>>>>>>>>>>>>>>>>>>> i.e., a large amount of stats, and many small tables need VACUUM at
>>>>>>>>>>>>>>>>>>> the same time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So I followed Tom's advice and created a patch for the PoC.
>>>>>>>>>>>>>>>>>>> This patch will enable a flag in the table_recheck_autovac function to use
>>>>>>>>>>>>>>>>>>> the existing stats next time if VACUUM (or ANALYZE) has already been done
>>>>>>>>>>>>>>>>>>> by another worker on the check after the stats have been updated.
>>>>>>>>>>>>>>>>>>> If the tables continue to require VACUUM after the refresh, then a refresh
>>>>>>>>>>>>>>>>>>> will be required instead of using the existing statistics.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I did simple test with HEAD and HEAD + this PoC patch.
>>>>>>>>>>>>>>>>>>> The tests were conducted in two cases.
>>>>>>>>>>>>>>>>>>> (I changed few configurations. see attached scripts)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Normal VACUUM case
>>>>>>>>>>>>>>>>>>> - SET autovacuum = off
>>>>>>>>>>>>>>>>>>> - CREATE tables with 100 rows
>>>>>>>>>>>>>>>>>>> - DELETE 90 rows for each tables
>>>>>>>>>>>>>>>>>>> - SET autovacuum = on and restart PostgreSQL
>>>>>>>>>>>>>>>>>>> - Measure the time it takes for all tables to be VACUUMed
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Anti wrap round VACUUM case
>>>>>>>>>>>>>>>>>>> - CREATE brank tables
>>>>>>>>>>>>>>>>>>> - SELECT all of these tables (for generate stats)
>>>>>>>>>>>>>>>>>>> - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
>>>>>>>>>>>>>>>>>>> - Consumes a lot of XIDs by using txid_curent()
>>>>>>>>>>>>>>>>>>> - Measure the time it takes for all tables to be VACUUMed
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For each test case, the following results were obtained by changing
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
>>>>>>>>>>>>>>>>>>> Also changing num of tables to 1000, 5000, 10000 and 20000.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
>>>>>>>>>>>>>>>>>>> but I think it's enough to ask for a trend.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ===========================================================================
>>>>>>>>>>>>>>>>>>> [1.Normal VACUUM case]
>>>>>>>>>>>>>>>>>>> tables:1000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 20 sec VS (with patch) 20 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 18 sec VS (with patch) 16 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 18 sec VS (with patch) 16 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 19 sec VS (with patch) 17 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 19 sec VS (with patch) 17 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:5000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 77 sec VS (with patch) 78 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 61 sec VS (with patch) 43 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 38 sec VS (with patch) 38 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 45 sec VS (with patch) 37 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 43 sec VS (with patch) 35 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:10000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 152 sec VS (with patch) 153 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 119 sec VS (with patch) 98 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 87 sec VS (with patch) 78 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 100 sec VS (with patch) 66 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 97 sec VS (with patch) 56 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:20000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 338 sec VS (with patch) 339 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 231 sec VS (with patch) 229 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 220 sec VS (with patch) 191 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 234 sec VS (with patch) 147 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 320 sec VS (with patch) 113 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [2.Anti wrap round VACUUM case]
>>>>>>>>>>>>>>>>>>> tables:1000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 19 sec VS (with patch) 18 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 14 sec VS (with patch) 15 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 14 sec VS (with patch) 14 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 14 sec VS (with patch) 16 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 16 sec VS (with patch) 14 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:5000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 69 sec VS (with patch) 69 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 66 sec VS (with patch) 47 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 59 sec VS (with patch) 37 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 39 sec VS (with patch) 28 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 39 sec VS (with patch) 29 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:10000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 139 sec VS (with patch) 138 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 130 sec VS (with patch) 86 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 120 sec VS (with patch) 68 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 96 sec VS (with patch) 41 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 90 sec VS (with patch) 39 sec
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> tables:20000
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 1: (HEAD) 313 sec VS (with patch) 331 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 2: (HEAD) 209 sec VS (with patch) 201 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 3: (HEAD) 227 sec VS (with patch) 141 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 5: (HEAD) 236 sec VS (with patch) 88 sec
>>>>>>>>>>>>>>>>>>> autovacuum_max_workers 10: (HEAD) 309 sec VS (with patch) 74 sec
>>>>>>>>>>>>>>>>>>> ===========================================================================
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The cases without patch, the scalability of the worker has decreased
>>>>>>>>>>>>>>>>>>> as the number of tables has increased.
>>>>>>>>>>>>>>>>>>> In fact, the more workers there are, the longer it takes to complete
>>>>>>>>>>>>>>>>>>> VACUUM to all tables.
>>>>>>>>>>>>>>>>>>> The cases with patch, it shows good scalability with respect to the
>>>>>>>>>>>>>>>>>>> number of workers.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It seems a good performance improvement even without the patch of
>>>>>>>>>>>>>>>>>> shared memory based stats collector.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sounds great!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note that perf top results showed that hash_search_with_hash_value,
>>>>>>>>>>>>>>>>>>> hash_seq_search and
>>>>>>>>>>>>>>>>>>> pgstat_read_statsfiles are dominant during VACUUM in all patterns,
>>>>>>>>>>>>>>>>>>> with or without the patch.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Therefore, there is still a need to find ways to optimize the reading
>>>>>>>>>>>>>>>>>>> of large amounts of stats.
>>>>>>>>>>>>>>>>>>> However, this patch is effective in its own right, and since there are
>>>>>>>>>>>>>>>>>>> only a few parts to modify,
>>>>>>>>>>>>>>>>>>> I think it should be able to be applied to current (preferably
>>>>>>>>>>>>>>>>>>> pre-v13) PostgreSQL.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + /* We might be better to refresh stats */
>>>>>>>>>>>>>>>>>> + use_existing_stats = false;
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> + else
>>>>>>>>>>>>>>>>>> + {
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - heap_freetuple(classTup);
>>>>>>>>>>>>>>>>>> + heap_freetuple(classTup);
>>>>>>>>>>>>>>>>>> + /* The relid has already vacuumed, so we might be better to
>>>>>>>>>>>>>>>>>> use exiting stats */
>>>>>>>>>>>>>>>>>> + use_existing_stats = true;
>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With that patch, the autovacuum process refreshes the stats in the
>>>>>>>>>>>>>>>>>> next check if it finds out that the table still needs to be vacuumed.
>>>>>>>>>>>>>>>>>> But I guess it's not necessarily true because the next table might be
>>>>>>>>>>>>>>>>>> vacuumed already. So I think we might want to always use the existing
>>>>>>>>>>>>>>>>>> for the first check. What do you think?
>>>>>>>>>>>>>>>>> Thanks for your comment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we assume the case where some workers vacuum on large tables
>>>>>>>>>>>>>>>>> and a single worker vacuum on small tables, the processing
>>>>>>>>>>>>>>>>> performance of the single worker will be slightly lower if the
>>>>>>>>>>>>>>>>> existing statistics are checked every time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In fact, at first I tried to check the existing stats every time,
>>>>>>>>>>>>>>>>> but the performance was slightly worse in cases with a small number of workers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have this benchmark result?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (Checking the existing stats is lightweight , but at high frequency,
>>>>>>>>>>>>>>>>> it affects processing performance.)
>>>>>>>>>>>>>>>>> Therefore, at after refresh statistics, determine whether autovac
>>>>>>>>>>>>>>>>> should use the existing statistics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, since the test you used uses a lot of small tables, if there are
>>>>>>>>>>>>>>>> a few workers, checking the existing stats is unlikely to return true
>>>>>>>>>>>>>>>> (no need to vacuum). So the cost of existing stats check ends up being
>>>>>>>>>>>>>>>> overhead. Not sure how slow always checking the existing stats was,
>>>>>>>>>>>>>>>> but given that the shared memory based stats collector patch could
>>>>>>>>>>>>>>>> improve the performance of refreshing stats, it might be better not to
>>>>>>>>>>>>>>>> check the existing stats frequently like the patch does. Anyway, I
>>>>>>>>>>>>>>>> think it’s better to evaluate the performance improvement with other
>>>>>>>>>>>>>>>> cases too.
>>>>>>>>>>>>>>> Yeah, I would like to see how much the performance changes in other cases.
>>>>>>>>>>>>>>> In addition, if the shared-based-stats patch is applied, we won't need to reload
>>>>>>>>>>>>>>> a huge stats file, so we will just have to check the stats on
>>>>>>>>>>>>>>> shared-mem every time.
>>>>>>>>>>>>>>> Perhaps the logic of table_recheck_autovac could be simpler.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> BTW, I found some typos in comments, so attache a fixed version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The patch adds some duplicated codes into table_recheck_autovac().
>>>>>>>>>>>>>> It's better to make the common function performing them and make
>>>>>>>>>>>>>> table_recheck_autovac() call that common function, to simplify the code.
>>>>>>>>>>>>> Thanks for your comment.
>>>>>>>>>>>>> Hmm.. I've cut out the duplicate part.
>>>>>>>>>>>>> Attach the patch.
>>>>>>>>>>>>> Could you confirm that it fits your expecting?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, thanks for updataing the patch! Here are another review comments.
>>>>>>>>>>>>
>>>>>>>>>>>> + shared = pgstat_fetch_stat_dbentry(InvalidOid);
>>>>>>>>>>>> + dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>>>>>>>>>>>>
>>>>>>>>>>>> When using the existing stats, ISTM that these are not necessary and
>>>>>>>>>>>> we can reuse "shared" and "dbentry" obtained before. Right?
>>>>>>>>>>> Yeah, but unless autovac_refresh_stats() is called, these functions
>>>>>>>>>>> read the information from the
>>>>>>>>>>> local hash table without re-read the stats file, so the process is very light.
>>>>>>>>>>> Therefore, I think, it is better to keep the current logic to keep the
>>>>>>>>>>> code simple.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> + /* We might be better to refresh stats */
>>>>>>>>>>>> + use_existing_stats = false;
>>>>>>>>>>>>
>>>>>>>>>>>> I think that we should add more comments about why it's better to
>>>>>>>>>>>> refresh the stats in this case.
>>>>>>>>>>>>
>>>>>>>>>>>> + /* The relid has already vacuumed, so we might be better to use existing stats */
>>>>>>>>>>>> + use_existing_stats = true;
>>>>>>>>>>>>
>>>>>>>>>>>> I think that we should add more comments about why it's better to
>>>>>>>>>>>> reuse the stats in this case.
>>>>>>>>>>> I added comments.
>>>>>>>>>>>
>>>>>>>>>>> Attache the patch.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thank you for updating the patch. Here are some small comments on the
>>>>>>>>>> latest (v4) patch.
>>>>>>>>>>
>>>>>>>>>> + * So if the last time we checked a table that was already vacuumed after
>>>>>>>>>> + * refres stats, check the current statistics before refreshing it.
>>>>>>>>>> + */
>>>>>>>>>>
>>>>>>>>>> s/refres/refresh/
>>>>>>>> Thanks! fixed.
>>>>>>>> Attached the patch.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> +/* Counter to determine if statistics should be refreshed */
>>>>>>>>>> +static bool use_existing_stats = false;
>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> I think 'use_existing_stats' can be declared within table_recheck_autovac().
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> While testing the performance, I realized that the statistics are
>>>>>>>>>> reset every time vacuumed one table, leading to re-reading the stats
>>>>>>>>>> file even if 'use_existing_stats' is true. Please refer that vacuum()
>>>>>>>>>> eventually calls AtEOXact_PgStat() which calls to
>>>>>>>>>> pgstat_clear_snapshot().
>>>>>>>>>
>>>>>>>>> Good catch!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I believe that's why the performance of the
>>>>>>>>>> method of always checking the existing stats wasn’t good as expected.
>>>>>>>>>> So if we save the statistics somewhere and use it for rechecking, the
>>>>>>>>>> results of the performance benchmark will differ between these two
>>>>>>>>>> methods.
>>>>>>>> Thanks for you checks.
>>>>>>>> But, if a worker did vacuum(), that means this worker had determined
>>>>>>>> need vacuum in the
>>>>>>>> table_recheck_autovac(). So, use_existing_stats set to false, and next
>>>>>>>> time, refresh stats.
>>>>>>>> Therefore I think the current patch is fine, as we want to avoid
>>>>>>>> unnecessary refreshing of
>>>>>>>> statistics before the actual vacuum(), right?
>>>>>>>
>>>>>>> Yes, you're right.
>>>>>>>
>>>>>>> When I benchmarked the performance of the method of always checking
>>>>>>> existing stats I edited your patch so that it sets use_existing_stats
>>>>>>> = true even if the second check is false (i.g., vacuum is needed).
>>>>>>> And the result I got was worse than expected especially in the case of
>>>>>>> a few autovacuum workers. But it doesn't evaluate the performance of
>>>>>>> that method rightly as the stats snapshot is cleared every time
>>>>>>> vacuum. Given you had similar results, I guess you used a similar way
>>>>>>> when evaluating it, is it right? If so, it’s better to fix this issue
>>>>>>> and see how the performance benchmark results will differ.
>>>>>>>
>>>>>>> For example, the results of the test case with 10000 tables and 1
>>>>>>> autovacuum worker I reported before was:
>>>>>>>
>>>>>>> 10000 tables:
>>>>>>> autovac_workers 1 : 158s,157s, 290s
>>>>>>>
>>>>>>> But after fixing that issue in the third method (always checking the
>>>>>>> existing stats), the results are:
>>>>>>
>>>>>> Could you tell me how you fixed that issue? You copied the stats to
>>>>>> somewhere as you suggested or skipped pgstat_clear_snapshot() as
>>>>>> I suggested?
>>>>>
>>>>> I used the way you suggested in this quick test; skipped
>>>>> pgstat_clear_snapshot().
>>>>>
>>>>>>
>>>>>> Kasahara-san seems not to like the latter idea because it might
>>>>>> cause bad side effect. So we should use the former idea?
>>>>>
>>>>> Not sure. I'm also concerned about the side effect but I've not checked yet.
>>>>>
>>>>> Since probably there is no big difference between the two ways in
>>>>> terms of performance I'm going to see how the performance benchmark
>>>>> result will change first.
>>>>
>>>> I've tested performance improvement again. From the left the execution
>>>> time of the current HEAD, Kasahara-san's patch, the method of always
>>>> checking the existing stats (using approach suggested by Fujii-san),
>>>> in seconds.
>>>>
>>>> 1000 tables:
>>>> autovac_workers 1 : 13s, 13s, 13s
>>>> autovac_workers 2 : 6s, 4s, 4s
>>>> autovac_workers 3 : 3s, 4s, 3s
>>>> autovac_workers 5 : 3s, 3s, 2s
>>>> autovac_workers 10: 2s, 3s, 2s
>>>>
>>>> 5000 tables:
>>>> autovac_workers 1 : 71s, 71s, 72s
>>>> autovac_workers 2 : 37s, 32s, 32s
>>>> autovac_workers 3 : 29s, 26s, 26s
>>>> autovac_workers 5 : 20s, 19s, 18s
>>>> autovac_workers 10: 13s, 8s, 8s
>>>>
>>>> 10000 tables:
>>>> autovac_workers 1 : 158s,157s, 159s
>>>> autovac_workers 2 : 80s, 53s, 78s
>>>> autovac_workers 3 : 75s, 67s, 67s
>>>> autovac_workers 5 : 61s, 42s, 42s
>>>> autovac_workers 10: 69s, 26s, 25s
>>>>
>>>> 20000 tables:
>>>> autovac_workers 1 : 379s, 380s, 389s
>>>> autovac_workers 2 : 236s, 232s, 233s
>>>> autovac_workers 3 : 222s, 181s, 182s
>>>> autovac_workers 5 : 212s, 132s, 139s
>>>> autovac_workers 10: 317s, 91s, 89s
>>>>
>>>> I don't see a big difference between Kasahara-san's patch and the
>>>> method of always checking the existing stats.
>>>
>>> Thanks for doing the benchmark!
>>>
>>> This benchmark result makes me think that we don't need to tweak
>>> AtEOXact_PgStat() and can use Kasahara-san approach.
>>> That's good news :)
>>
>> Yeah, given that all autovaucum workers have the list of tables to
>> vacuum in the same order in most cases, the assumption in
>> Kasahara-san’s patch that if a worker needs to vacuum a table it’s
>> unlikely that it will be able to skip the next table using the current
>> snapshot of stats makes sense to me.
>>
>> One small comment on v6 patch:
>>
>> + /* When we decide to do vacuum or analyze, the existing stats cannot
>> + * be reused in the next cycle because it's cleared at the end of vacuum
>> + * or analyze (by AtEOXact_PgStat()).
>> + */
>> + use_existing_stats = false;
>>
>> I think the comment should start on the second line (i.g., \n is
>> needed after /*).
> Oops, thanks.
> Fixed.

Thanks for updating the patch!

I applied the following cosmetic changes to the patch.
Attached is the updated version of the patch.
Coud you review this version?

- Ran pgindent to fix some warnings that "git diff --check"
reported on the patch.
- Made the order of arguments consistent between
recheck_relation_needs_vacanalyze and relation_needs_vacanalyze.
- Renamed the variable use_existing_stats to reuse_stats for simplicity.
- Added more comments.

Barring any objection, I'm thinking to commit this version.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v8_mod_table_recheck_autovac.patch	text/plain	4.9 KB

In response to

Re: autovac issue with large number of tables at 2020-12-03 02:46:09 from Kasahara Tatsuhito

Responses

Re: autovac issue with large number of tables at 2020-12-04 03:21:44 from Kasahara Tatsuhito

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2020-12-03 12:14:11	Re: SELECT INTO deprecation
Previous Message	Amit Kapila	2020-12-03 12:04:27	Remove incorrect assertion in reorderbuffer.c.