Too many autovacuum workers spawned during forced auto-vacuum

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Too many autovacuum workers spawned during forced auto-vacuum
Date: 2017-01-13 11:38:05
Message-ID: CAJ3gD9eWejf72HKquKSzax0r+epS=nAbQKNnykkMA0E8c+rMDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

In a server where autovacuum is disabled and its databases reach
autovacuum_freeze_max_age limit, an autovacuum is forced to prevent
xid wraparound issues. At this stage, when the server is loaded with a
lot of DML operations, an exceedingly high number of autovacuum
workers keep on getting spawned, and these do not do anything, and
then quit.

The issue is : Sometimes an auto-vaccuum worker A1 finds it has no
tables to vacuum because of another worker(s) B already concurrently
vacuuming the tables. Then this worker A1 calls
vac_update_datfrozenxid() in the end, which effectively requests for
new auto-vacuum (vac_update_datfrozenxid() -> vac_truncate_clog() ->
SetTransactionIdLimit -> Send PMSIGNAL_START_AUTOVAC_LAUNCHER signal).
*Immediately* a new auto-vacuum is spawned since it is an
xid-wraparound vacuum. This new autovacuum chooses the same earlier
database for the new worker A2, because the datfrozenxid's aren't yet
updated due to worker B still not done yet. Worker A2 finds that the
same worker(s) B are still vacuuming the same tables. A2 again calls
vac_update_datfrozenxid(), which leads to spawning another auto-vacuum
worker A3 , and the cycle keeps on repeating until B finishes
vacuuming and updates datfrozenid.

Steps to reproduce :

1. In a fresh cluster, create tables using attached create.sql.gz

2. Insert some data :
INSERT into pgbench_history select generate_series(1, 5402107, 1);
UPDATE pgbench_history set bid = 1, aid = 11100, delta = 500, mtime =
'2017/1/1', filler = '19:00' ;

3. Set these GUCs in postgresql.conf and restart the server :
autovacuum = off
autovacuum_freeze_max_age = 100000 # Make auto-vacuum start as early
as possible.

4. Run : "pgbench -n -c 5 -t 2000000" and watch the log file.

5. After the age(datfrozenxid) of the databases crosses
autovacuum_freeze_max_age value, after 1-2 minutes more, a number of
messages will be seen, such as this :
2017-01-13 14:50:12.304 IST [111811] LOG: autovacuum launcher started
2017-01-13 14:50:12.346 IST [111816] LOG: autovacuum launcher started
2017-01-13 14:50:12.825 IST [111818] LOG: autovacuum launcher started

I see around 70 messages per second.

=== Fix ===

For fixing the issue, one approach I thought was to make
do_start_worker() choose a different database after determining that
the earlier database is still being scanned by one of the workers in
AutoVacuumShmem->av_runningWorkers. The current logic is that if the
databases need xid-wraparound prevention, then we pick the one with
oldest datfrozenxid. So, for the fix, just choose the db with the
second oldest datfrozenxid if the oldest one is still being processed.
But this does not solve the problem. Once the other databases are
done, or if there's a single database, the workers will again go in a
loop with the same database.

Instead, the attached patch (prevent_useless_vacuums.patch) prevents
the repeated cycle by noting that there's no point in doing whatever
vac_update_datfrozenxid() does, if we didn't find anything to vacuum
and there's already another worker vacuuming the same database. Note
that it uses wi_tableoid field to check concurrency. It does not use
wi_dboid field to check for already-processing worker, because using
this field might cause each of the workers to think that there is some
other worker vacuuming, and eventually no one vacuums. We have to be
certain that the other worker has already taken a table to vacuum.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
prevent_useless_vacuums.patch application/octet-stream 1.7 KB
create.sql.gz application/x-gzip 673.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Tiikkaja 2017-01-13 11:55:58 Re: plpgsql - additional extra checks
Previous Message Fabien COELHO 2017-01-13 11:27:57 Re: pgbench - allow backslash continuations in \set expressions