SimpleLruTruncate() mutual exclusion

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: SimpleLruTruncate() mutual exclusion
Date: 2019-02-18 07:31:03
Message-ID: 20190218073103.GA1434723@rfd.leadboat.com

I'm forking this thread from
https://postgr.es/m/flat/20190202083822(dot)GC32531(at)gust(dot)leadboat(dot)com, which
reported a race condition involving the "apparent wraparound" safety check in
SimpleLruTruncate():

On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> 1. The result of the test is valid only until we release the SLRU ControlLock,
> which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
> segments for deletion. Once we release that lock, latest_page_number can
> advance. This creates a TOCTOU race condition, allowing excess deletion:
>
> [local] test=# table trunc_clog_concurrency ;
> ERROR: could not access status of transaction 2149484247
> DETAIL: Could not open file "pg_xact/0801": No such file or directory.
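
For context, here is the ordering inside SimpleLruTruncate(), paraphrased
(not a verbatim excerpt of slru.c), to show where the window opens:

	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);

	/* "apparent wraparound" check, valid only while we hold the lock */
	if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
	{
		LWLockRelease(shared->ControlLock);
		ereport(LOG,
				(errmsg("could not truncate directory \"%s\": apparent wraparound",
						ctl->Dir)));
		return;
	}

	/* ... evict or write out SLRU buffers for pages behind the cutoff ... */

	LWLockRelease(shared->ControlLock);

	/*
	 * ControlLock is no longer held, but cutoffPage was validated above.
	 * latest_page_number can advance past it here, so the scan below may
	 * unlink a segment that is needed again -- the excess deletion shown
	 * above.
	 */
	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);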

> Fixes are available:

> b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
> AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> checkpoint, or in the startup process. Hence, also arrange for only one
> backend to call SimpleLruTruncate(AsyncCtl) at a time.

More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
backend per cluster. The attached patch does that; it was rather straightforward.
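
To make the shape of that change concrete, here is a rough sketch. The lock
choices below are placeholders of my own; the attached patch may use different
lock types or names:

	/*
	 * Sketch only -- not the attached patch.  It shows the scope of the two
	 * mutexes: one backend per database around vac_update_datfrozenxid(),
	 * one backend per cluster around vac_truncate_clog().
	 */
	#include "postgres.h"
	#include "catalog/pg_database.h"
	#include "miscadmin.h"
	#include "storage/lmgr.h"
	#include "storage/lwlock.h"

	/* A new named LWLock would serve here; declared only for the sketch. */
	extern LWLock *WrapLimitsVacuumLock;

	static void
	vac_update_datfrozenxid_sketch(void)
	{
		/* One backend per database: lock our pg_database row (placeholder). */
		LockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ExclusiveLock);

		/* ... scan pg_class, compute new datfrozenxid/datminmxid ... */
		/* ... update our pg_database row if the horizons advanced ... */

		/*
		 * One backend per cluster while computing the minimum datfrozenxid
		 * across databases and truncating pg_xact and pg_multixact.
		 * asyncQueueAdvanceTail() would take a similar cluster-wide lock.
		 */
		LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);
		/* vac_truncate_clog(...); */
		LWLockRelease(WrapLimitsVacuumLock);

		UnlockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ExclusiveLock);
	}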

I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of the VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, autovacuum has this logic:

 * ... we skip
 * this if (1) we found no work to do and (2) we skipped at least one
 * table due to concurrent autovacuum activity. In that case, the other
 * worker has already done it, or will do so when it finishes.
 */
if (did_vacuum || !found_concurrent_worker)
	vac_update_datfrozenxid();

That makes me relatively unworried. I did consider some alternatives:

- Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If we
find the need for pg_database updates, take the lock and scan pg_class again
to get final numbers. This doubles the work in cases that end up taking the
lock, so I'm not betting on it being a net win.

- Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
backend is already waiting (a sketch follows this list). This is similar to
autovacuum's found_concurrent_worker test. It is tempting. I'm not proposing it,
because it changes the states possible when manual VACUUM completes. Today,
you can predict post-VACUUM datfrozenxid from post-VACUUM relfrozenxid
values. If manual VACUUM could skip vac_update_datfrozenxid() this way,
datfrozenxid could lag until some concurrent VACUUM finishes.
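
Here is a quick sketch of that rejected LockWaiterCount() idea, under
assumptions of my own: that the mutex is a heavyweight lock, and that its
LOCKTAG is the current database's pg_database row (illustrative only, not
taken from the attached patch):

	#include "postgres.h"
	#include "catalog/pg_database.h"
	#include "miscadmin.h"
	#include "storage/lock.h"

	/*
	 * Return false if another backend is already queued for the same lock;
	 * that backend will recompute datfrozenxid after us anyway, so we could
	 * skip the work.
	 */
	static bool
	datfrozenxid_update_worth_doing(void)
	{
		LOCKTAG		tag;

		/* Illustrative tag: the lock guarding this database's horizons. */
		SET_LOCKTAG_OBJECT(tag, InvalidOid, DatabaseRelationId,
						   MyDatabaseId, 0);

		return LockWaiterCount(&tag) == 0;
	}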

Thanks,
nm

Attachment: wrap-limits-mutex-v1.patch (text/plain, 13.1 KB)
