smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Daniel Wood <hexexpert(at)comcast(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: smgr vs DropRelFileNodeBuffers() vs filesystem state vs no critical section
Date: 2019-12-07 00:12:32
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


[1] made me notice these issues. The issues here are mostly independent,
but it's still worthwhile to read that thread - in particular because my
proposed solution for the problem is possibly somewhat related to this
issue. And, if we were to go for my more extreme proposal below, the fix
for this issue would probably also fix [1].

There's several smgr operations (e.g. smgrtruncate(), smgrdounlinkall())
that first do a a DropRelFileNodesAllBuffers() and then perform the
filesystem operation. Without a critical section.

As far as I can tell that's not acceptable.

Using smgrtruncate() as an example (because it's executed from backends
/ autovacuum rather than e.g. checkpointer, and because it's targeted at
relations that live past the current transaction):

DropRelFileNodeBuffers() throws away valid buffer contents. Therefore,
we'll be in a corrupt state should smgr_truncate fail. The on-disk state
will be the old file contents, but the in-memory state will not reflect
that. Consider e.g. the case that autovacuum pruned away all tuples from
a relation. That will result in pg_class.relfrozenxid being updated,
reflecting the new horizon. If we then do a DropRelFileNodeBuffers()
throwing away the modified buffer contents, but subsequently fail to
truncate the underlying file, we'll potentially have a lot of pages full
of tuples referencing xids that are older than relfrozenxid. Which scans
will try to return, as the underlying file is the original size.

As far as I can tell it, as things stand, is simply never ok to do
DropRelFileNodeBuffers() - when there potentially are dirty pages -
outside of a critical section.

RelationTruncate() notes:
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.

but I think this analysis is quite insufficient. As far as I can tell it
doesn't consider the issue outlined above, nor does it consider what
will happen if standbys/PITR will replay the WAL records, but the
primary considers the relation to be of a different length.

It seems to me that we either need to

a) write out all dirty buffers during DropRelFileNodeBuffers(), thereby
preventing the issue of old page contents "coming back to live" after
a failed truncation.
b) accept using a critical section, with the obvious consequence that
failing to truncate would lead to a PANIC
c) use a more complex protocol to invalidate buffers, ensuring there's
no inconsistency between fs and shared_buffers.

c) could e.g. be something like

1) mark all buffers as about-to-be-dropped
2) CacheInvalidateSmgr()
3) truncate on filesystem level
4a) if that fails, remove the about-to-be-dropped flags, in a PG_CATCH block
4b) if that succeeds, fully remove the buffers to be dropped

When another backend wants to use a buffer marked as about-to-be-dropped
it would need to wait for that operation to finish (this afaict could
neither work the way StartBufferIO() does, nor the way
LockBufferForCleanup() does).

Such a scheme would have the advantage that truncation etc would behave
a lot more like normal buffer modifications. There'd e.g. afterwards be
an interlock between truncations and BufferSync() / checkpoints, which
imo is quite attractive.

On the other hand, the cost of DropRelFileNodeBuffers() needing to
perform an exhaustive search of shared_buffers, would be quite
painful. The single DropRelFileNodeBuffers() call is already bad
enough. I guess we could amortize that slightly, by caching the buffer
locations for a limited number of buffers - in many cases where the
search cost are problematic there won't be too many pages for individual
relations present.

For me alternative a) is prohibitively expensive, and not worth
discussing much. I think I can live with b), but I know that I'm much
less concerned with PANICs in these types of situations than others. c)
seems worth investigating, but presumably would end up being too
complicated to backpatch.

I think several DropRelFileNodeBuffers() callers besides vacuum are a
bit less concerning. E.g. when dropping a relation we do so as part of a
checkpoint, which'll trigger a PANIC IIRC. And the in-place truncation
case for TRUNCATE IIRC only applies to relations created in the same
subtransaction, which makes the failure scenario above largely moot IIRC.

Tom, I seem to recall a recent thread of yours discussing a different
approach to truncation. I wonder if there's some intersection with
that. But unfortunately my search somehow has come up with nothing so
far - do you remember enough to find the thread?


Andres Freund



Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2019-12-07 00:32:32 Re: ssl passphrase callback
Previous Message Tom Lane 2019-12-06 23:31:45 Re: Windows buildfarm members vs. new async-notify isolation test