Re: data corruption hazard in reorderbuffer.c

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: data corruption hazard in reorderbuffer.c
Date: 2021-07-16 08:21:46
Message-ID: 7d1ccf60-8a86-76b2-2831-d2b8a4554007@enterprisedb.com

On 7/16/21 1:07 AM, Mark Dilger wrote:
>
>
>> On Jul 15, 2021, at 3:32 PM, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> I think it's mostly futile to list all the possible issues this
>> might have caused - if you skip arbitrary decoded changes, that can
>> trigger pretty much any bug in reorder buffer. But those bugs can
>> be triggered by various other issues, of course.
>
> I thought that at first, too, but when I searched for bug reports
> with the given error messages, they all had to do with logical
> replication or logical decoding. That seems a bit fishy to me. If
> these bugs were coming from all over the system, why would that be
> so?
>

No, I'm not suggesting those bugs are coming from all over the system.
The theory is that this is a bug in logical decoding / reorder buffer
with the same symptoms, so it'd only show up on systems using logical
replication.

>> It's hard to say what was the cause, but the "logic" bugs are
>> probably permanent, while the issues triggered by I/O probably
>> disappear after a restart?
>
> If you mean "logic" bugs like passing an invalid file descriptor to
> close(), then yes, those would be permanent, but I don't believe we
> have any bugs of that kind.
>

No, by logic bugs I mean cases like failing to update the snapshot or
losing track of relfilenode changes because some corner case was not
considered, as opposed to "I/O errors" caused by external events.

> The consequences of I/O errors are not going to go away after
> restart. On the subscriber side, if logical replication has replayed
> less than a full transaction worth of changes without raising an
> error, the corruption will be permanent.
>

True, good point. I was thinking about the "could not map relfilenode"
error, which forces the decoding to restart, and on the retry it's
unlikely to hit the same I/O error, so it'll succeed. But you're right
that if the corruption reaches the subscriber and gets committed (e.g.
a missing row), it's permanent.

>> That being said, I agree this seems like an issue and we should not
>> ignore I/O errors.
>
> I agree.
>
>> I'd bet other places using transient files (like sorting or hashagg
>> spilling to disk) have the same issue, although in that case the
>> impact is likely limited to a single query.
>
> I'm not so convinced. In most places, the result is ignored for
> close()-type operations only when the prior operation failed and
> we're closing the handle in preparation for raising an error. There
> are a small number of other places, such as
> pg_import_system_collations and perform_base_backup, which ignore the
> result from closing a handle, but those are reading from the handle,
> not writing to it, so the situation is not comparable.
>
> I believe the oversight in reorderbuffer.c really is a special case.
>

Hmm, you might be right. I only briefly looked at buffile.c and
tuplestore.c yesterday, and I wasn't sure about the error checking. But
now that I look at it again, maybe it works fine, because we don't
really use the files after close().
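
(For reference, the pattern used elsewhere in the tree treats a failed
close of a file we have written as a hard error, roughly like the
sketch below. This is not the actual reorderbuffer.c code; "fd" and
"path" are illustrative:)

    /*
     * Sketch only: report a failed close of a written-to transient
     * file instead of silently discarding close()'s return value.
     */
    if (CloseTransientFile(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));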

>> I wonder if sync before the close is an appropriate solution,
>> though. It seems rather expensive, and those files are meant to be
>> "temporary" (i.e. we don't keep them over restart). So maybe we
>> could ensure the consistency in a cheaper way - perhaps tracking
>> some sort of checksum for each file, or something like that?
>
> I'm open to suggestions of what to do, but thinking of these files as
> temporary may be what misleads developers to believe they don't have
> to be treated carefully. The code was first committed in 2014 and as
> far as I am aware nobody else has complained about this before. In
> part that might be because CloseTemporaryFile() is less familiar than
> close() to most developers, who may assume it contains its own error
> handling and not realize that it returns an error code just like
> regular close() does.
>

I don't think anyone thinks corruption of temporary files would not be
an issue, but making them fully persistent (which is what fsync would
do) just to ensure no piece of them is lost seems like overkill to me.
And the file may get corrupted even with fsync(), so I'm wondering if
there's a way to ensure integrity without the fsync (i.e. more
cheaply).
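
(To be concrete, the fsync-based fix being discussed would look
roughly like this sketch; "fd" and "path" are illustrative, not the
actual reorderbuffer.c variables:)

    /*
     * Sketch only: flush the spill file before closing it, so a lost
     * write is detected, at the cost of a synchronous flush per file.
     */
    if (pg_fsync(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));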

The scheme I was thinking about is a simple checksum for each BufFile:
when writing a buffer to disk, update a crc32c checksum, and then
verify it after reading the file back (and error out if it disagrees).
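
Roughly like the following sketch, using the pg_crc32c.h macros (the
helper names and the "spill file" framing are hypothetical, not an
actual patch):

    #include "postgres.h"
    #include "port/pg_crc32c.h"

    /* Hypothetical helpers sketching per-file checksum tracking. */
    static pg_crc32c spill_crc;

    static void
    spill_crc_init(void)
    {
        INIT_CRC32C(spill_crc);
    }

    /* call for every buffer written to (or read back from) the file */
    static void
    spill_crc_update(const char *buffer, Size nbytes)
    {
        COMP_CRC32C(spill_crc, buffer, nbytes);
    }

    /* after reading the file back, compare against the stored value */
    static void
    spill_crc_verify(pg_crc32c expected)
    {
        FIN_CRC32C(spill_crc);
        if (!EQ_CRC32C(spill_crc, expected))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("checksum mismatch in logical decoding spill file")));
    }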

> The point of the reorder buffer is to sort the changes from a single
> transaction so that they can be replayed in order. The files used
> for the sorting are temporary, but there is nothing temporary about
> the failure to include all the changes in the files, as that will
> have permanent consequences for whoever replays them. If lucky, the
> attempt to replay the changes will abort because they don't make
> sense, but I've demonstrated to my own satisfaction that nothing
> prevents silent data loss if the failure to write changes happens to
> destroy a complete rather than partial change.
>

Sure, I agree with all of that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
