Re: Unlogged relation copy is not fsync'd

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged relation copy is not fsync'd
Date: 2023-09-15 11:47:45
Message-ID: 58effc10-c160-b4a6-4eb7-384e95e6f9e3@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05/09/2023 21:20, Robert Haas wrote:
> In other words, somehow it feels like we ought to be trying to defer
> the fsync here until a clean shutdown actually occurs, instead of
> performing it immediately.

+1

> Admittedly, the bookkeeping seems like a problem, so maybe this is
> the best we can do, but it's clearly worse than what we do in other
> cases.
I think we can do it, I have been working on a patch along those lines
on the side. But I want to focus on a smaller, backpatchable fix in this
thread.

Thinking about this some more, I think this is still not 100% correct,
even with the patch I posted earlier:

> /*
> * When we WAL-logged rel pages, we must nonetheless fsync them. The
> * reason is that since we're copying outside shared buffers, a CHECKPOINT
> * occurring during the copy has no way to flush the previously written
> * data to disk (indeed it won't know the new rel even exists). A crash
> * later on would replay WAL from the checkpoint, therefore it wouldn't
> * replay our earlier WAL entries. If we do not fsync those pages here,
> * they might still not be on disk when the crash occurs.
> */
> if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> smgrimmedsync(dst, forkNum);

Let's walk through each case one by one:

1. Temporary table. No fsync() needed. This is correct.

2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL,
and also because we bypassed the buffer manager. Correct.

3. Unlogged rel, init fork. Needs to be fsync'd, even though we
WAL-logged it, because we bypassed the buffer manager. Like the comment
explains. This is now correct with the patch.

4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
WAL-logged it, because we bypassed the buffer manager. Like the comment
explains. Correct.

5. Permanent rel, use_wal == false. We skip fsync, because it means that
it's a new relation, so we have a sync request pending for it. (An
assertion for that would be nice). At the end of transaction, in
smgrDoPendingSyncs(), we will either fsync it or we will WAL-log all the
pages if it was a small relation. I believe this is *wrong*. If
smgrDoPendingSyncs() decides to WAL-log the pages, we have the same race
condition that's explained in the comment, because we bypassed the
buffer manager:

1. RelationCopyStorage() copies some of the pages.
2. Checkpoint happens, which fsyncs the relation (smgrcreate()
registered a dirty segment when the relation was created)
3. RelationCopyStorage() copies the rest of the pages.
4. smgrDoPendingSyncs() WAL-logs all the pages.
5. Another checkpoint happens. It does *not* fsync the relation.
6. Crash.

WAL replay will not see the WAL-logged pages, because they were
WAL-logged before the last checkpoint. And the contents were not fsync'd
either.

In other words, we must do smgrimmedsync() here for permanent relations,
even if use_wal==false, because we bypassed the buffer manager. Same
reason we need to do it with use_wal==true.

For a backportable fix, I think we should change this to only exempt
temporary tables, and call smgrimmedsync() for all other cases. Maybe
would be safe to skip it in some cases, but it feels too dangerous to be
clever here. The other similar callers of smgrimmedsync() in nbtsort.c,
gistbuild.c, and rewriteheap.c, need similar treatment.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikita Malakhov 2023-09-15 12:13:10 Re: POC: Extension for adding distributed tracing - pg_tracing
Previous Message Aleksander Alekseev 2023-09-15 11:32:44 Re: POC: Extension for adding distributed tracing - pg_tracing