Re: Unlogged relation copy is not fsync'd

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged relation copy is not fsync'd
Date: 2023-10-08 02:22:04
Message-ID: 20231008022204.cc@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
>
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> >
> > > /*
> > > * When we WAL-logged rel pages, we must nonetheless fsync them. The
> > > * reason is that since we're copying outside shared buffers, a CHECKPOINT
> > > * occurring during the copy has no way to flush the previously written
> > > * data to disk (indeed it won't know the new rel even exists). A crash
> > > * later on would replay WAL from the checkpoint, therefore it wouldn't
> > > * replay our earlier WAL entries. If we do not fsync those pages here,
> > > * they might still not be on disk when the crash occurs.
> > > */
> > > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > > smgrimmedsync(dst, forkNum);
> >
> > Let's walk through each case one by one:
> >
> > 1. Temporary table. No fsync() needed. This is correct.
> >
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
>
> Agreed.
>
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
>
> This has two subtypes:
>
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
>
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync(). (RelationCreateStorage() could start calling
> AddPendingSync() for this case. I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers. On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems. Also,
> the number of distinguishable cases is unpleasantly high.)
>
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> >
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> >
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
>
> smgrDoPendingSyncs() delegates to log_newpage_range(). log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
>
> > 5. Another checkpoint happens. It does *not* fsync the relation.
>
> ... I think this will fsync the relation. No?
>
> > 6. Crash.

While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:

/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. Recovery may remove it while
* replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrlocator, INIT_FORKNUM);
}

XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint. XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08). In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks. Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2023-10-08 02:52:28 Re: Invalidate the subscription worker in cases where a user loses their superuser status
Previous Message Andy Fan 2023-10-08 01:33:50 Re: Draft LIMIT pushdown to Append and MergeAppend patch