| From: | Noah Misch <noah(at)leadboat(dot)com> | 
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> | 
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Unlogged relation copy is not fsync'd | 
| Date: | 2023-10-08 02:22:04 | 
| Message-ID: | 20231008022204.cc@rfd.leadboat.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
> 
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> > 
> > >	/*
> > >	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> > >	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
> > >	 * occurring during the copy has no way to flush the previously written
> > >	 * data to disk (indeed it won't know the new rel even exists).  A crash
> > >	 * later on would replay WAL from the checkpoint, therefore it wouldn't
> > >	 * replay our earlier WAL entries. If we do not fsync those pages here,
> > >	 * they might still not be on disk when the crash occurs.
> > >	 */
> > >	if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > >		smgrimmedsync(dst, forkNum);
> > 
> > Let's walk through each case one by one:
> > 
> > 1. Temporary table. No fsync() needed. This is correct.
> > 
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
> 
> Agreed.
> 
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
> 
> This has two subtypes:
> 
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
> 
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync().  (RelationCreateStorage() could start calling
> AddPendingSync() for this case.  I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers.  On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems.  Also,
> the number of distinguishable cases is unpleasantly high.)
> 
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> > 
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> > 
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
> 
> smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
> 
> > 5. Another checkpoint happens. It does *not* fsync the relation.
> 
> ... I think this will fsync the relation.  No?
> 
> > 6. Crash.
While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:
	/*
	 * If required, set up an init fork for an unlogged table so that it can
	 * be correctly reinitialized on restart.  Recovery may remove it while
	 * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
	 * record.  Therefore, logging is necessary even if wal_level=minimal.
	 */
	if (persistence == RELPERSISTENCE_UNLOGGED)
	{
		Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
			   rel->rd_rel->relkind == RELKIND_MATVIEW ||
			   rel->rd_rel->relkind == RELKIND_TOASTVALUE);
		smgrcreate(srel, INIT_FORKNUM, false);
		log_smgrcreate(newrlocator, INIT_FORKNUM);
	}
XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint.  XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08).  In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks.  Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | vignesh C | 2023-10-08 02:52:28 | Re: Invalidate the subscription worker in cases where a user loses their superuser status | 
| Previous Message | Andy Fan | 2023-10-08 01:33:50 | Re: Draft LIMIT pushdown to Append and MergeAppend patch |