Unlogged relations and WAL-logging

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Unlogged relations and WAL-logging
Date: 2022-01-27 19:32:04
Message-ID: 6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Unlogged relations are not WAL-logged, but creating the init-fork is.
There are a few things around that seem sloppy:

1. In index_build(), we do this:

>
> /*
> * If this is an unlogged index, we may need to write out an init fork for
> * it -- but we must first check whether one already exists. If, for
> * example, an unlogged relation is truncated in the transaction that
> * created it, or truncated twice in a subsequent transaction, the
> * relfilenode won't change, and nothing needs to be done here.
> */
> if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
> {
> smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
> indexRelation->rd_indam->ambuildempty(indexRelation);
> }

Shouldn't we call log_smgrcreate() here? Creating the init fork is
otherwise not WAL-logged at all. In practice, all the ambuildempty()
implementations create and WAL-log a metapage and replay of that will
implicitly create the underlying relation. But if you created an index
access method whose ambuildempty() function does nothing, i.e. the init
fork is just an empty file, there would be no trace in the WAL about
creating the init fork, which is bad. In fact,
src/test/modules/dummy_index_am is an example of that, but because it
does nothing at all with the file, you cannot see any ill effect from
the missing init fork.

2. Some implementations of ambuildempty() use the buffer cache (hash,
gist, gin, brin), while others bypass it and call smgrimmedsync()
instead (btree, spgist, bloom). I don't see any particular reason for
those decisions, it seems to be based purely on which example the author
happened to copy-paste.

Using the buffer cache seems better to me, because it avoids the
smgrimmedsync() call. That makes creating unlogged indexes faster.
That's not significant for any real world application, but still.

3. Those ambuildempty implementations that bypass the buffer cache use
smgrwrite() to write the pages. That doesn't make any difference in
practice, but in principle it's wrong: You are supposed to use
smgrextend() when extending a relation. Using smgrwrite() skips updating
the relation size cache, which is harmless in this case because it
wasn't initialized previously either, and I'm not sure if we ever call
smgrnblocks() on the init-fork. But if you build with
CHECK_WRITE_VS_EXTEND, you get an assertion failure.

4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.

(I remember we had a discussion once whether we should always extend the
relation first, and WAL-log only after that, but I can't find the thread
right now. The point is to avoid the situation that the operation fails
because you run out of disk space, and then you crash and WAL replay
also fails because you are still out of disk space. But most places
currently call log_newpage() first, and smgrextend() after that.)

5. In heapam_relation_set_new_filenode(), we do this:

>
> /*
> * If required, set up an init fork for an unlogged table so that it can
> * be correctly reinitialized on restart. An immediate sync is required
> * even if the page has been logged, because the write did not go through
> * shared_buffers and therefore a concurrent checkpoint may have moved the
> * redo pointer past our xlog record. Recovery may as well remove it
> * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
> * record. Therefore, logging is necessary even if wal_level=minimal.
> */
> if (persistence == RELPERSISTENCE_UNLOGGED)
> {
> Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
> rel->rd_rel->relkind == RELKIND_MATVIEW ||
> rel->rd_rel->relkind == RELKIND_TOASTVALUE);
> smgrcreate(srel, INIT_FORKNUM, false);
> log_smgrcreate(newrnode, INIT_FORKNUM);
> smgrimmedsync(srel, INIT_FORKNUM);
> }

The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.

This is what actually led me to discover the bug I just reported at [1],
with regular tables. If we fix that bug I like I proposed there, then
smgrcreate() will register and fsync request, and we won't need
smgrimmedsync here anymore.

Attached is a patch to tighten those up. The third one should arguably
be part of [1], not this thread, but here it goes too.

[1]
https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi

- Heikki

Attachment Content-Type Size
0001-WAL-log-the-creation-of-the-init-fork-in-index_build.patch text/x-patch 1.3 KB
0002-Use-the-buffer-cache-when-initializing-an-unlogged-i.patch text/x-patch 10.0 KB
0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patch text/x-patch 2.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2022-01-27 19:33:46 Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
Previous Message Andres Freund 2022-01-27 19:17:13 Re: Creation of an empty table is not fsync'd at checkpoint