Re: [HACKERS] WAL logging problem in 9.4.3?

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-11 02:27:08
Message-ID: 20190311022708.GA2189728@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This has been waiting for a review since October, so I reviewed it. The code
comment at PendingRelSync summarizes the design well, and I like that design.
I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that. Let's keep pursuing your current design.

This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change. I feel that's
fine. (There already exist ways for COMMIT to be slow.)

On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> RelationOpenSmgr(wstate->index);
>
> - /* XLOG stuff */
> - if (wstate->btws_use_wal)
> + /* XLOG stuff
> + *
> + * Even if minimal mode, WAL is required here if truncation happened after
> + * being created in the same transaction. It is not needed otherwise but
> + * we don't bother identifying the case precisely.
> + */
> + if (wstate->btws_use_wal ||
> + (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))

We initialized "btws_use_wal" like this:

#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
#define RelationNeedsWAL(relation) \
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);

Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode. I've attached a test case for this; it is a patch that applies
on top of your v7 patches. The test checks for orphaned files after redo.

> + * If no tuple was inserted, it's possible that we are truncating a
> + * relation. We need to emit WAL for the metapage in the case. However it
> + * is not required elsewise,

Did you mean to write more words after that comma?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c

> + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> + * any subsequent actions on the same block either. Replaying the WAL record
> + * of the subsequent action might fail otherwise, as the "before" state of
> + * the block might not match, as the earlier actions were not WAL-logged.

Good point. To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL. There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.

I think this restriction applies only to full_page_writes=off. Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.

> +/*
> + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> + */
> +void
> +RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this? Currently, only
ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.

> +{
> + bool found;
> +
> + rel->pending_sync = NULL;
> + rel->no_pending_sync = true;
> + if (pendingSyncs)
> + {
> + elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> + hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> + }
> +}

We'd need a mechanism to un-remove the sync at subtransaction abort. My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.

> + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",

As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.

> +/*
> + * Sync to disk any relations that we skipped WAL-logging for earlier.
> + */
> +void
> +smgrDoPendingSyncs(bool isCommit)
> +{
> + if (!pendingSyncs)
> + return;
> +
> + if (isCommit)
> + {
> + HASH_SEQ_STATUS status;
> + PendingRelSync *pending;
> +
> + hash_seq_init(&status, pendingSyncs);
> +
> + while ((pending = hash_seq_search(&status)) != NULL)
> + {
> + if (pending->sync_above != InvalidBlockNumber)

I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.

> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c

> @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> /* which we mark as a reference-counted tupdesc */
> relation->rd_att->tdrefcount = 1;
>
> + /* We don't know if pending sync for this relation exists so far */
> + relation->pending_sync = NULL;
> + relation->no_pending_sync = false;

RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.

This is a nonstandard place to clear fields. Clear them in
load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
will then rely on palloc0() for implicit initialization.)

> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

> @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> MarkBufferDirty(buffer);
>
> /* XLOG stuff */
> - if (RelationNeedsWAL(relation))
> + if (BufferNeedsWAL(relation, buffer) ||
> + BufferNeedsWAL(relation, newbuf))

This is fine if both buffers need WAL or neither buffer needs WAL. It is not
fine when one buffer needs WAL and the other buffer does not. My attachment
includes a test case. Of the bugs I'm reporting, this one seems most
difficult to solve well.

> @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> * heap_sync - sync a heap, for use when no WAL has been written
> *
> * This forces the heap contents (including TOAST heap if any) down to disk.
> - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> - * relation down to disk before it's safe to commit the transaction. This
> - * requires writing out any dirty buffers and then doing a forced fsync.
> + * If we did any changes to the heap bypassing the buffer manager, we must
> + * force the relation down to disk before it's safe to commit the
> + * transaction, because the direct modifications will not be flushed by
> + * the next checkpoint.
> + *
> + * We used to also use this after batch operations like COPY and CLUSTER,
> + * if we skipped using WAL and WAL is otherwise needed, but there were
> + * corner-cases involving other WAL-logged operations to the same
> + * relation, where that was not enough. heap_register_sync() should be
> + * used for that purpose instead.

We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER). It would make the system simpler to understand if we
eliminated the old way. If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.

Thanks,
nm

Attachment Content-Type Size
wal-optimize-noah-tests-v1.patch text/plain 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-03-11 02:51:21 Re: pgsql: tableam: introduce table AM infrastructure.
Previous Message Nagaura, Ryohei 2019-03-11 02:24:58 RE: Timeout parameters