Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-20 08:17:54
Message-ID: 20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for reviewing!

At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> This has been waiting for a review since October, so I reviewed it. The code
> comment at PendingRelSync summarizes the design well, and I like that design.

It is Michael's work.

> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch. I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that. Let's keep pursuing your current design.

I must admit that this is complex..

> This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
> to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
> log_min_duration_statement analysis will reflect the change. I feel that's
> fine. (There already exist ways for COMMIT to be slow.)
>
> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > --- a/src/backend/access/nbtree/nbtsort.c
> > +++ b/src/backend/access/nbtree/nbtsort.c
> > @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> > RelationOpenSmgr(wstate->index);
> >
> > - /* XLOG stuff */
> > - if (wstate->btws_use_wal)
> > + /* XLOG stuff
> > + *
> > + * Even if minimal mode, WAL is required here if truncation happened after
> > + * being created in the same transaction. It is not needed otherwise but
> > + * we don't bother identifying the case precisely.
> > + */
> > + if (wstate->btws_use_wal ||
> > + (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
>
> We initialized "btws_use_wal" like this:
>
> #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
> #define RelationNeedsWAL(relation) \
> ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
> wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
>
> Hence, this change causes us to emit WAL for the metapage of a
> RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
> that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> relfilenode. I've attached a test case for this; it is a patch that applies
> on top of your v7 patches. The test checks for orphaned files after redo.

Oops! Added RelationNeedsWAL(index) there. (Attched 1st patch on
top of this patchset)

> > + * If no tuple was inserted, it's possible that we are truncating a
> > + * relation. We need to emit WAL for the metapage in the case. However it
> > + * is not required elsewise,
>
> Did you mean to write more words after that comma?

Sorry, it is just a garbage. Required work is done in
_bt_blwritepage.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
>
> > + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> > + * any subsequent actions on the same block either. Replaying the WAL record
> > + * of the subsequent action might fail otherwise, as the "before" state of
> > + * the block might not match, as the earlier actions were not WAL-logged.
>
> Good point. To participate in WAL redo properly, each "before" state must
> have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
> skips WAL, but an INSERT later in the same transaction writes WAL. There,
> however, each "before" state does have a distinct pd_lsn; the initial build
> has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
> Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
> doesn't conform to this code comment.

(The NB is Michael's work.)
Yes. Btree works differently from heap. Thak you for confirmation.

> I think this restriction applies only to full_page_writes=off. Otherwise, the
> first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
> a full-page image in the record, the block's "before" state doesn't matter.
> Also, one could make it safe to write WAL for a particular block by issuing
> heap_sync() for the block's relation.

Umm.. Once truncate happens, WAL is emitted for all pages. If we
decide to skip WALs on copy or similar bulk operations, WALs are
not emitted at all, including XLOG_HEAP_INIT_PAGE. So that
doesn't happen. The unlogged data is synced at commit time.

> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
>
> What is the coding rule for deciding when to call this? Currently, only
> ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.
> > +{
> > + bool found;
> > +
> > + rel->pending_sync = NULL;
> > + rel->no_pending_sync = true;
> > + if (pendingSyncs)
> > + {
> > + elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> > + hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> > + }
> > +}
>
> We'd need a mechanism to un-remove the sync at subtransaction abort. My
> attachment includes a test case demonstrating the consequences of that defect.
> Please look for other areas that need to know about subtransactions; patch v7
> had no code pertaining to subtransactions.

Agreed It forgets about subtransaction rollbacks. I'll make
RelationRemovePendingSync just mark as "removed" and make
ROLLBACK TO and RELEASE process the flag make it work. (Attached
2nd patch on top of thie patchset)

>
> > + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
>
> As you mention upthread, you have many debugging elog()s. These are too
> detailed to include in every binary, but I do want them in the code. See
> CACHE_elog() for a good example of achieving that.

Agreed will do. They were need to check the behavior precisely
but usually not needed.

> > +/*
> > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > + */
> > +void
> > +smgrDoPendingSyncs(bool isCommit)
> > +{
> > + if (!pendingSyncs)
> > + return;
> > +
> > + if (isCommit)
> > + {
> > + HASH_SEQ_STATUS status;
> > + PendingRelSync *pending;
> > +
> > + hash_seq_init(&status, pendingSyncs);
> > +
> > + while ((pending = hash_seq_search(&status)) != NULL)
> > + {
> > + if (pending->sync_above != InvalidBlockNumber)
>
> I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> InvalidBlockNumber" are not sync requests at all. Those just record the fact
> of a RelationTruncate() happening. If you can think of a way to improve that,
> please do so. If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so no need to sync. Does this make sense for
you? (Maybe commit is needed there)

> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
>
> > @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> > /* which we mark as a reference-counted tupdesc */
> > relation->rd_att->tdrefcount = 1;
> >
> > + /* We don't know if pending sync for this relation exists so far */
> > + relation->pending_sync = NULL;
> > + relation->no_pending_sync = false;
>
> RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> prefix to these fields.
> This is a nonstandard place to clear fields. Clear them in
> load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
> will then rely on palloc0() for implicit initialization.)

Agreed, will do in the next version.

> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
>
> > @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> > MarkBufferDirty(buffer);
> >
> > /* XLOG stuff */
> > - if (RelationNeedsWAL(relation))
> > + if (BufferNeedsWAL(relation, buffer) ||
> > + BufferNeedsWAL(relation, newbuf))
>
> This is fine if both buffers need WAL or neither buffer needs WAL. It is not
> fine when one buffer needs WAL and the other buffer does not. My attachment
> includes a test case. Of the bugs I'm reporting, this one seems most
> difficult to solve well.

Yeah, it is right (and it's rather silly). Thank you for
pointing out. Will fix.

> > @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> > * heap_sync - sync a heap, for use when no WAL has been written
> > *
> > * This forces the heap contents (including TOAST heap if any) down to disk.
> > - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> > - * relation down to disk before it's safe to commit the transaction. This
> > - * requires writing out any dirty buffers and then doing a forced fsync.
> > + * If we did any changes to the heap bypassing the buffer manager, we must
> > + * force the relation down to disk before it's safe to commit the
> > + * transaction, because the direct modifications will not be flushed by
> > + * the next checkpoint.
> > + *
> > + * We used to also use this after batch operations like COPY and CLUSTER,
> > + * if we skipped using WAL and WAL is otherwise needed, but there were
> > + * corner-cases involving other WAL-logged operations to the same
> > + * relation, where that was not enough. heap_register_sync() should be
> > + * used for that purpose instead.
>
> We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> eliminated the old way. If that creates more problems than it solves, please
> at least write down a coding rule to explain why certain commands shouldn't
> use the old way.

Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.

I don't have enough time for now so the new version will be
posted early next week.

Thanks you for the review!

regards.

Attachment Content-Type Size
pending_sync_nbtsort_fix.patch text/x-patch 1.2 KB
pending_sync_fix_tblsp_subxact.patch text/x-patch 2.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-03-20 08:20:37 Re: speeding up planning with partitions
Previous Message Fabien COELHO 2019-03-20 08:00:54 Re: Offline enabling/disabling of data checksums