Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-20 08:31:43
Message-ID: 20191120.173143.1442042654954107403.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.

I looked the version.

> Notable changes in v24nm:
>
> - Wrote section "Skipping WAL for New RelFileNode" in
> src/backend/access/transam/README to be the main source concerning the new
> coding rules.

Thanks for writing this.

+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.

Even using these methods, TransactionCommit flushes out buffers then
sync files again. Isn't a description something like the following
needed?

===
Even an access method switched a in-transaction created relfilenode to
WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
file then smgrimmedsync() the file.
===

> - Updated numerous comments and doc sections.
>
> - Eliminated the pendingSyncs list in favor of a "sync" field in
> pendingDeletes. I mostly did this to eliminate the possibility of the lists
> getting out of sync. This removed considerable parallel code for managing a
> second list at end-of-xact. We now call smgrDoPendingSyncs() only when
> committing or preparing a top-level transaction.

Mmm. Right. The second list was a trace of older versions, maybe that
needed additional works at rollback. Actually as of v23 the function
syncs no files at rollback. It is wiser to merging the two.

> - Whenever code sets an rd_*Subid field of a Relation, it must call
> EOXactListAdd(). swap_relation_files() was not doing so, so the field
> remained set during the next transaction. I introduced
> RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
> so it also affects the mapped relation case.

Ugh.. Thanks for pointing out. By the way

+ /*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);

It cannot be accessed from other sessions. Theoretically it doesn't
need a lock but NoLock cannot be used there since there's a path that
doesn't take lock on the relation. But AEL seems too strong and it
causes unecessary side effect. Couldn't we use weaker locks?

... Time is up. I'll continue looking this.

regards.

> - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
> rd_createSubid remained set. (That happened before this patch, but it has
> been harmless.) I fixed this in heap_create().
>
> - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
> when checksums are enabled. Observe the precedent that
> RelationCopyStorage() has not been exempting FSM_FORKNUM.
>
> - Pass log_newpage_range() a "false" for page_std, for the same reason
> RelationCopyStorage() does.
>
> - log_newpage_range() ignored its forkNum and page_std arguments, so we logged
> the wrong data for non-main forks. Before this patch, callers always passed
> MAIN_FORKNUM and "true", hence the lack of complaints.
>
> - Restored table_finish_bulk_insert(), though heapam no longer provides a
> callback. The API is still well-defined, and other table AMs might have use
> for it. Removing it feels like a separate proposal.
>
> - Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
> itself in light of this patch.
>
> - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
> it was overcounting.
>
> - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.
>
> - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
> "Write Ahead Log" -> "Settings", between similar settings
> wal_writer_flush_after and commit_delay. The other place I considered was
> "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
> backend_flush_after.
>
> - Gave each test a unique name. Changed test table names to be descriptive,
> e.g. test7 became trunc_trig.
>
> - Squashed all patches into one. Split patches are good when one could
> reasonably choose to push a subset, but that didn't apply here. I wouldn't
> push a GUC implementation without its documentation. Since the tests fail
> without the main bug fix, I wouldn't push tests separately.
>
> By the way, based on the comment at zheap_prepare_insert(), I expect zheap
> will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
> and instead test for RELPERSISTENCE_PERMANENT.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Khandekar 2019-11-20 08:32:33 Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Previous Message Peter Eisentraut 2019-11-20 08:12:05 Re: could not stat promote trigger file leads to shutdown