Re: [HACKERS] WAL logging problem in 9.4.3?

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-18 04:54:34
Message-ID: 20191118045434.GA1173436@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that. I'll check my edits against the things you
> list here, and I'll share on-list before committing. I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm. I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:

begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?

=== Defect 2: repetitive work when syncing many relations

For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions. (One could,
however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
you agree, could you modify v24nm to implement that?

Notable changes in v24nm:

- Wrote section "Skipping WAL for New RelFileNode" in
src/backend/access/transam/README to be the main source concerning the new
coding rules.

- Updated numerous comments and doc sections.

- Eliminated the pendingSyncs list in favor of a "sync" field in
pendingDeletes. I mostly did this to eliminate the possibility of the lists
getting out of sync. This removed considerable parallel code for managing a
second list at end-of-xact. We now call smgrDoPendingSyncs() only when
committing or preparing a top-level transaction.

- Whenever code sets an rd_*Subid field of a Relation, it must call
EOXactListAdd(). swap_relation_files() was not doing so, so the field
remained set during the next transaction. I introduced
RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
so it also affects the mapped relation case.

- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
rd_createSubid remained set. (That happened before this patch, but it has
been harmless.) I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
when checksums are enabled. Observe the precedent that
RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
the wrong data for non-main forks. Before this patch, callers always passed
MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
callback. The API is still well-defined, and other table AMs might have use
for it. Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
"Write Ahead Log" -> "Settings", between similar settings
wal_writer_flush_after and commit_delay. The other place I considered was
"Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
backend_flush_after.

- Gave each test a unique name. Changed test table names to be descriptive,
e.g. test7 became trunc_trig.

- Squashed all patches into one. Split patches are good when one could
reasonably choose to push a subset, but that didn't apply here. I wouldn't
push a GUC implementation without its documentation. Since the tests fail
without the main bug fix, I wouldn't push tests separately.

By the way, based on the comment at zheap_prepare_insert(), I expect zheap
will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
and instead test for RELPERSISTENCE_PERMANENT.

nm

Attachment Content-Type Size
skip-wal-v24nm.patch text/plain 81.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2019-11-18 05:02:44 Re: dropdb --force
Previous Message vignesh C 2019-11-18 03:43:14 Re: dropdb --force