Re: [HACKERS] WAL logging problem in 9.4.3?

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 00:15:21
Message-ID: 20191226001521.GA1772687@rfd.leadboat.com
Lists: pgsql-hackers

By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?

=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal. I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
begin;
create table t (c int);
savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
commit; -- assertion failure

=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation. Before v24nm, the
new index relation skipped WAL unconditionally. Since v24nm, the new index
relation never skips WAL. I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.
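
For concreteness, a scenario of roughly this shape triggers the code path (an
untested sketch, not the committed alter_table.sql test; assumes
wal_level=minimal, and that this typmod change behaves as described):

```sql
begin;
create table t (c varchar(10) primary key);
insert into t values ('x');
-- Shrinking the typmod forces a table rewrite, but the primary key's
-- index definition is unchanged, so TryReuseIndex() can avoid rebuilding it.
alter table t alter column c type varchar(5);
commit;  -- the reused index must be fsync'd or WAL-logged here
```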

=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease? I haven't tested, but this sequence seems possible:

TRUNCATE
reduces MAIN_FORKNUM from 100 blocks to 0 blocks
reduces FSM_FORKNUM from 3 blocks to 0 blocks
COPY
raises MAIN_FORKNUM from 0 blocks to 110 blocks
does not change FSM_FORKNUM
COMMIT
should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, besides the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.) If that's not a problem, would you explain?
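
In SQL terms, the sequence above would look something like this (untested,
mirroring the steps listed; the file path is a placeholder, and it assumes the
FSM fork exists with the stated sizes before the TRUNCATE):

```sql
begin;
create table t (c int);                    -- relation starts out WAL-skipped
-- ... populate t so MAIN has 100 blocks and FSM has 3 blocks ...
truncate t;                                -- MAIN 100->0, FSM 3->0
copy t from '/tmp/more_rows.csv';          -- MAIN 0->110; FSM untouched
commit;  -- MAIN saw no net shrink, so storage.c picks log_newpage_range(),
         -- leaving the FSM truncation neither logged nor fsync'd
```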

=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation. However, the mdexists() test prevents
anything bad from happening. No change is necessary. Example:

SET wal_skip_threshold = 0;
BEGIN;
SAVEPOINT q;
CREATE TABLE t (c) AS SELECT 1;
ROLLBACK TO q; -- truncates the relfilenode
CHECKPOINT; -- unlinks the relfilenode
COMMIT; -- calls mdexists() on the relfilenode

=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks. This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation(). Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid. This is not a significant loss, now that
  AssertPendingSyncs_RelationCache() opens relations. (I considered making
  the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
  somewhere until the next CommandCounterIncrement(), which would make it
  actually affect RelationNeedsWAL(). That might have been better in general,
  but it felt complex without clear benefits.)

  Skipped AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
  that work no matter what does ereport(ERROR) would be tricky and low-value.

- Extracted the RelationTruncate() changes into new function
RelationPreTruncate(), so table access methods that can't use
RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB. My second preference was for
4MB. In your data, 2MB and 4MB had similar performance at optimal
wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files(). Under
"-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
we're running CLUSTER for the second time in one transaction. I used
relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
assertion failure if we hold no lock at that point.

- Changed toast_get_valid_index() to retain locks until end of transaction.
When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy. Updated some comments and names.

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> Anyway the default value ought to be defined based on the default
> configuration.

PostgreSQL does not follow that principle. Settings that change permanent
resource consumption, such as wal_buffers, have small defaults. Settings that
don't change permanent resource consumption can have defaults that favor a
well-tuned system.

Attachment Content-Type Size
skip-wal-v30nm.patch text/plain 95.5 KB
