Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 03:46:39
Message-ID: 20191226.124639.5401358775142406.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thank you for the findings.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> By improving AssertPendingSyncs_RelationCache() and by testing with
> -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> Would you fix these?

I'd like to do that. Please give me some time.

> === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
>
> A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> when running "make check" under wal_level=minimal. I test this way:
>
> printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> make check TEMP_CONFIG=$PWD/minimal.conf
>
> Self-contained demonstration:
> begin;
> create table t (c int);
> savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> commit; -- assertion failure
>
>
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
>
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation. Before v24nm, the
> new index relation skipped WAL unconditionally. Since v24nm, the new index
> relation never skips WAL. I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.
>
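(For reference, one way to reach the TryReuseIndex() path might be a
binary-compatible column type change on an indexed column. This is an
untested sketch of mine; the actual test added to alter_table.sql may
well differ.)

```sql
-- Under wal_level=minimal, all in one transaction:
BEGIN;
CREATE TABLE t (c varchar(10) PRIMARY KEY);    -- new relfilenodes skip WAL
ALTER TABLE t ALTER COLUMN c TYPE varchar(20); -- TryReuseIndex() keeps the
                                               -- old index storage, which
                                               -- must stay WAL-skipped
COMMIT;                                        -- should sync, not WAL-log
```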
>
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
>
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease? I haven't tested, but this sequence seems possible:
>
> TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
> COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
> COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
>
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.) If that's not a problem, would you explain?
>
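(Spelling the sequence out as SQL, I think the scenario would look
something like the following untested sketch; the table must have been
created in the same transaction so that TRUNCATE shortens the existing
relfilenode in place via mdtruncate() rather than assigning a new one.)

```sql
-- Under wal_level=minimal:
BEGIN;
CREATE TABLE t (c int);                          -- relfilenode skips WAL
INSERT INTO t SELECT generate_series(1, 30000);  -- grows main and FSM forks
TRUNCATE t;       -- same-transaction table: truncated in place, shrinking
                  -- both MAIN_FORKNUM and FSM_FORKNUM
INSERT INTO t SELECT generate_series(1, 40000);  -- main fork now larger than
                                                 -- before; FSM may stay shorter
COMMIT;           -- sync-vs-WAL decision sees only MAIN_FORKNUM's shrinkage
```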

> === Non-defect notes
>
> Once you have a correct patch, would you run check-world with
> -DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a
> day or more, but that's fine.

Sure.

> The new smgrimmedsync() calls are potentially fragile, because they sometimes
> target a file of a dropped relation. However, the mdexists() test prevents
> anything bad from happening. No change is necessary. Example:
>
> SET wal_skip_threshold = 0;
> BEGIN;
> SAVEPOINT q;
> CREATE TABLE t (c) AS SELECT 1;
> ROLLBACK TO q; -- truncates the relfilenode
> CHECKPOINT; -- unlinks the relfilenode
> COMMIT; -- calls mdexists() on the relfilenode
>
>
> === Notable changes in v30nm
>
> - Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
> Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

Ahh, I wrongly assumed that MAX_KILOBYTES prevented that setting.
work_mem and maintenance_work_mem are cast to double or long before
the calculation. In this case it is enough that the calculation unit
becomes kilobytes instead of bytes.

> - Changed AssertPendingSyncs_RelationCache() to open all relations on which
> the transaction holds locks. This permits detection of cases where
> RelationNeedsWAL() returns true but storage.c will sync the relation.
>
> Removed the assertions from RelationIdGetRelation(). Using
> "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
> actually problematic, since invalidation updates rd_node while other code
> updates rd_firstRelfilenodeSubid. This is not a significant loss, now that
> AssertPendingSyncs_RelationCache() opens relations. (I considered making
> the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
> somewhere until the next CommandCounterIncrement(), which would make it
> actually affect RelationNeedsWAL(). That might have been better in general,
> but it felt complex without clear benefits.)
>
> Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
> that work no matter what does ereport(ERROR) would be tricky and low-value.

Right about ereport, but I'm not sure we should remove the whole assertion from the abort path.

> - Extracted the RelationTruncate() changes into new function
> RelationPreTruncate(), so table access methods that can't use
> RelationTruncate() have another way to request that work.

Sounds reasonable. Also the new behavior of max_truncated looks fine.

> - Changed wal_skip_threshold default to 2MB. My second preference was for
> 4MB. In your data, 2MB and 4MB had similar performance at optimal
> wal_buffers, but 4MB performed worse at low wal_buffers.

That's fine with me.

> - Reverted most post-v24nm changes to swap_relation_files(). Under
> "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
> we're running CLUSTER for the second time in one transaction. I used

I don't agree with that. As I think I mentioned upthread, rel2 is
wrongly marked as "new in this transaction" at that time, which
prevents its later removal, so such entries wrongly persist for the
life of the backend and cause problems. (That was found by the
abort-time AssertPendingSyncs_RelationCache().)

> relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
> assertion failure if we hold no lock at that point.

I agree with that.

> - Change toast_get_valid_index() to retain locks until end of transaction.
> When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
> revealed that we retain no lock on the TOAST index.

Sounds more reasonable than relation_open(AnyLock) in swap_relation_files().

> - Ran pgindent and perltidy. Updated some comments and names.
>
> On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> > Anyway the default value ought to be defined based on the default
> > configuration.
>
> PostgreSQL does not follow that principle. Settings that change permanent
> resource consumption, such as wal_buffers, have small defaults. Settings that
> don't change permanent resource consumption can have defaults that favor a
> well-tuned system.

I think I understand that. Actually, 4MB was too large, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
