Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-25 12:32:04
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello. This is a revised version.

At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190321054835(dot)GB3842129(at)rfd(dot)leadboat(dot)com>
> On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> > > InvalidBlockNumber" are not sync requests at all. Those just record the fact
> > > of a RelationTruncate() happening. If you can think of a way to improve that,
> > > please do so. If not, it's okay.
> >
> > After a truncation, required WAL records are emitted for the
> > truncated pages, so no need to sync. Does this make sense for
> > you? (Maybe commit is needed there)
> Yes, the behavior makes sense. I wasn't saying the quoted code had the wrong
> behavior. I was saying that the data structure called "pendingSyncs" is
> actually "pending syncs and past truncates". It's not ideal that the variable
> name differs from the variable purpose in this way. However, it's okay if you
> don't find a way to improve that.

It is convincing. The current member names "sync_above" and
"truncated_to" are wordings based on the operations that have
happened on the relation. I changed the names to words based on
what to do on the relation. Renamed to skip_wal_min_blk and

> > I don't have enough time for now so the new version will be
> > posted early next week.
> I'll wait for that version.

At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190320(dot)171754(dot)171896368(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Hence, this change causes us to emit WAL for the metapage of a
> > that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> > relfilenode. I've attached a test case for this; it is a patch that applies
> > on top of your v7 patches. The test checks for orphaned files after redo.
> Oops! Added RelationNeedsWAL(index) there. (Attched 1st patch on
> top of this patchset)

Done in the attached patch. But the orphan file check in the TAP
diff was wrong. It detects orphaned pg_class entry for temprary
tables, which dissapears after the first autovacuum. The revised
tap test (check_orphan_relfilenodes) doesn't faultly fail and
catches the bug in the previous patch.

> > > + * If no tuple was inserted, it's possible that we are truncating a
> > > + * relation. We need to emit WAL for the metapage in the case. However it
> > > + * is not required elsewise,
> >
> > Did you mean to write more words after that comma?
> Sorry, it is just a garbage. Required work is done in
> _bt_blwritepage.


> > We'd need a mechanism to un-remove the sync at subtransaction abort. My
> > attachment includes a test case demonstrating the consequences of that defect.
> > Please look for other areas that need to know about subtransactions; patch v7
> > had no code pertaining to subtransactions.

Added. Passed the new tests.

> > > + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> >
> > As you mention upthread, you have many debugging elog()s. These are too
> > detailed to include in every binary, but I do want them in the code. See
> > CACHE_elog() for a good example of achieving that.
> Agreed will do. They were need to check the behavior precisely
> but usually not needed.

I removed all such elog()s.

> > RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> > prefix to these fields.
> > This is a nonstandard place to clear fields. Clear them in
> > load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
> > will then rely on palloc0() for implicit initialization.)

Both are done.

> > > - if (RelationNeedsWAL(relation))
> > > + if (BufferNeedsWAL(relation, buffer) ||
> > > + BufferNeedsWAL(relation, newbuf))
> >
> > This is fine if both buffers need WAL or neither buffer needs WAL. It is not
> > fine when one buffer needs WAL and the other buffer does not. My attachment
> > includes a test case. Of the bugs I'm reporting, this one seems most
> > difficult to solve well.

I refactored heap_insert/delete so that the XLOG stuff can be
used from heap_update. Then modify heap_update so that it emits

> > We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> > heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> > eliminated the old way. If that creates more problems than it solves, please
> > at least write down a coding rule to explain why certain commands shouldn't
> > use the old way.
> Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> CREATE INDEX. I'll consider them.

I added the CLUSTER case in the new patchset. For the SET
TABLESPACE case, it works on SMGR layer and manipulates fork
files explicitly but this stuff is Relation based and doesn't
distinguish forks. We can modify this stuff to work on smgr and
make it fork-aware but I don't think it is worth doing.

CREATE INDEX is not changed in this version. I continue to
consider it.

The attached is the new patchset.

- Revised version of test.

- Fixed version of v7

- New file, moves xlog stuff of heap_insert and heap_delete out
of the functions so that heap_update can use them.

- Renamed variables, functions. Removed elogs.

- Fixed heap_update.

- New file, modifies CLUSTER to use this feature.

- New file, adds a comment that excuses for not using this stuff.


Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v8-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 9.5 KB
v8-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 1.6 KB
v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch text/x-patch 11.5 KB
v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 23.4 KB
v8-0005-Fix-WAL-skipping-feature.patch text/x-patch 18.7 KB
v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch text/x-patch 8.2 KB
v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch text/x-patch 1.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2019-03-25 12:44:08 pgsql: Get rid of backtracking in jsonpath_scan.l
Previous Message Konstantin Knizhnik 2019-03-25 12:12:27 Re: libpq compression