Re: [HACKERS] WAL logging problem in 9.4.3?

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-23 21:21:36
Message-ID: 20191123212136.GA41522@gust.leadboat.com
Lists: pgsql-hackers

On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> By the way, before finalizing this, I'd like to share the results of a
> brief benchmark.

What non-default settings did you use? Please give the output of this or a
similar command:

select name, setting from pg_settings where setting <> boot_val;

If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.

> With 10 pgbench sessions.
> pages SYNC WAL
> 1: 915 ms 301 ms
> 3: 1634 ms 508 ms
> 5: 1634 ms 293 ms
> 10: 1671 ms 1043 ms
> 17: 1600 ms 333 ms
> 31: 1864 ms 314 ms
> 56: 1562 ms 448 ms
> 100: 1538 ms 394 ms
> 177: 1697 ms 1047 ms
> 316: 3074 ms 1788 ms
> 562: 3306 ms 1245 ms
> 1000: 3440 ms 2182 ms
> 1778: 5064 ms 6464 ms # WAL's slope becomes steep
> 3162: 8675 ms 8165 ms

For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics. Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL. This benchmark procedure may help:

1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I would compare pgbench tps and latency between the seconds when DDL is and is
not running. As you did in earlier tests, I would repeat it using various
page counts, with and without sync.

On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> +Prefer to do the same in future access methods. However, two other approaches
> +can work. First, an access method can irreversibly transition a given fork
> +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> +smgrimmedsync(). Second, an access method can opt to write WAL
> +unconditionally for permanent relations. When using the second method, do not
> +call RelationCopyStorage(), which skips WAL.
>
> Even using these methods, CommitTransaction still flushes the buffers and
> then syncs the files again. Isn't a description like the following needed?
>
> ===
> Even if an access method switches an in-transaction-created relfilenode to
> WAL-writing, Commit(Prepare)Transaction still flushes all buffers for the
> file and then smgrimmedsync()s the file.
> ===

It is enough that the text says to prefer the approach that core access
methods use. The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.

> + rel1 = relation_open(r1, AccessExclusiveLock);
> + RelationAssumeNewRelfilenode(rel1);
>
> It cannot be accessed from other sessions. Theoretically it doesn't
> need a lock, but NoLock cannot be used there since there's a path that
> doesn't take a lock on the relation. But AEL seems too strong, and it
> causes unnecessary side effects. Couldn't we use a weaker lock?

We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.

On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > === Defect 1: gistGetFakeLSN()
> >
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN(). One could reproduce the problem just by running this
> > sequence in psql:
> >
> > begin;
> > create table gist_point_tbl(id int4, p point);
> > create index gist_pointidx on gist_point_tbl using gist(p);
> > insert into gist_point_tbl (id, p)
> > select g, point(g*10, g*10) from generate_series(1, 1000) g;
> >
> > I've included a wrong-in-general hack to make the test pass. I see two main
> > options for fixing this:
> >
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect. Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible. For example, we could use a backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater than a recent real LSN. That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> >
> > (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > other AM-independent code that skips WAL.
> >
> > Overall, I like the cleanliness of (a). The main argument for (b) is that it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods. (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > lean toward (a). Any other ideas or preferences?
>
> I don't like (b), either.
>
> What we need there is just a sequence of increasing numbers for page
> LSNs, though they must be compatible with real LSNs. Couldn't we use
> GetXLogInsertRecPtr() in that case?

No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
GiST pages need an increasing LSN value.
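The counter optimization mentioned under (a) amounts to the following logic. This is an abstract Python stand-in, not PostgreSQL code; `get_insert_lsn` and `emit_empty_record` are invented placeholders for GetXLogInsertRecPtr() and the proposed empty WAL record:

```python
def make_fake_lsn_source(get_insert_lsn, emit_empty_record):
    """Yield increasing LSN-like values for WAL-skipped GiST pages.

    Serve values from a backend-local counter (like the one
    gistGetFakeLSN() uses for temp relations) while it stays at or
    below a recent real LSN; once it would exceed that, fall back to
    reserving a genuine LSN via the hypothetical empty WAL record.
    """
    counter = 0

    def next_lsn():
        nonlocal counter
        counter += 1
        if counter > get_insert_lsn():
            # Reserve a real LSN.  With ongoing WAL traffic the real
            # insert position outruns the counter, so this branch is
            # almost never taken -- the record "almost never appears".
            counter = emit_empty_record()
        return counter

    return next_lsn
```

Because each real WAL record advances the insert position by more than one byte, the returned values stay strictly increasing even across the fallback, which is the property GiST needs.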

I noticed an additional defect:

BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record

If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.
