Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-25 02:08:54
Message-ID: 20191125.110854.1439648354665518969.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalize this, I'd like to share the result of a
> > brief benchmarking.
>
> What non-default settings did you use? Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

> select name, setting from pg_settings where setting <> boot_val;
>
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.

> > With 10 pgbench sessions.
> > pages SYNC WAL
> > 1: 915 ms 301 ms
> > 3: 1634 ms 508 ms
> > 5: 1634 ms 293ms
> > 10: 1671 ms 1043 ms
> > 17: 1600 ms 333 ms
> > 31: 1864 ms 314 ms
> > 56: 1562 ms 448 ms
> > 100: 1538 ms 394 ms
> > 177: 1697 ms 1047 ms
> > 316: 3074 ms 1788 ms
> > 562: 3306 ms 1245 ms
> > 1000: 3440 ms 2182 ms
> > 1778: 5064 ms 6464 ms # WAL's slope becomes steep
> > 3162: 8675 ms 8165 ms
>
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics. Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL. This benchmark procedure may help:
>
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running. As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understood the "DDL" is not pure DDLs but a kind of
define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods. However, two other approaches
> > +can work. First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync(). Second, an access method can opt to write WAL
> > +unconditionally for permanent relations. When using the second method, do not
> > +call RelationCopyStorage(), which skips WAL.
> >
> > Even using these methods, TransactionCommit flushes out buffers then
> > sync files again. Isn't a description something like the following
> > needed?
> >
> > ===
> > Even an access method switched a in-transaction created relfilenode to
> > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> > file then smgrimmedsync() the file.
> > ===
>
> It is enough that the text says to prefer the approach that core access
> methods use. The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right and I agreed.

> > + rel1 = relation_open(r1, AccessExclusiveLock);
> > + RelationAssumeNewRelfilenode(rel1);
> >
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock but NoLock cannot be used there since there's a path that
> > doesn't take lock on the relation. But AEL seems too strong and it
> > causes unecessary side effect. Couldn't we use weaker locks?
>
> We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configuragion. So I agree that AEL doesn't have no side
effect.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > === Defect 1: gistGetFakeLSN()
> > >
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN(). One could reproduce the problem just by running this
> > > sequence in psql:
> > >
> > > begin;
> > > create table gist_point_tbl(id int4, p point);
> > > create index gist_pointidx on gist_point_tbl using gist(p);
> > > insert into gist_point_tbl (id, p)
> > > select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > >
> > > I've included a wrong-in-general hack to make the test pass. I see two main
> > > options for fixing this:
> > >
> > > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > > effect. Make GiST use that for permanent relations that are skipping WAL.
> > > Further optimizations are possible. For example, we could use a backend-local
> > > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > > counter is greater a recent real LSN. That optimization is probably too
> > > clever, though it would make the new WAL record almost never appear.
> > >
> > > (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> > > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > > other AM-independent code that skips WAL.
> > >
> > > Overall, I like the cleanliness of (a). The main argument for (b) is that it
> > > ensures we have all the features to opt-out of WAL skipping, which could be
> > > useful for out-of-tree index access methods. (I think we currently have the
> > > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > > lean toward (a). Any other ideas or preferences?
> >
> > I don't like (b), too.
> >
> > What we need there is any sequential numbers for page LSN but really
> > compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
> > case?
>
> No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
> GiST pages need an increasing LSN value.

Sorry, I noticed that after the mail went out. I agree to (a) and will
do that.

> I noticed an additional defect:
>
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
>
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

The TRUNCATE replaces relfilenode in the catalog and the pre-TRUNCATE
content wouldn't be seen after COMMIT. Since the file has no pages,
it's right that no FPI emitted. What we should make sure the empty
file's metadata is synced out. But I think that kind of failure
shoudn't happen on modern file systems. If we don't rely on such
behavior, we can make sure thhat by turning the zero-pages case from
WAL into file sync. I'll do that in the next version.

I'll post the next version as a single patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-11-25 02:28:58 Re: Safeguards against incorrect fd flags for fsync()
Previous Message Ranier Vilela 2019-11-25 01:18:28 RE: [PATCH] Style: fix function declaration