Re: [HACKERS] WAL logging problem in 9.4.3?

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc:	robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject:	Re: [HACKERS] WAL logging problem in 9.4.3?
Date:	2020-03-21 22:49:20
Message-ID:	20200321224920.GB1763544@rfd.leadboat.com
Lists:	pgsql-hackers

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_mode = regress was the setting needed to reproduce this:

printf '%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:

printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
find . -name log | xargs grep -rl 'nm0 invalid'
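
To make the proximate cause concrete, here is a toy model (not PostgreSQL
code) of why a lookup that is correct in the leader gives a different answer
in a parallel worker. ProcessState, register_pending_sync(), skipping_wal()
and start_worker() are hypothetical names invented for this sketch; the only
thing taken from the discussion above is that the worker starts with zeroed
state instead of the leader's pendingSyncHash contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Toy stand-in for per-process state; NOT PostgreSQL's real structures. */
typedef struct ProcessState
{
    unsigned    pending[MAX_PENDING];   /* relfilenodes with WAL skipped */
    size_t      npending;
} ProcessState;

/* Record that this process is skipping WAL for a relfilenode. */
static void
register_pending_sync(ProcessState *ps, unsigned relfilenode)
{
    assert(ps->npending < MAX_PENDING);
    ps->pending[ps->npending++] = relfilenode;
}

/* Answer "is WAL being skipped for this relfilenode?" from local state only. */
static bool
skipping_wal(const ProcessState *ps, unsigned relfilenode)
{
    for (size_t i = 0; i < ps->npending; i++)
    {
        if (ps->pending[i] == relfilenode)
            return true;
    }
    return false;
}

/* A worker begins with zeroed state; nothing is copied from the leader. */
static ProcessState
start_worker(void)
{
    ProcessState worker = {0};

    return worker;
}
```

In this model, a leader that has called register_pending_sync() for a
relfilenode gets true from skipping_wal(), while a freshly started worker
gets false for the same relfilenode. Any real fix would have to decide how
the worker learns the leader's answer, which this sketch deliberately leaves
open.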

Not all are actual bugs. For example, get_relation_info() behaves fine:

/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.
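
The arithmetic behind that limit can be checked with a short sketch. It
assumes that on such platforms a kB-unit setting is capped at INT_MAX / 1024
kB, which is just under 2GB; NARROW_MAX_KILOBYTES, gb_to_kb() and
fits_narrow_platform() are names made up for this illustration, not
identifiers from the tree.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cap, in kB, when size_t or long is only 4 bytes wide. */
#define NARROW_MAX_KILOBYTES (INT_MAX / 1024)

/* Convert whole gigabytes to kilobytes. */
static int64_t
gb_to_kb(int64_t gb)
{
    assert(gb >= 0);
    return gb * 1024 * 1024;
}

/* Hypothetical check: does a value in kB fit under the assumed cap? */
static bool
fits_narrow_platform(int64_t value_kb)
{
    return value_kb >= 0 && value_kb <= NARROW_MAX_KILOBYTES;
}
```

Under that assumption, fits_narrow_platform(gb_to_kb(1024)) is false for
1TB, while fits_narrow_platform(gb_to_kb(1)) is true for 1GB, which is why
the substitution in the test is enough.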

Attachment	Content-Type	Size
debug-parallel-skip-wal-v0.patch	text/plain	5.1 KB

In response to

Re: [HACKERS] WAL logging problem in 9.4.3? at 2020-03-21 19:01:27 from Noah Misch

Responses

Re: [HACKERS] WAL logging problem in 9.4.3? at 2020-03-23 08:20:27 from Kyotaro Horiguchi
