Re: WIP: WAL prefetch (another approach)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-04-21 16:30:22
Message-ID: 773932.1619022622@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> Yeah, it would have been nice to include that but it'll have to be for
> v15 due to lack of time to convince myself that it was correct. I do
> intend to look into more concurrency of that kind for v15. I have
> pushed these patches, updated to be disabled by default.

I have a fairly bad feeling about these patches. I've already fixed
one critical bug (see 9e4114822), but I am still seeing random, hard
to reproduce failures in WAL replay testing. It looks like sometimes
the "decoded" version of a WAL record doesn't match what I see in
the on-disk data, which I'm having no luck tracing down.

Another interesting failure I just came across is

2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4
TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606)
2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap

with stack trace

#0 0x90b669f0 in kill ()
#1 0x90c01bfc in abort ()
#2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69
#3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845
#4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466
#5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<value temporarily unavailable, due to optimizations>) at xlogreader.c:352
#6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398
#7 0x000ea320 in StartupXLOG () at xlog.c:7567
#8 0x00362218 in StartupProcessMain () at startup.c:244
#9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at bootstrap.c:447
#10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
#11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
#12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:209

I am not sure whether the checksum failure itself is real or a variant
of the seeming bad-reconstruction problem, but what I'm on about right
at this moment is that the error handling logic for this case seems
quite broken. Why is a checksum failure only worthy of a LOG message?
Why is ValidXLogRecord() issuing a log message for itself, rather than
being tied into the report_invalid_record() mechanism? Why are we
evidently still trying to decode records afterwards?

In general, I'm not too pleased with the apparent attitude in this
thread that it's okay to push a patch that only mostly works on the
last day of the dev cycle and plan to stabilize it later.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-04-21 17:40:35 Re: PATCH: Add GSSAPI ccache_name option to libpq
Previous Message Stefan Keller 2021-04-21 16:01:04 Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)