Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Chengchao Yu <chengyu(at)microsoft(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Prabhat Tripathi <ptrip(at)microsoft(dot)com>, Sunil Kamath <Sunil(dot)Kamath(at)microsoft(dot)com>, Michal Primke <mprimke(at)microsoft(dot)com>, TEJA Mupparti <Tejeswar(dot)Mupparti(at)microsoft(dot)com>
Subject: Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Date: 2019-09-09 12:04:43
Message-ID: CAA4eK1Ju6aghDmuzbaUn75C9BUGzqkd=Lxkewr52SzfPobSGGw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 27, 2019 at 6:22 AM Chengchao Yu <chengyu(at)microsoft(dot)com> wrote:
>
> Thus, I have updated the patch v3 according to your suggestions. Could you help to review again?
> Please let me know should you have more suggestions or feedbacks.
>

I have looked into this patch and I don't think it fixes the
problem. I ran the commands you suggested in single-user mode:
create table, insert, and then checkpoint. What I see is almost the
same behavior you described in one of the earlier emails, with a
slight difference that makes me think the fix you are proposing is
not correct. Below is what you wrote:

"The second type is in Step #4. At the time when “checkpoint” SQL
command is being executed, PG has already set up the before_shmem_exit
callback ShutdownPostgres(), which releases all lw-locks given
transaction or sub-transaction is on-going. So after the first IO
error, the buffer page’s lw-lock gets released successfully. However,
later ShutdownXLOG() is invoked, and PG tries to flush buffer pages
again, which results in the second IO error. Different from the first
time, this time, all the previous executed before/on_shmem_exit
callbacks are not invoked again due to the decrease of the indexes. So
lw-locks for buffer pages are not released when PG tries to get the
same buffer lock in AbortBufferIO(), and then PG process gets stuck."

The only difference is in the last step: for me, it hits an
assertion failure in ReleaseAuxProcessResources() instead of getting
stuck. Below is the call stack:

postgres.exe!ExceptionalCondition(const char *
conditionName=0x00db0c78, const char * errorType=0x00db0c68, const
char * fileName=0x00db0c18, int lineNumber=1722) Line 55 C
postgres.exe!UnpinBuffer(BufferDesc * buf=0x052a104c, bool
fixOwner=true) Line 1722 + 0x2f bytes C
postgres.exe!ReleaseBuffer(int buffer=96) Line 3367 + 0x17 bytes C
postgres.exe!ResourceOwnerReleaseInternal(ResourceOwnerData *
owner=0x0141f6e8, <unnamed-enum-RESOURCE_RELEASE_BEFORE_LOCKS>
phase=RESOURCE_RELEASE_BEFORE_LOCKS, bool isCommit=false, bool
isTopLevel=true) Line 526 + 0x9 bytes C
postgres.exe!ResourceOwnerRelease(ResourceOwnerData *
owner=0x0141f6e8, <unnamed-enum-RESOURCE_RELEASE_BEFORE_LOCKS>
phase=RESOURCE_RELEASE_BEFORE_LOCKS, bool isCommit=false, bool
isTopLevel=true) Line 484 + 0x17 bytes C
postgres.exe!ReleaseAuxProcessResources(bool isCommit=false) Line
861 + 0x15 bytes C
> postgres.exe!ReleaseAuxProcessResourcesCallback(int code=1, unsigned int arg=0) Line 881 + 0xa bytes C
postgres.exe!shmem_exit(int code=1) Line 272 + 0x1f bytes C
postgres.exe!proc_exit_prepare(int code=1) Line 194 + 0x9 bytes C
postgres.exe!proc_exit(int code=1) Line 107 + 0x9 bytes C
postgres.exe!errfinish(int dummy=0, ...) Line 538 + 0x7 bytes C
postgres.exe!mdwrite(SMgrRelationData * reln=0x0147e140, ForkNumber
forknum=MAIN_FORKNUM, unsigned int blocknum=7, char *
buffer=0x0542dd00, bool skipFsync=false) Line 713 + 0x4c bytes C
postgres.exe!smgrwrite(SMgrRelationData * reln=0x0147e140,
ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char *
buffer=0x0542dd00, bool skipFsync=false) Line 587 + 0x24 bytes C
postgres.exe!FlushBuffer(BufferDesc * buf=0x052a104c,
SMgrRelationData * reln=0x0147e140) Line 2759 + 0x1d bytes C
postgres.exe!SyncOneBuffer(int buf_id=95, bool
skip_recently_used=false, WritebackContext * wb_context=0x012ccea0)
Line 2402 + 0xb bytes C
postgres.exe!BufferSync(int flags=5) Line 1992 + 0x15 bytes C
postgres.exe!CheckPointBuffers(int flags=5) Line 2586 + 0x9 bytes C
postgres.exe!CheckPointGuts(unsigned __int64
checkPointRedo=22933176, int flags=5) Line 8991 + 0x9 bytes C
postgres.exe!CreateCheckPoint(int flags=5) Line 8780 + 0x11 bytes C
postgres.exe!ShutdownXLOG(int code=1, unsigned int arg=0) Line 8333
+ 0x7 bytes C
postgres.exe!shmem_exit(int code=1) Line 272 + 0x1f bytes C
postgres.exe!proc_exit_prepare(int code=1) Line 194 + 0x9 bytes C
postgres.exe!proc_exit(int code=1) Line 107 + 0x9 bytes C
postgres.exe!errfinish(int dummy=0, ...) Line 538 + 0x7 bytes C
postgres.exe!mdwrite(SMgrRelationData * reln=0x0147e140, ForkNumber
forknum=MAIN_FORKNUM, unsigned int blocknum=7, char *
buffer=0x0542dd00, bool skipFsync=false) Line 713 + 0x4c bytes C
postgres.exe!smgrwrite(SMgrRelationData * reln=0x0147e140,
ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char *
buffer=0x0542dd00, bool skipFsync=false) Line 587 + 0x24 bytes C
postgres.exe!FlushBuffer(BufferDesc * buf=0x052a104c,
SMgrRelationData * reln=0x0147e140) Line 2759 + 0x1d bytes C
postgres.exe!SyncOneBuffer(int buf_id=95, bool
skip_recently_used=false, WritebackContext * wb_context=0x012ce580)
Line 2402 + 0xb bytes C
postgres.exe!BufferSync(int flags=44) Line 1992 + 0x15 bytes C
postgres.exe!CheckPointBuffers(int flags=44) Line 2586 + 0x9 bytes C
postgres.exe!CheckPointGuts(unsigned __int64
checkPointRedo=22933176, int flags=44) Line 8991 + 0x9 bytes C
postgres.exe!CreateCheckPoint(int flags=44) Line 8780 + 0x11 bytes C
postgres.exe!RequestCheckpoint(int flags=44) Line 967 + 0xc bytes C
postgres.exe!standard_ProcessUtility(PlannedStmt * pstmt=0x0146b738,
const char * queryString=0x0146ad98,
<unnamed-enum-PROCESS_UTILITY_TOPLEVEL>
context=PROCESS_UTILITY_TOPLEVEL, ParamListInfoData *
params=0x00000000, QueryEnvironment * queryEnv=0x00000000,
_DestReceiver * dest=0x00adc1d8, char * completionTag=0x012cfdbc)
Line 769 + 0x28 bytes C

It seems to me that other code, such as ReleaseAuxProcessResources(),
runs before AbortBufferIO() and also expects LWLocks to have been
released. I didn't get much time to debug this further, but I think
more analysis is required for this issue.

I guess you didn't encounter this problem because you are not using
an assert-enabled build, but there could be some other reason as well.

I have marked this CF entry as "Waiting on Author".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com