Parallel worker hangs while handling errors.

From: vignesh C <vignesh21(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Parallel worker hangs while handling errors.
Date: 2020-07-03 09:10:56
Message-ID: CALDaNm1d1hHPZUg3xU4XjtWBOLCrA+-2cJcLpw-cePZ=GgDVfA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Parallel worker hangs while handling errors.

Analysis:
When there is an error in the parallel worker process, we call
ereport/elog with the error message. The worker then jumps from
errfinish to the setjmp point that was set earlier in the
StartBackgroundWorker function. The worker process then sends the error
message through shared memory to the leader process. The shared memory
size is 16K; if the error message is less than 16K, it works fine. If
the error message is bigger, the worker process waits for the leader
process to read the message, free up some space in shared memory, and
set the latch. The worker waits at the following back trace:
#4 0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454,
wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5 0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350
"SERROR", len=230015) at pqmq.c:171
#6 0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7 0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0
<errordata>) at elog.c:3327
#8 0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

The leader process then identifies that there are some messages that
need to be processed. It copies the messages and sets the latch, so
that the worker process can copy the remaining message, from the
following function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);
The worker is not able to receive any signal at this point and hangs
indefinitely. The worker hangs in this case because, when the worker is
started, the signals are masked using sigprocmask. Unblocking of
signals is done by calling BackgroundWorkerUnblockSignals in
ParallelWorkerMain. Now, due to error handling, the worker has jumped
back to the setjmp point in the StartBackgroundWorker function. Here
the signals are in a blocked state, hence the signal is not received by
the worker process.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert and I had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
needs to be fixed.
I could reproduce this issue when there is an error during copy of
toast data using parallel copy, which is an in-progress project. I
don't have a test case to reproduce it on head. Any suggestions for a
test case on head?
The attached patch has the fix for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment: 0001-Fix-for-Parallel-worker-hangs-while-handling-errors.patch (application/x-patch, 1.3 KB)
