Funny hang on PostgreSQL 10 during parallel index scan on slave

From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Funny hang on PostgreSQL 10 during parallel index scan on slave
Date: 2018-09-05 15:22:36
Message-ID: CAN-RpxBV0-EZhHSEMrZ3eTZGWH-tK40ZFEm4f3oiGavCEoX3nw@mail.gmail.com
Lists: pgsql-hackers

Hi all;

For the last few months we have been facing a funny problem on a slave
where queries go to 100% CPU usage and never finish, causing the recovery
process to hang and the replica to fall behind. Over time we ruled out a
lot of causes and were banging our heads against this one. Today we got a
break when we attached a debugger to various processes, even without
debugging symbols. Not only did we get useful stack traces from the hung
query, but attaching a debugger to the startup process caused the query to
finish. This has shown up in 10.2 and 10.5.

Based on the stack traces, we have concluded that the following seems to
happen:

1. The query is in a parallel index scan or similar.
2. A process is executing a parallel plan and allocating a significant
chunk of memory (2MB, for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a SIGUSR1, sleeps
5ms, sends another SIGUSR1, and so on.
4. The SIGUSR1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, it is interrupted again
on every retry, and we end up in an endless loop.
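
To make steps 3 to 5 concrete, here is a minimal sketch of the timing
argument. It is only a model, not PostgreSQL code: the function name, the
parameters and the assumption that the call restarts from scratch after
every signal are hypothetical and chosen for illustration.

```python
from typing import Optional


def attempts_until_done(work_ms: int, interval_ms: int,
                        start_ms: int = 0,
                        max_attempts: int = 1000) -> Optional[int]:
    """Model a call that needs work_ms of uninterrupted time.

    A signal arrives at every multiple of interval_ms. Each signal aborts
    the call, which is then retried from scratch (assumption). Returns the
    number of attempts needed, or None if the call has still not completed
    after max_attempts tries.
    """
    now = start_ms
    for attempt in range(1, max_attempts + 1):
        # Time of the next signal strictly after "now".
        next_signal = (now // interval_ms + 1) * interval_ms
        if now + work_ms <= next_signal:
            return attempt      # finished before the next signal arrived
        now = next_signal       # interrupted; retry starts at the signal
    return None                 # never got a long enough quiet window


if __name__ == "__main__":
    print(attempts_until_done(3, 5))   # shorter than the interval
    print(attempts_until_done(7, 5))   # longer than the interval
```

In this model a call shorter than the signal interval completes within one
or two attempts, while a call longer than the interval never does, no
matter how often it is retried.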

I see two possible solutions here:
1. Exponential backoff in sending signals, up to a maximum of maybe 1s.
2. Checking for pending signals before retrying the system call, if there
is any way to do that (I am not yet 100% sure where the call is, but I am
on my way to finding it).
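
As a rough illustration of the first option, the sketch below models a
sender that doubles its delay after every signal, capped at 1s. It is a
simulation under stated assumptions, not a patch: the function name, the
5ms starting delay and the restart-from-scratch behaviour are hypothetical.

```python
from typing import Optional, Tuple


def attempts_with_backoff(work_ms: int, initial_ms: int = 5,
                          cap_ms: int = 1000,
                          max_attempts: int = 1000
                          ) -> Optional[Tuple[int, int]]:
    """Model a call that needs work_ms of uninterrupted time while the
    signal sender doubles its delay after each signal, up to cap_ms.

    Returns (attempts, elapsed_ms) once the call completes, or None if it
    has still not completed after max_attempts tries.
    """
    delay = initial_ms
    elapsed = 0
    for attempt in range(1, max_attempts + 1):
        if work_ms <= delay:
            # The quiet window is now long enough for the call to finish.
            return attempt, elapsed + work_ms
        elapsed += delay                  # interrupted after "delay" ms
        delay = min(delay * 2, cap_ms)    # sender backs off, up to the cap
    return None


if __name__ == "__main__":
    print(attempts_with_backoff(7))
    print(attempts_with_backoff(300))
    print(attempts_with_backoff(2000))
```

In this model a 7ms call completes on its second attempt and a 300ms call
on its seventh. A call that needs longer than the cap would still never
finish, so the cap has to be larger than the slowest call one expects.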

Any feedback or thoughts before we look at implementing a patch?
--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
