Re: Problem with synchronous replication

From: lingce(dot)ldm <lingce(dot)ldm(at)alibaba-inc(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Problem with synchronous replication
Date: 2019-10-30 06:27:33
Message-ID: 4BA09384-F86E-4A3A-AA5D-544C8B3F32B5@alibaba-inc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Oct 29, 2019, at 18:50, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com <mailto:horikyota(dot)ntt(at)gmail(dot)com>> wrote:
>
> Hello.
>
> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" <lingce(dot)ldm(at)alibaba-inc(dot)com <mailto:lingce(dot)ldm(at)alibaba-inc(dot)com>> wrote in
>>
>> Hi,
>>
>> I recently discovered two possible bugs about synchronous replication.
>>
>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it is not detached,
>> acquires the SyncRepLock lock and deletes it. If this element has been deleted by walsender,
>> it will be deleted repeatedly, SHMQueueDelete will core with a segment fault.
>>
>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then check
>> whether the queue is detached or not.
>
> I think you're right here.

Thanks.

>
>> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one interrupt.
>> For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate the wait
>> with suitable warning. As follows:
>>
>> a. set QueryCancelPending to false
>> b. errport outputs one warning
>> c. calls SyncRepCancelWait to delete one element from the queue
>>
>> If another cancel interrupt arrives when we are outputting warning at step b, the errfinish
>> will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling autovacuum
>> task", then the process will jump to the sigsetjmp. Unfortunately, the step c will be skipped
>> and the element that should be deleted by SyncRepCancelWait is remained.
>>
>> The easiest way to fix this is to swap the order of step b and step c. On the other hand,
>> let sigsetjmp clean up the queue may also be a good choice. What do you think?
>>
>> Attached the patch, any feedback is greatly appreciated.
>
> This is not right. It is in transaction commit so it is in a
> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
> cancel/die interrupts thus the ereport should return.

I read the relevant code, you are right. I called SyncRepWaitForLSN somewhere else,
but forgot to put it in a HOLD_INTERRUPTS and triggered an exception.

regards.


Dongming Liu

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message lingce.ldm 2019-10-30 06:27:46 Re: Problem with synchronous replication
Previous Message Dilip Kumar 2019-10-30 06:12:23 Re: [HACKERS] Block level parallel vacuum