Re: tableam vs. TOAST

From: Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: tableam vs. TOAST
Date: 2019-11-07 05:27:20
Message-ID: CANEvxPpKFwhaddt0TSn7tmBJjgY4SNs=AB61Rx5f7M6EaLXcPA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2019 at 4:48 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
wrote:

> From the stack trace shared by Prabhat, I understand that the checkpointer
> process panicked due to one of the following two reasons:
>
> 1) The fsync() failed on the very first attempt, and the failure was not
> caused by the file being dropped or truncated, i.e. fsync failed with an
> error other than ENOENT. Refer to ProcessSyncRequests() for details, esp.
> the code inside the for (failures = 0; !entry->canceled; failures++) loop.
>
> 2) The first attempt to fsync() failed with ENOENT because, just before
> the fsync function was called, the file being synced was dropped or
> truncated. When this happened, the checkpointer process called
> AbsorbSyncRequests() to update the entry for the deleted file in the hash
> table, but it seems AbsorbSyncRequests() failed to do so, which is why
> "entry->canceled" was never set to true. As a result, fsync() was
> attempted on the same file a second time and that failed too. Since the
> checkpointer process does not expect fsync on the same file to fail twice,
> it panicked. Again, please check ProcessSyncRequests() for details, esp.
> the code inside the for (failures = 0; !entry->canceled; failures++) loop.
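
To make the two failure paths concrete, here is a minimal, self-contained
sketch of the retry-and-absorb logic described above. It only approximates
the ProcessSyncRequests() loop (the fsync and absorb steps are stubbed out);
it is not the actual PostgreSQL source:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct PendingFsyncEntry
{
    const char *path;
    bool        canceled;   /* set once a SYNC_FORGET_REQUEST is absorbed */
} PendingFsyncEntry;

/* Simulated fsync of a dropped file: always fails with ENOENT. */
static int
sync_file(const char *path)
{
    (void) path;
    errno = ENOENT;
    return -1;
}

/* Simulated AbsorbSyncRequests(): the backend's forget request arrives
 * here and cancels the pending entry. */
static void
absorb_sync_requests(PendingFsyncEntry *entry)
{
    entry->canceled = true;
}

static void
process_entry(PendingFsyncEntry *entry)
{
    for (int failures = 0; !entry->canceled; failures++)
    {
        if (sync_file(entry->path) == 0)
            break;              /* fsync succeeded */

        /* Reason #1: an error other than ENOENT, or a second failure on
         * the same file, makes the checkpointer PANIC (abort() here). */
        if (errno != ENOENT || failures > 0)
        {
            fprintf(stderr, "PANIC: could not fsync \"%s\"\n", entry->path);
            abort();
        }

        /* Reason #2: ENOENT on the first attempt.  Absorb pending
         * requests; if the file really was dropped, the forget request
         * cancels the entry and the loop exits.  If that does not happen,
         * the retry above fails again and we hit the PANIC. */
        absorb_sync_requests(entry);
    }
}

int
main(void)
{
    PendingFsyncEntry entry = {"base/13593/16384", false};

    process_entry(&entry);
    printf("entry for \"%s\" canceled cleanly, no PANIC\n", entry.path);
    return 0;
}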
>
> Now, the point of discussion here is: which one of the above two reasons
> could be the cause of the panic? According to me, point #2 does not look
> like a plausible reason. That is because, just before a file is unlinked,
> the backend first sends a SYNC_FORGET_REQUEST to the checkpointer process,
> which marks the entry for this file in the hash table as cancelled, and
> only then removes the file. With this understanding, it is hard to believe
> that once the first fsync() for a file has failed with ENOENT, a call to
> AbsorbSyncRequests() made immediately afterwards would not update the
> entry for this file in the hash table, because the backend only removes
> the file once it has successfully sent the SYNC_FORGET_REQUEST for that
> file to the checkpointer process. See
> mdunlinkfork()->register_forget_request() for details on this.
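
And a similarly minimal sketch of the unlink ordering described above (the
function names here are illustrative stand-ins, not the actual
mdunlinkfork()/register_forget_request() code): the forget request goes to
the checkpointer before the file is removed, so a pending fsync entry for
the file can always be cancelled.

#include <stdio.h>
#include <unistd.h>

/* Stand-in for register_forget_request(): in the real code this queues a
 * SYNC_FORGET_REQUEST with the checkpointer and does not return until the
 * request has been accepted. */
static void
forget_pending_fsync(const char *path)
{
    printf("forget request sent for \"%s\"\n", path);
}

/* The backend removes the file only after the forget request has been
 * sent, so a later ENOENT from fsync should always be resolvable by
 * AbsorbSyncRequests(). */
static void
drop_relation_file(const char *path)
{
    forget_pending_fsync(path);

    if (unlink(path) != 0)
        perror("unlink");
}

int
main(void)
{
    drop_relation_file("base/13593/16384");
    return 0;
}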
>
> So, I think the first point mentioned above is the probable reason for the
> checkpointer process panicking. But, having said all that, it would be
> good to have some evidence for it, which could be obtained by inspecting
> the server logfile.
>
> Prabhat, is it possible for you to re-run the test case with
> log_min_messages set to DEBUG1 and save the logfile for the run that
> crashes? That would help us know whether the fsync was performed just once
> or twice, i.e. whether point #1 or point #2 is the reason for the panic.
>

I have run the same test cases, with and without the patch, multiple times
with the debug option (log_min_messages = DEBUG1), but this time I am not
able to reproduce the crash.
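
For reference, the setting used for these runs in postgresql.conf (only
log_min_messages comes from the discussion above; logging_collector is
assumed here just so the output lands in a logfile that can be saved):

# capture DEBUG1-level messages, including the per-file fsync activity
log_min_messages = DEBUG1

# assumed: route server output into a logfile that can be saved afterwards
logging_collector = on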

>
> Thanks,
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Thu, Oct 31, 2019 at 10:26 AM Prabhat Sahu <
> prabhat(dot)sahu(at)enterprisedb(dot)com> wrote:
>
>>
>>
>> On Wed, Oct 30, 2019 at 9:46 PM Robert Haas <robertmhaas(at)gmail(dot)com>
>> wrote:
>>
>>> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <
>>> prabhat(dot)sahu(at)enterprisedb(dot)com> wrote:
>>>
>>>> While testing the TOAST patch (PG + v7 patch), I found the server crash below.
>>>> System configuration:
>>>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>>>
>>>> This issue is not frequently reproducible; we need to repeat the same
>>>> test case multiple times.
>>>>
>>>
>>> I wonder if this is an independent bug, because the backtrace doesn't
>>> look like it's related to the stuff this is changing. Your report doesn't
>>> specify whether you can also reproduce the problem without the patch, which
>>> is something that you should always check before reporting a bug in a
>>> particular patch.
>>>
>>
>> Hi Robert,
>>
>> My sincere apologies for not describing the issue in more detail.
>> I have run the same case against both PG HEAD and HEAD + Patch multiple
>> times (7, 10, 20 times), and I found that the issue does not occur on
>> HEAD while the same case is reproducible on HEAD + Patch (again, I was
>> not sure whether the backtrace is related to the patch or not).
>>
>>
>>
>>> --
>>> Robert Haas
>>> EnterpriseDB: http://www.enterprisedb.com
>>> The Enterprise PostgreSQL Company
>>>
>>
>>
>> --
>>
>> With Regards,
>>
>> Prabhat Kumar Sahu
>> Skype ID: prabhat.sahu1984
>> EnterpriseDB Software India Pvt. Ltd.
>>
>> The Postgres Database Company
>>
>

--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.

The Postgres Database Company
