Re: tableam vs. TOAST

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: tableam vs. TOAST
Date: 2019-11-05 11:18:12
Message-ID: CAE9k0PmDEfeyRtgvd=C2S0EwUN6+9bwt6ozWdGco9+Y5euuhcw@mail.gmail.com
Lists: pgsql-hackers

From the stack trace shared by Prabhat, I understand that the checkpointer
process panicked for one of the following two reasons:

1) The very first fsync() attempt failed, and not because the file had been
dropped or truncated, i.e. fsync() failed with an error other than ENOENT.
Refer to ProcessSyncRequests() for details, especially the code inside the
for (failures = 0; !entry->canceled; failures++) loop.

2) The first fsync() attempt failed with ENOENT because the file being
synced was dropped or truncated just before fsync() was called. When this
happened, the checkpointer process called AbsorbSyncRequests() to update the
hash table entry for the deleted file, but AbsorbSyncRequests() apparently
failed to do so, and so "entry->canceled" was never set to true. As a
result, fsync() was attempted on the same file a second time, and that
failed too. Since the checkpointer process does not expect fsync() on the
same file to fail twice, it panicked. Again, please check
ProcessSyncRequests() for details, especially the code inside the
for (failures = 0; !entry->canceled; failures++) loop.
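
To make the two cases concrete, here is a rough standalone model of that
retry loop. It is only a sketch and not the actual sync.c code: SimEntry,
sim_fsync(), sim_absorb(), sim_process_entry() and the SyncOutcome values
are all names I made up for illustration, the "PANIC" is just a return
value, and only ENOENT is treated as "file possibly deleted".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome codes; the real code reports errors instead. */
typedef enum { SYNC_OK, SYNC_SKIPPED_CANCELED, SYNC_PANIC } SyncOutcome;

/* Simplified, made-up stand-in for a pending-sync hash table entry. */
typedef struct
{
    bool canceled;        /* set once a forget request has been absorbed */
    int  errnos[2];       /* scripted fsync results: 0 = success, else errno */
    bool forget_arrives;  /* will absorbing requests find a forget request? */
    int  attempts;        /* number of fsync attempts made so far */
} SimEntry;

/* Pretend fsync(): replays the scripted result for this attempt. */
static int
sim_fsync(SimEntry *e)
{
    int err;

    assert(e->attempts < 2);    /* the loop never tries more than twice */
    err = e->errnos[e->attempts++];
    if (err != 0)
    {
        errno = err;
        return -1;
    }
    return 0;
}

/* Pretend absorb step: cancels the entry if a forget request is waiting. */
static void
sim_absorb(SimEntry *e)
{
    if (e->forget_arrives)
        e->canceled = true;
}

/* Same shape as the loop discussed above, for a single entry. */
static SyncOutcome
sim_process_entry(SimEntry *e)
{
    int failures;

    for (failures = 0; !e->canceled; failures++)
    {
        if (sim_fsync(e) == 0)
            return SYNC_OK;

        /* Case 1: error other than ENOENT. Case 2: second failure. */
        if (errno != ENOENT || failures > 0)
            return SYNC_PANIC;

        /* First ENOENT: look for a forget request before retrying. */
        sim_absorb(e);
    }
    return SYNC_SKIPPED_CANCELED;
}
```

In this model, an entry scripted as { EIO, 0 } panics after one attempt
(point #1), and an entry scripted as { ENOENT, ENOENT } with no forget
request panics after two attempts (point #2). If the forget request does
arrive, the entry is canceled after the first ENOENT and there is no second
attempt at all.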

Now, the point of discussion here is: which of the above two reasons
could be the cause of the panic? In my opinion, point #2 is unlikely to be
the reason. Before a file is unlinked, the backend first sends a
SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for
this file in the hash table as canceled, and only then removes the file.
Given that, it is hard to believe that once the first fsync() for a file
has failed with ENOENT, a call to AbsorbSyncRequests() made immediately
afterwards would not update the entry for this file in the hash table,
because the backend only removes the file once it has successfully sent the
SYNC_FORGET_REQUEST for that file to the checkpointer process. See
mdunlinkfork()->register_forget_request() for details on this.
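
The ordering argument can be written down as a tiny state model. Again,
this is only a sketch under my own assumptions: DropState, sim_backend_drop(),
drop_absorb() and sim_retry_hits_missing_file() are hypothetical names, and
the shared request queue is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up model of the state shared by a backend and the checkpointer. */
typedef struct
{
    bool file_exists;     /* the file is still present on disk */
    bool forget_queued;   /* a forget request is waiting to be absorbed */
    bool entry_canceled;  /* the checkpointer's entry is marked canceled */
} DropState;

/* Backend side: queue the forget request first, then remove the file. */
static void
sim_backend_drop(DropState *s)
{
    s->forget_queued = true;   /* step 1: stands in for SYNC_FORGET_REQUEST */
    s->file_exists = false;    /* step 2: stands in for the unlink */
}

/* Checkpointer side: absorb any waiting forget request. */
static void
drop_absorb(DropState *s)
{
    if (s->forget_queued)
    {
        s->forget_queued = false;
        s->entry_canceled = true;
    }
}

/*
 * Returns true if the checkpointer would go on to retry fsync on a file
 * that is already gone, i.e. the situation described in point #2.
 */
static bool
sim_retry_hits_missing_file(DropState *s)
{
    if (s->file_exists)
        return false;          /* first fsync does not see a missing file */

    drop_absorb(s);            /* first fsync "failed"; absorb requests */
    return !s->entry_canceled;
}
```

With the backend following the order above, the missing file always comes
with a queued forget request, so the function returns false. It returns true
only if the model is put into a state where the file is gone but no forget
request was ever queued, which is exactly what the send-then-unlink ordering
is supposed to rule out.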

So, I think point #1 above is the more probable reason for the
checkpointer panic. Having said that, it would be good to have some
evidence for it, which we can get by inspecting the server logfile.

Prabhat, is it possible for you to re-run the test-case with
log_min_messages set to DEBUG1 and save the logfile for the run that
crashes? This would help in knowing whether fsync() was performed just once
or twice, i.e. whether point #1 or point #2 is the reason for the panic.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Oct 31, 2019 at 10:26 AM Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>
wrote:

>
>
> On Wed, Oct 30, 2019 at 9:46 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <
>> prabhat(dot)sahu(at)enterprisedb(dot)com> wrote:
>>
>>> While testing the TOAST patch (PG + v7 patch) I found the server crash
>>> below.
>>> System configuration:
>>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>>
>>> This issue is not reproducible every time; we need to repeat the same
>>> testcase multiple times.
>>>
>>
>> I wonder if this is an independent bug, because the backtrace doesn't
>> look like it's related to the stuff this is changing. Your report doesn't
>> specify whether you can also reproduce the problem without the patch, which
>> is something that you should always check before reporting a bug in a
>> particular patch.
>>
>
> Hi Robert,
>
> My sincere apologies for not describing the issue in more detail.
> I have run the same case against both PG HEAD and HEAD+Patch multiple
> times (7, 10, and 20 runs). The case did not fail on HEAD, but it is
> reproducible on HEAD+Patch (again, I was not sure whether the backtrace
> is related to the patch or not).
>
>
>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
> --
>
> With Regards,
>
> Prabhat Kumar Sahu
> Skype ID: prabhat.sahu1984
> EnterpriseDB Software India Pvt. Ltd.
>
> The Postgres Database Company
>
