Re: Backend crash on non-exclusive backup cancel

From: David Steele <david(at)pgmasters(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Backend crash on non-exclusive backup cancel
Date: 2017-03-01 03:44:53
Message-ID: 6a8f9d7e-ce4e-d850-12d3-7787815447ea@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On 2/28/17 9:08 PM, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david(at)pgmasters(dot)net> wrote:
>> On 2/27/17 10:05 PM, Michael Paquier wrote:
>>> That's not necessarily true, I can see a stop backup able to finish as
>>> well by issuing a cancel request. It seems to me that we just need to
>>> have the shmem information updated at the same time as the
>>> session-level switches for consistency and we're good. The
>>> inconsistency in places when updating the session-level flags and the
>>> shmem-level flags is what is causing harm.
>>
>> I'm sure this could be done but it will require quite a bit of
>> refactoring and I'm not sure that it's worth it. In my mind it would be
>> enough to document that cancelled backups should be considered invalid
>> whether the stop backup record was written or not. However, I'm willing
>> to go with the majority opinion.
>
> I would think that addressing the problem is the way to go, hitting an
> assertion in this code path means that we are doing something wrong so
> simply removing it does not sound like a correct answer to me.
>
>>>> If that makes sense I'm happy to work up a patch. This is definitely an
>>>> edge case and I seriously doubt it is causing any issues in the field.
>>>
>>> <...>
>>>
>>> David, are you willing to write a patch or should I? Changing
>>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>>> would make the code more readable as well IMO.
>>
>> I would like to see if anyone else weighs in on this first, but yes I am
>> planning to write a patch. Agreed on the enum.
>
> This was itching me yesterday so I wrote a patch able to address this
> problem...

I know that feeling.

> At least it can be registered to the CF on time and give it
> more visibility.

Indeed.

> By the way, I have noticed a second bug in the
> current logic while looking at the problem you have reported:
> 1) Start backup in session 1:
> =# select pg_start_backup('popo');
> pg_start_backup
> -----------------
> 0/2000028
> (1 row)
> 2) pg_stop_backup() in session 2
> 3) And now when trying to work on backups with session 1:
> =# select pg_start_backup('popo');
> ERROR: 55000: a backup is already in progress in this session
> LOCATION: pg_start_backup, xlogfuncs.c:87
> =# select pg_stop_backup();
> ERROR: 55000: exclusive backup not in progress
> LOCATION: do_pg_stop_backup, xlog.c:10642

Yeah, that doesn't look good.

> So the handling around exclusive_backup_running is broken now as it is
> possible to lock a session easily when handling backups. And the root
> of the problem is that checks on exclusive_backup_running are not
> necessary. I have fixed as well this problem as per the attached. Note
> however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
> attached when doing checks with pg_stop_backup_v2() for a
> non-exclusive backup run. It is also useful for debugging.

Excellent! I'll have a look, and thanks for working up a patch.

--
-David
david(at)pgmasters(dot)net

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message rick 2017-03-01 16:08:41 BUG #14572: pgadmin restore command path error
Previous Message Michael Paquier 2017-03-01 02:08:55 Re: Backend crash on non-exclusive backup cancel

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-03-01 03:45:53 Re: SQL/JSON in PostgreSQL
Previous Message Kuntal Ghosh 2017-03-01 03:43:15 Re: Performance degradation in TPC-H Q18