Re: Backend crash on non-exclusive backup cancel

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Backend crash on non-exclusive backup cancel
Date: 2017-03-01 02:08:55
Message-ID: CAB7nPqR9xvq_wYKFjUYF1CECFLRbe6xGhGWaCv_yE8pZU71Ztw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-bugs pgsql-hackers

On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david(at)pgmasters(dot)net> wrote:
> On 2/27/17 10:05 PM, Michael Paquier wrote:
>> That's not necessarily true, I can see a stop backup able to finish as
>> well by issuing a cancel request. It seems to me that we just need to
>> have the shmem information updated at the same time as the
>> session-level switches for consistency and we're good. The
>> inconsistency in places when updating the session-level flags and the
>> shmem-level flags is what is causing harm.
>
> I'm sure this could be done but it will require quite a bit of
> refactoring and I'm not sure that it's worth it. In my mind it would be
> enough to document that cancelled backups should be considered invalid
> whether the stop backup record was written or not. However, I'm willing
> to go with the majority opinion.

I would think that addressing the problem is the way to go, hitting an
assertion in this code path means that we are doing something wrong so
simply removing it does not sound like a correct answer to me.

>>> If that makes sense I'm happy to work up a patch. This is definitely an
>>> edge case and I seriously doubt it is causing any issues in the field.
>>
>> <...>
>>
>> David, are you willing to write a patch or should I? Changing
>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>> would make the code more readable as well IMO.
>
> I would like to see if anyone else weighs in on this first, but yes I am
> planning to write a patch. Agreed on the enum.

This was itching me yesterday so I wrote a patch able to address this
problem... At least it can be registered to the CF on time and give it
more visibility. By the way, I have noticed a second bug in the
current logic while looking at the problem you have reported:
1) Start backup in session 1:
=# select pg_start_backup('popo');
pg_start_backup
-----------------
0/2000028
(1 row)
2) pg_stop_backup() in session 2
3) And now when trying to work on backups with session 1:
=# select pg_start_backup('popo');
ERROR: 55000: a backup is already in progress in this session
LOCATION: pg_start_backup, xlogfuncs.c:87
=# select pg_stop_backup();
ERROR: 55000: exclusive backup not in progress
LOCATION: do_pg_stop_backup, xlog.c:10642

So the handling around exclusive_backup_running is broken now as it is
possible to lock a session easily when handling backups. And the root
of the problem is that checks on exclusive_backup_running are not
necessary. I have fixed as well this problem as per the attached. Note
however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
attached when doing checks with pg_stop_backup_v2() for a
non-exclusive backup run. It is also useful for debugging.
--
Michael

Attachment Content-Type Size
backup-session-locks-fixes.patch application/octet-stream 7.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message 钱新林 2017-03-01 02:13:17 Re: help to identify the reason that extension's C function returns array get segmentation fault
Previous Message Haribabu Kommi 2017-03-01 01:59:46 Refactor handling of database attributes between pg_dump and pg_dumpall

Browse pgsql-bugs by date

  From Date Subject
Next Message David Steele 2017-03-01 03:44:53 Re: Backend crash on non-exclusive backup cancel
Previous Message Stephen Frost 2017-02-28 15:53:55 Re: BUG #14543: libpq fails with group readable ssl keys