| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> | 
|---|---|
| To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> | 
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted. | 
| Date: | 2017-11-20 07:12:29 | 
| Message-ID: | CAD21AoAt=RJLVU3=DVZJDoC9WYOVErYLc1Ku=nzxx8JsAZXBfA@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Fri, Nov 17, 2017 at 11:00 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Nov 16, 2017 at 8:17 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> On Thu, Nov 16, 2017 at 1:11 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> +   /*
>>> +    * Quick exit if session is not keeping around a non-exclusive backup
>>> +    * already started.
>>> +    */
>>> +   if (sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
>>> +       return;
>>> I think that it would be more solid to use SESSION_BACKUP_NONE for the
>>> comparison, and complete the assertion after the quick exit as follows
>>> as this code path should never be taken for an exclusive backup:
>>
>> Agreed.
>
> I have spent some time doing an extra lookup with tests involving one
> and two sessions doing backups checking for failure code paths while
> waiting for archives:
> - One session with non-exclusive backup.
> - One session with exclusive backup.
> - One session is exclusive and the second is non-exclusive
> - Both are exclusive.
> Also double-checked on the way the logic around the cleanup callback
> and sessionBackupState during startup, and those look clean to me. One
> thing that was bothering me is that the callback
> nonexclusive_base_backup_cleanup is called after do_pg_start_backup()
> finishes. But between the moment sessionBackupState is set and the
> callback is registered there is no CHECK_FOR_INTERRUPTS or even elog()
> calls so even if the process starting a non-exclusive backup is tried
> to be terminated between the moment the session lock is set and the
> callback is registered things are handled correctly.
>
> So I think that we are basically safe for backups running with the SQL
> interface.
Thank you for double checking!
> However, things are not completely clean for base backups taken using
> the replication protocol. While monitoring more the code, I have
> noticed that perform_base_backup() calls do_pg_stop_backup() *without*
> taking any cleanup action. So if a base backup is interrupted, say
> with SIGTERM, while do_pg_stop_backup() is called and before the
> session lock is updated then it is possible to finish with
> inconsistent in-memory counters. Oops.
Good catch! I'd missed this case.
> No need to play with breakpoints and signals in this case, using
> something like that is enough to create inconsistent counters.
> --- a/src/backend/replication/basebackup.c
> +++ b/src/backend/replication/basebackup.c
> @@ -327,6 +327,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
>     }
>     PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
>
> +   elog(ERROR, "base backups don't decrement counters here, stupid!");
> +
>     endptr = do_pg_stop_backup(labelfile->data, !opt->nowait, &endtli);
>
> A simple fix for this one is to call do_pg_stop_backup() before
> PG_END_ENSURE_ERROR_CLEANUP(). By doing so, we also make the callback
> cleanup logic consistent with what is done for the SQL equivalent
> where the callback is removed after finishing going through
> do_pg_stop_backup(). A comment would be adapted here, say something
> like "finish the backup while still holding the cleanup callback to
> avoid inconsistent in-memory data should the this call fail before
> sessionBackupState is updated."
>
> For the start phase, the current logic is fine, because in the case of
> the SQL interface the cleanup callback is registered after finishing
> do_pg_start_backup().
>
> What do you think?
I agree with your approach. It makes sense to me.
Attached updated patch. Please review it.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
| Attachment | Content-Type | Size | 
|---|---|---|
| fix_do_pg_abort_backup_v7.patch | application/octet-stream | 3.2 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Fabien COELHO | 2017-11-20 07:45:32 | Re: [HACKERS] [WIP] Zipfian distribution in pgbench | 
| Previous Message | Michael Paquier | 2017-11-20 05:57:15 | Re: Allowing SSL connection of v11 client to v10 server with SCRAM channel binding |