Re: Backup command and functions can cause assertion failure and segmentation fault

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Backup command and functions can cause assertion failure and segmentation fault
Date: 2022-07-01 06:09:48
Message-ID: CAD21AoBkR611M6K+6i-xNDjmCd-=hFN_svNgRtvuezyjviw1sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Thu, Jun 30, 2022 at 12:29 PM Fujii Masao
<masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi,
>
> I found that the assertion failure and the segmentation fault could
> happen by running pg_backup_start(), pg_backup_stop() and BASE_BACKUP
> replication command, in v15 or before.
>
> Here is the procedure to reproduce the assertion failure.
>
> 1. Connect to the server as the REPLICATION user who is granted
> EXECUTE to run pg_backup_start() and pg_backup_stop().
>
> $ psql
> =# CREATE ROLE foo REPLICATION LOGIN;
> =# GRANT EXECUTE ON FUNCTION pg_backup_start TO foo;
> =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo;
> =# \q
>
> $ psql "replication=database user=foo dbname=postgres"
>
> 2. Run pg_backup_start() and pg_backup_stop().
>
> => SELECT pg_backup_start('test', true);
> => SELECT pg_backup_stop();
>
> 3. Run BASE_BACKUP replication command with smaller MAX_RATE so that
> it can take a long time to finish.
>
> => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);
>
> 4. Terminate the replication connection while it's running BASE_BACKUP.
>
> $ psql
> =# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE backend_type = 'walsender';
>
> This procedure can cause the following assertion failure.
>
> TRAP: FailedAssertion("XLogCtl->Insert.runningBackups > 0", File: "xlog.c", Line: 8779, PID: 69434)
> 0 postgres 0x000000010ab2ff7f ExceptionalCondition + 223
> 1 postgres 0x000000010a455126 do_pg_abort_backup + 102
> 2 postgres 0x000000010a8e13aa shmem_exit + 218
> 3 postgres 0x000000010a8e11ed proc_exit_prepare + 125
> 4 postgres 0x000000010a8e10f3 proc_exit + 19
> 5 postgres 0x000000010ab3171c errfinish + 1100
> 6 postgres 0x000000010a91fa80 ProcessInterrupts + 1376
> 7 postgres 0x000000010a886907 throttle + 359
> 8 postgres 0x000000010a88675d bbsink_throttle_archive_contents + 29
> 9 postgres 0x000000010a885aca bbsink_archive_contents + 154
> 10 postgres 0x000000010a885a2a bbsink_forward_archive_contents + 218
> 11 postgres 0x000000010a884a99 bbsink_progress_archive_contents + 89
> 12 postgres 0x000000010a881aba bbsink_archive_contents + 154
> 13 postgres 0x000000010a881598 sendFile + 1816
> 14 postgres 0x000000010a8806c5 sendDir + 3573
> 15 postgres 0x000000010a8805d9 sendDir + 3337
> 16 postgres 0x000000010a87e262 perform_base_backup + 1250
> 17 postgres 0x000000010a87c734 SendBaseBackup + 500
> 18 postgres 0x000000010a89a7f8 exec_replication_command + 1144
> 19 postgres 0x000000010a92319a PostgresMain + 2154
> 20 postgres 0x000000010a82b702 BackendRun + 50
> 21 postgres 0x000000010a82acfc BackendStartup + 524
> 22 postgres 0x000000010a829b2c ServerLoop + 716
> 23 postgres 0x000000010a827416 PostmasterMain + 6470
> 24 postgres 0x000000010a703e19 main + 809
> 25 libdyld.dylib 0x00007fff2072ff3d start + 1
>
>
> Here is the procedure to reproduce the segmentation fault.
>
> 1. Connect to the server as the REPLICATION user who is granted
> EXECUTE to run pg_backup_stop().
>
> $ psql
> =# CREATE ROLE foo REPLICATION LOGIN;
> =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo;
> =# \q
>
> $ psql "replication=database user=foo dbname=postgres"
>
> 2. Run BASE_BACKUP replication command with smaller MAX_RATE so that
> it can take a long time to finish.
>
> => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);
>
> 3. Press Ctrl-C to cancel BASE_BACKUP while it's running.
>
> 4. Run pg_backup_stop().
>
> => SELECT pg_backup_stop();
>
> This procedure can cause the following segmentation fault.
>
> LOG: server process (PID 69449) was terminated by signal 11: Segmentation fault: 11
> DETAIL: Failed process was running: SELECT pg_backup_stop();
>
>
> The root cause of these failures seems that sessionBackupState flag
> is not reset to SESSION_BACKUP_NONE even when BASE_BACKUP is aborted.
> So attached patch changes do_pg_abort_backup callback so that
> it resets sessionBackupState. I confirmed that, with the patch,
> those assertion failure and segmentation fault didn't happen.

The change looks good to me. I've also confirmed the change fixed the issues.

> But this change has one issue that; if BASE_BACKUP is run while
> a backup is already in progress in the session by pg_backup_start()
> and that session is terminated, the change causes XLogCtl->Insert.runningBackups
> to be decremented incorrectly. That is, XLogCtl->Insert.runningBackups
> is incremented by two by pg_backup_start() and BASE_BACKUP,
> but it's decremented only by one by the termination of the session.
>
> To address this issue, I think that we should disallow BASE_BACKUP
> to run while a backup is already in progress in the *same* session
> as we already do this for pg_backup_start(). Thought? I included
> the code to disallow that in the attached patch.

+1

@@ -233,6 +233,12 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
StringInfo labelfile;
StringInfo tblspc_map_file;
backup_manifest_info manifest;
+ SessionBackupState status = get_backup_status();
+
+ if (status == SESSION_BACKUP_RUNNING)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("a backup is already in progress in this session")));

I think we can move it to the beginning of SendBaseBackup() so we can
avoid bbsink initialization and cleanup in the error case.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2022-07-01 06:11:37 Re: making relfilenodes 56 bits
Previous Message Amit Kapila 2022-07-01 06:01:09 Re: Issue with pg_stat_subscription_stats