Re: How abnormal server shutdown could be detected by tests?

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: How abnormal server shutdown could be detected by tests?
Date: 2023-12-12 08:44:09
Message-ID: CAJpy0uAqpyWKSvYiCRUyVMJBT+4-MKX6CtA6D=-_jjaxTXwvhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello hackers,
>
> While studying bug #18158, I've come to the conclusion that the existing
> testing infrastructure is unable to detect abnormal situations. of some
> kind.
>
> Just a simple example:
> With Assert(0) injected in walsender (say:
> sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
> ), I observe the following:
> $ make -s check -C src/test/recovery PROVE_TESTS="t/012*"
>
> t/012_subtransactions.pl .. ok
> All tests successful.
>
> (The same with meson:
> 1/1 postgresql:recovery / recovery/012_subtransactions OK 3.24s 12 subtests passed)
>
> But:
> $ grep TRAP src/test/recovery/tmp_check/log/*
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373729
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373750
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373821
> src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373786
>
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
> ...
> 2023-12-09 03:23:20.210 UTC [375933] LOG: server process (PID 375975) was terminated by signal 6: Aborted
> 2023-12-09 03:23:20.210 UTC [375933] DETAIL: Failed process was running: START_REPLICATION 0/3000000 TIMELINE 3
> 2023-12-09 03:23:20.210 UTC [375933] LOG: terminating any other active server processes
> 2023-12-09 03:23:20.210 UTC [375933] LOG: abnormal database system shutdown
> 2023-12-09 03:23:20.211 UTC [375933] LOG: database system is shut down
> ...
>
> So the shutdown definitely was considered abnormal, but we missed that.
>
> With log_min_messages = DEBUG3, I can see in the log:
> 2023-12-09 03:26:50.145 UTC [377844] LOG: abnormal database system shutdown
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up orphaned dynamic shared memory with ID 2898643884
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up dynamic shared memory control segment with ID 3446966170
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(1): 2 callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] LOG: database system is shut down
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG: exit(1)
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(-1): 0 callbacks to make
>
> The postmaster process exits with exit code 1, but pg_ctl can't get the
> code and just reports that stop was completed successfully.
>

For what it's worth, there is another thread which stated the similar problem:
https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us

> Should this be improved? And if yes, how it can be done?
> Maybe postmaster shouldn't remove it's postmaster.pid when it exits
> abnormally to let pg_ctl know of it?
>

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-12-12 09:14:28 Re: Synchronizing slots from primary to standby
Previous Message Nicolas Lutic 2023-12-12 08:40:14 planner chooses incremental but not the best one