Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors

From: Robert Sanford <wobbet(at)gmail(dot)com>
To: wobbet(at)wobbet(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18076: Consistently receiving Signal 7 and Signal 11 errors
Date: 2023-08-29 21:37:13
Message-ID: CABa+nRtZoekMo5yFcF_CpcaMShtaBjexr24-xcnFOBS2b4bnAA@mail.gmail.com
Lists: pgsql-bugs

Note - one thing I read is that work_mem is not a server-wide cap but is
applied per connection (and in fact per sort or hash operation within a
query), so 8GB * 32 active connections = far more RAM than the machine has.

I decreased it to 128MB and re-ran the tests. The run got further than it
had previously but failed in the same place. Then I decreased it to 32MB
and the same thing happened.

Using the Azure VM monitor I can see that available memory has never
dropped below 50%, so work_mem may not be the issue.
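
For reference, work_mem can also be changed without editing postgresql.conf
between runs, which makes it quicker to test different values. A minimal
sketch (the 32MB value is just an example):

    -- only affects the current session, handy for one-off testing
    SET work_mem = '32MB';

    -- or persist a server-wide value in postgresql.auto.conf
    ALTER SYSTEM SET work_mem = '32MB';
    SELECT pg_reload_conf();   -- apply without a restart

    -- confirm the value the server is actually using
    SHOW work_mem;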

rjsjr

On Tue, Aug 29, 2023 at 3:31 PM PG Bug reporting form <
noreply(at)postgresql(dot)org> wrote:

> The following bug has been logged on the website:
>
> Bug reference: 18076
> Logged by: Robert Sanford
> Email address: wobbet(at)wobbet(dot)com
> PostgreSQL version: 15.4
> Operating system: Linux 5.15.0-1042-azure #49~20.04.1-Ubuntu aarch64
> Description:
>
> I installed v15 via Docker on my local dev machine (Intel x64), created
> the database schema, and ran a bunch of tests. All very good.
>
> I ran pg_dump on the database and then imported the dump into a new
> instance running on an ARM64 VM in Azure (8 cores, 64 GB RAM, Ubuntu
> 20.04.1). This is a full VM, not a Docker instance. The ARM64 CPUs are,
> according to Azure, physical CPUs and not virtual.
>
> Logs from the Azure machine show signal 7 and signal 11 errors consistently
> at the same place in the processing every time (log entries below), always
> while executing one specific function. The server restarts and has to
> recover from the WAL. None of the other functions show this behavior, and
> running the stored procedure from DBeaver does not cause the same issues.
>
> My ETL app is a .NET Core 6 console application running on my local machine
> and connecting to either the DB on my local machine or the VM in Azure. It
> connects via Npgsql. The call doesn't seem to be horribly expensive in terms
> of CPU usage and generally finishes in < 5s. That being said, there can be
> as many as 32 simultaneous connections running the specific stored procedure.
>
>
> After reading other bug reports involving signal 7, I set huge_pages=off and
> disabled features such as autoanalyze. Other non-default config entries
> include (a quick way to double-check these is sketched after the list):
> - max_connections = 100
> - shared_buffers = 8GB
> - work_mem = 8GB
> - maintenance_work_mem = 1GB
> - max_wal_size = 4GB
> - min_wal_size = 500MB
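>
> As a sanity check, the non-default settings the server is actually using
> can be pulled from pg_settings; this query is only a rough sketch:
>
>     select name, setting, unit, source
>       from pg_settings
>      where source not in ('default', 'override');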
>
> I am hopeful that someone will be able to help me out with this. Let me know
> if there are any other details I need to provide.
>
> Thanks!
>
> rjsjr
>
>
>
> Log entries...
> 2023-08-29 01:50:57.225 UTC [432307] LOG: server process (PID 433701) was
> terminated by signal 7: Bus error
> 2023-08-29 01:50:57.225 UTC [432307] DETAIL: Failed process was running:
> select * from process_route_compliance($1, $2, $3)
> 2023-08-29 01:50:57.225 UTC [432307] LOG: terminating any other active
> server processes
> 2023-08-29 01:50:57.231 UTC [432307] LOG: all server processes terminated;
> reinitializing
> 2023-08-29 01:50:57.441 UTC [438947] LOG: database system was interrupted;
> last known up at 2023-08-29 01:44:22 UTC
> 2023-08-29 01:50:57.547 UTC [438947] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2023-08-29 01:50:57.553 UTC [438947] LOG: redo starts at F/3CD6C068
> 2023-08-29 01:50:57.896 UTC [438950] FATAL: the database system is not yet
> accepting connections
> 2023-08-29 01:50:57.896 UTC [438950] DETAIL: Consistent recovery state has
> not been yet reached.
> 2023-08-29 01:50:58.612 UTC [438947] LOG: unexpected pageaddr E/99864000 in
> log segment 000000010000000F00000050, offset 8798208
> 2023-08-29 01:50:58.612 UTC [438947] LOG: redo done at F/50863FB0 system
> usage: CPU: user: 0.66 s, system: 0.32 s, elapsed: 1.05 s
> 2023-08-29 01:50:58.629 UTC [438948] LOG: checkpoint starting:
> end-of-recovery immediate wait
>
>
> 2023-08-29 01:51:07.996 UTC [438948] LOG: checkpoint complete: wrote 35042
> buffers (3.3%); 0 WAL file(s) added, 20 removed, 0 recycled; write=8.989 s,
> sync=0.197 s, total=9.372 s; sync files=254, longest=0.193 s, average=0.001
> s; distance=322528 kB, estimate=322528 kB
> 2023-08-29 01:51:08.027 UTC [432307] LOG: database system is ready to
> accept connections
> 2023-08-29 01:52:07.472 UTC [432307] LOG: server process (PID 440110) was
> terminated by signal 7: Bus error
> 2023-08-29 01:52:07.472 UTC [432307] DETAIL: Failed process was running:
> select * from process_route_compliance($1, $2, $3)
> 2023-08-29 01:52:07.472 UTC [432307] LOG: terminating any other active
> server processes
> 2023-08-29 01:52:07.500 UTC [432307] LOG: all server processes terminated;
> reinitializing
> 2023-08-29 01:52:07.689 UTC [440184] LOG: database system was interrupted;
> last known up at 2023-08-29 01:51:07 UTC
> 2023-08-29 01:52:07.797 UTC [440184] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2023-08-29 01:52:07.803 UTC [440184] LOG: redo starts at F/508640C0
> 2023-08-29 01:52:07.892 UTC [440184] LOG: unexpected pageaddr E/C8C98000 in
> log segment 000000010000000F00000051, offset 13205504
> 2023-08-29 01:52:07.892 UTC [440184] LOG: redo done at F/51C97F28 system
> usage: CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.08 s
>
> 2023-08-29 01:52:07.908 UTC [440185] LOG: checkpoint starting:
> end-of-recovery immediate wait
> 2023-08-29 01:52:08.085 UTC [440185] LOG: checkpoint complete: wrote 1216
> buffers (0.1%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.140 s,
> sync=0.011 s, total=0.181 s; sync files=39, longest=0.005 s, average=0.001
> s; distance=20688 kB, estimate=20688 kB
> 2023-08-29 01:52:08.104 UTC [432307] LOG: database system is ready to
> accept connections
> 2023-08-29 01:52:08.153 UTC [440187] FATAL: the database system is not yet
> accepting connections
> 2023-08-29 01:52:08.153 UTC [440187] DETAIL: Consistent recovery state has
> not been yet reached.
> 2023-08-29 01:52:09.716 UTC [432307] LOG: server process (PID 440192) was
> terminated by signal 11: Segmentation fault
> 2023-08-29 01:52:09.716 UTC [432307] DETAIL: Failed process was running:
> select * from process_route_compliance($1, $2, $3)
> 2023-08-29 01:52:09.716 UTC [432307] LOG: terminating any other active
> server processes
> 2023-08-29 01:52:09.720 UTC [432307] LOG: all server processes terminated;
> reinitializing
> 2023-08-29 01:52:09.873 UTC [440196] LOG: database system was interrupted;
> last known up at 2023-08-29 01:52:08 UTC
> 2023-08-29 01:52:09.984 UTC [440196] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2023-08-29 01:52:09.998 UTC [440196] LOG: redo starts at F/51C980C0
> 2023-08-29 01:52:10.001 UTC [440196] LOG: unexpected pageaddr E/C8CEA000 in
> log segment 000000010000000F00000051, offset 13541376
> 2023-08-29 01:52:10.001 UTC [440196] LOG: redo done at F/51CE9AB0 system
> usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 2023-08-29 01:52:10.024 UTC [440197] LOG: checkpoint starting:
> end-of-recovery immediate wait
> 2023-08-29 01:52:10.132 UTC [440197] LOG: checkpoint complete: wrote 58
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.077 s,
> sync=0.014 s, total=0.113 s; sync files=26, longest=0.005 s, average=0.001
> s; distance=328 kB, estimate=328 kB
> 2023-08-29 01:52:10.141 UTC [432307] LOG: database system is ready to
> accept connections
> 2023-08-29 01:52:11.501 UTC [432307] LOG: server process (PID 440272) was
> terminated by signal 11: Segmentation fault
> 2023-08-29 01:52:11.501 UTC [432307] DETAIL: Failed process was running:
> select * from process_route_compliance($1, $2, $3)
> 2023-08-29 01:52:11.501 UTC [432307] LOG: terminating any other active
> server processes
> 2023-08-29 01:52:11.504 UTC [432307] LOG: all server processes terminated;
> reinitializing
> 2023-08-29 01:52:11.648 UTC [440279] LOG: database system was interrupted;
> last known up at 2023-08-29 01:52:10 UTC
> 2023-08-29 01:52:11.748 UTC [440279] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2023-08-29 01:52:11.754 UTC [440279] LOG: redo starts at F/51CEA0C0
> 2023-08-29 01:52:11.759 UTC [440279] LOG: unexpected pageaddr E/C8D86000 in
> log segment 000000010000000F00000051, offset 14180352
> 2023-08-29 01:52:11.759 UTC [440279] LOG: redo done at F/51D85F90 system
> usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 2023-08-29 01:52:11.778 UTC [440280] LOG: checkpoint starting:
> end-of-recovery immediate wait
> 2023-08-29 01:52:11.874 UTC [440280] LOG: checkpoint complete: wrote 89
> buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.062 s,
> sync=0.016 s, total=0.103 s; sync files=31, longest=0.008 s, average=0.001
> s; distance=624 kB, estimate=624 kB
> 2023-08-29 01:52:11.883 UTC [432307] LOG: database system is ready to
> accept connections
> 2023-08-29 01:53:16.516 UTC [432307] LOG: server process (PID 440285) was
> terminated by signal 11: Segmentation fault
> 2023-08-29 01:53:16.516 UTC [432307] DETAIL: Failed process was running:
> select * from process_route_compliance($1, $2, $3)
>
>
