Re: BUG #16833: postgresql 13.1 process crash every hour

From: Alex F <phoedos16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16833: postgresql 13.1 process crash every hour
Date: 2021-05-12 10:04:13
Message-ID: CAGbr_zVNNrpDuDTi5eNYrUjg_8SQ_anggyfR6C0=Lx_3yv=jbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dear team, previously I wrote about postgresql crash (ref #16833)
Unfortunately the issue is still present. After databases reload from
backup I see that crash happens once 1-2 weeks.
It's almost impossible to activate debugging because daily logs size in
regular mode > 100gb/day with 10+million queries per day.

Together with colleagues I think we are able to identify a crash case. This
is a test environment running on pg13.2 with 10 databases inside (1 working
test server = 1 database)
Some tables inside the database are always updated with data product
crawler. Based on these tables in reporting purposes we build concurrent
materialized views which can build for 9-25min.

Some log example for such view build:
2021-05-03 06:14:55 UTC [6220]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
LOG: duration: 0.031 ms parse <unnamed>: REFRESH MATERIALIZED VIEW
CONCURRENTLY project.product_original_mv
2021-05-03 06:14:55 UTC [6220]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
LOG: duration: 0.038 ms bind <unnamed>: REFRESH MATERIALIZED VIEW
CONCURRENTLY project.product_original_mv
2021-05-03 06:16:21 UTC [6220]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
STATEMENT: REFRESH MATERIALIZED VIEW CONCURRENTLY
project.product_original_mv
...
2021-05-03 06:21:22 UTC [6220]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
STATEMENT: REFRESH MATERIALIZED VIEW CONCURRENTLY
project.product_original_mv
2021-05-03 06:37:42 UTC [6220]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
LOG: duration: 1366910.618 ms execute <unnamed>: REFRESH MATERIALIZED
VIEW CONCURRENTLY project.product_original_mv

From example log cut you can see that STATEMENT [6220] process logs job
just every second until it's end.

Now suggest a review of the crash case. MATERIALIZED VIEW build started
work 18 min 12s before database crash
2021-05-10 06:45:58 UTC [28907]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
LOG: duration: 0.039 ms parse <unnamed>: REFRESH MATERIALIZED VIEW
CONCURRENTLY project.product_original_mv
2021-05-10 06:45:58 UTC [28907]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
LOG: duration: 0.039 ms bind <unnamed>: REFRESH MATERIALIZED VIEW
CONCURRENTLY project.product_original_mv
2021-05-10 06:47:06 UTC [28907]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
STATEMENT: REFRESH MATERIALIZED VIEW CONCURRENTLY
project.product_original_mv
...with last logged entry for [28907] process
2021-05-10 06:52:35 UTC [28907]:
user=servlet,db=feature5042021,app=feature5-vertx,vertx-DataStore,client=vm_ip
STATEMENT: REFRESH MATERIALIZED VIEW CONCURRENTLY
project.product_original_mv
...then [28907] process do not log anymore
...and crash happened here after 11m 35s from last [28907] log entry
2021-05-10 07:04:10 UTC [1]: user=,db=,app=,client= LOG: server process
(PID 28907) was terminated by signal 11: Segmentation fault
2021-05-10 07:04:10 UTC [1]: user=,db=,app=,client= DETAIL: Failed process
was running: REFRESH MATERIALIZED VIEW CONCURRENTLY
project.product_original_mv
2021-05-10 07:04:10 UTC [1]: user=,db=,app=,client= LOG: terminating any
other active server processes
2021-05-10 07:04:10 UTC [32461]:
user=servlet,db=staging,app=staging-servlet,servlet-PostgresClient,client=vm_ip1
WARNING: terminating connection because of crash of another server process
2021-05-10 07:04:10 UTC [32461]:
user=servlet,db=staging,app=staging-servlet,servlet-PostgresClient,client=vm_ip1
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
2021-05-10 07:04:10 UTC [32461]:
user=servlet,db=staging,app=staging-servlet,servlet-PostgresClient,client=vm_ip1
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

One more important note. Production system run on pg12.2 and have
absolutely the same project logic with the same crawlers, table updates and
views but work stable for a long time without any crash
Test system with log examples shown run on pg13.2 and crash every week.

Thanks for your support!

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Valentin Nastasache 2021-05-12 11:45:55 Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events
Previous Message Tom Lane 2021-05-12 03:25:57 Re: BUG #15990: PROCEDURE throws "SQL Error [XX000]: ERROR: no known snapshots" with PostGIS geometries