standby replication server throws invalid memory alloc request size , does not start up

From: Vijaykumar Jain <vjain(at)opentable(dot)com>
To: "pgsql-admin(at)lists(dot)postgresql(dot)org" <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: standby replication server throws invalid memory alloc request size , does not start up
Date: 2018-06-28 13:17:52
Message-ID: 1E4C5445-F307-4AE9-B1B9-C85A38B0D930@opentable.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi All,

This is my first postgres query to admin list, so if I am not following the right standards for asking the question, pls let me know 😊

The problem:

I have a postgres cluster as

A (primary)-> streaming replication -> B(hot_standby=on)

We had a power outage in one of the data centers, and when we got back, one of the databases servers (B the standby node) seem to show weird errors and is not starting up. A recovered fine, and it running fine.

--------- logs

2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-06-25 10:57:04 UTC WARNING: terminating connection because of crash of another server process
2018-06-25 10:57:04 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-06-25 10:57:04 UTC LOG: database system is shut down
2018-06-27 16:59:28 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432
2018-06-27 16:59:28 UTC LOG: listening on IPv6 address "::", port 5432
2018-06-27 16:59:28 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2018-06-27 16:59:28 UTC LOG: database system was interrupted while in recovery at log time 2018-06-25 10:52:21 UTC
2018-06-27 16:59:28 UTC HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2018-06-27 16:59:28 UTC LOG: entering standby mode
2018-06-27 16:59:28 UTC LOG: recovered replication state of node 1 to 63/89B52A98
2018-06-27 16:59:28 UTC LOG: redo starts at 9A/6F1B3888
2018-06-27 16:59:28 UTC LOG: consistent recovery state reached at 9A/6F25BFF0
2018-06-27 16:59:28 UTC FATAL: invalid memory alloc request size 1768185856
2018-06-27 16:59:28 UTC LOG: database system is ready to accept read only connections
2018-06-27 16:59:28 UTC LOG: startup process (PID 11829) exited with exit code 1
2018-06-27 16:59:28 UTC LOG: terminating any other active server processes
2018-06-27 16:59:28 UTC LOG: database system is shut down
pg_ctl: could not start server
Examine the log output.

-------what I tried
I did a lot of searching around “invalid memory alloc request” it mostly points to some data error,
https://www.postgresql.org/message-id/AANLkTik88wGQ6h-xFHJBU_Fj_CnXK5LX7aBws_2oANKP%40mail.gmail.com

mem checks and disks are ok.

I took a filesystem snapshot, synced it onto another server and tried to reload it, it gave the same error.

Now to ensure there was not real data causing this issue, I took a full dump of the primary db (A) using pg_dumpall and restored in onto the test server. I was able to load the db and was able to query the tables without issues.
So I am confused why the standby is behaving weird after the power outage.

Just FYI, I spun up a new node ( C ) as a replacement for the problem standby(B) with pg_baseback and it is working fine. Can query all tables without problems. My only concern was do I need to worry about this error showing up again?

Most the googling points to some operation and finding out errors in some operation performed etc,
https://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/ - 1
https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html - 2

I ran that function (in 1), could not find bad data rows.

Also, I tried to bump up (memory to 32GB, cpu to 8 cpu and shmmax to 18446744073692774399 to figure out is there was any resource constraint that I could temp bump up and then debug what went wrong, but it did not start up.

$ /usr/lib/postgresql/10/bin/postgres -V
postgres (PostgreSQL) 10.4 (Ubuntu 10.4-2.pgdg16.04+1)

$ grep -v "#" /etc/postgresql/10/main/postgresql.conf | sed -e '/^$/d'
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'
log_timezone = UTC
stats_temp_directory = '/var/run/postgresql/10-main.pg_stat_tmp'
datestyle = 'iso, mdy'
timezone = UTC
default_text_search_config = 'pg_catalog.english'
log_directory = pg_log
archive_mode = on
maintenance_work_mem = 16MB
max_wal_size = 1GB
max_stack_depth = 2MB
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_scale_factor = 0.1
shared_preload_libraries = pg_stat_statements
work_mem = 4MB
max_wal_senders = 10
temp_buffers = 16MB
archive_command = '/bin/true'
effective_cache_size = 256MB
max_replication_slots = 10
hot_standby = on
wal_keep_segments = 64
wal_level = hot_standby
listen_addresses = '*'
log_min_duration_statement = 10000
max_worker_processes = 18

sysctl:
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399

$ free -m
total used free shared buff/cache available
Mem: 16047 134 14469 8 1442 15620
Swap: 975 0 975
$ nproc
4

Thanks,
Vijay

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Scott Ribe 2018-06-28 13:26:28 Re: standby replication server throws invalid memory alloc request size , does not start up
Previous Message Alexey Bashtanov 2018-06-28 13:10:51 Re: Server Crash