Re: psql: FATAL: the database system is starting up

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom K <tomkcpr(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: psql: FATAL: the database system is starting up
Date: 2019-05-28 13:53:32
Message-ID: 5b7e0d74-8d6e-c3b8-fb62-3aab1fe0145d@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 5/27/19 9:59 PM, Tom K wrote:
> Hey Guy's,
>
>
> I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> RHEL 7.6 VM's.  Every now and then the underlying storage crashes taking
> out the cluster.  On recovery, PostgreSQL tends to come up while other
> databases just blow up.  That is until now.

The above is one master and two standby servers connected via streaming
replication, correct?

Is there WAL archiving set up?

>
> After the most recent crash 2-3 weeks ago, the cluster is now running
> into this message but I'm not able to make heads or tails out of why
> it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that more recent?

When you refer to history files below are you talking about WAL files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

>
> |-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf2019-05-2308:26:34.561EDT
> [10101]LOG:listening onIPv4 address "192.168.0.108",port
> 54322019-05-2308:26:34.565EDT [10101]LOG:listening onUnix socket
> "./.s.PGSQL.5432"2019-05-2308:26:34.591EDT [10101]LOG:redirecting log
> output tologging collector process 2019-05-2308:26:34.591EDT
> [10101]HINT:Future log output will appear indirectory
> "log".^C-bash-4.2/usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf --hot_standby=on
> --listen_addresses=192.168.0.108 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10
> --max_connections=1002019-05-2308:26:45.707EDT [10105]LOG:listening
> onIPv4 address "192.168.0.108",port 54322019-05-2308:26:45.715EDT
> [10105]LOG:listening onUnix socket
> "./.s.PGSQL.5432"2019-05-2308:26:45.740EDT [10105]LOG:redirecting log
> output tologging collector process 2019-05-2308:26:45.740EDT
> [10105]HINT:Future log output will appear indirectory
> "log".^C-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf --hot_standby=on
> --listen_addresses=192.168.0.1 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10
> --max_connections=1002019-05-2308:29:36.014EDT [10188]LOG:listening
> onIPv4 address "192.168.0.108",port 54322019-05-2308:29:36.017EDT
> [10188]LOG:listening onUnix socket
> "./.s.PGSQL.5432"2019-05-2308:29:36.040EDT [10188]LOG:redirecting log
> output tologging collector process 2019-05-2308:29:36.040EDT
> [10188]HINT:Future log output will appear indirectory "log".|
>
> The database logs only print out this:
>
> |2019-05-2308:25:44.476EDT [10088]HINT:The databaseserver will regularly
> poll the pg_wal subdirectory tocheckforfiles placed
> there.2019-05-2308:25:44.477EDT [10088]LOG:entering standby mode
> 2019-05-2308:25:50.209EDT [10086]LOG:received fast shutdownrequest
> 2019-05-2308:25:50.215EDT [10089]LOG:shutting down
> 2019-05-2308:25:50.221EDT [10086]LOG:databasesystem isshut down
> 2019-05-2308:26:11.560EDT [10094]LOG:databasesystem was shut down
> inrecovery at 2019-05-2308:25:50EDT 2019-05-2308:26:11.560EDT
> [10094]WARNING:recovery command file"recovery.conf"specified neither
> primary_conninfo nor restore_command 2019-05-2308:26:11.560EDT
> [10094]HINT:The databaseserver will regularly poll the pg_wal
> subdirectory tocheckforfiles placed there.2019-05-2308:26:11.562EDT
> [10094]LOG:entering standby mode 2019-05-2308:26:17.817EDT
> [10092]LOG:received fast shutdownrequest 2019-05-2308:26:17.824EDT
> [10095]LOG:shutting down 2019-05-2308:26:17.831EDT
> [10092]LOG:databasesystem isshut down 2019-05-2308:26:34.596EDT
> [10103]LOG:databasesystem was shut down inrecovery at
> 2019-05-2308:26:17EDT 2019-05-2308:26:34.596EDT [10103]WARNING:recovery
> command file"recovery.conf"specified neither primary_conninfo nor
> restore_command 2019-05-2308:26:34.596EDT [10103]HINT:The databaseserver
> will regularly poll the pg_wal subdirectory tocheckforfiles placed
> there.2019-05-2308:26:34.598EDT [10103]LOG:entering standby mode
> 2019-05-2308:26:43.025EDT [10101]LOG:received fast shutdownrequest
> 2019-05-2308:26:43.032EDT [10104]LOG:shutting down
> 2019-05-2308:26:43.038EDT [10101]LOG:databasesystem isshut down
> 2019-05-2308:26:45.748EDT [10107]LOG:databasesystem was shut down
> inrecovery at 2019-05-2308:26:43EDT 2019-05-2308:26:45.748EDT
> [10107]WARNING:recovery command file"recovery.conf"specified neither
> primary_conninfo nor restore_command 2019-05-2308:26:45.748EDT
> [10107]HINT:The databaseserver will regularly poll the pg_wal
> subdirectory tocheckforfiles placed there.|
>
> and when connecting, I get this cryptic message:
>
> |[root(at)psql01 log]#psql -U postgres -h 192.168.0.108psql:FATAL:the
> databasesystem isstarting up [root(at)psql01 log]#|
>
> There's no space issue here.  The cluster stays in recovery. This state
> doesn't change and the DB is inaccessible.
>
> So I ran the above in debug mode ( -d 5 ) and got this:
>
>
> |DEBUG:mmap(148897792)withMAP_HUGETLB failed,huge pages disabled:Cannot
> allocate memory|
>
> so I freed memory on the underlying physical this VM sits on but no
> luck. Same memory error and same result as above.
>
> When patroni is started on just this one node (troubleshooting purposes)
> I get the message:
>
> |[root(at)psql01 ~]#ps -ef|grep -Ei "patroni|postgres"root
> 22172188000:38pts/100:00:00tail -f postgresql-Thu.log postgres
> 25121400:42?00:00:01/usr/bin/python2 /bin/patroni /etc/patroni.yml
> postgres 25331000:42?00:00:00/usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf --hot_standby=on
> --listen_addresses=192.168.0.108 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10 --max_connections=100postgres
> 25352533000:42?00:00:00postgres:postgres:logger process postgres
> 25362533000:42?00:00:00postgres:postgres:startup process waiting
> for000000010000000000000008root 26642039000:42pts/000:00:00grep
> --color=auto -Ei patroni|postgres[root(at)psql01 ~]#|
>
> The system prints "postgres: postgres: startup process waiting for
> 000000010000000000000008"
>
> Seems it's trying to replay the history log but can't? Running an strace
> on the file reveals that it's looking for the following:
>
> |read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily
> unavailable)read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily
> unavailable)open("pg_wal/00000098.history",O_RDONLY)=-1ENOENT (No such
> fileordirectory)epoll_create1(EPOLL_CLOEXEC)=3|
>
> But this file doesn't exist in the directory on this particular cluster
> node (psql01).  I copy the missing history files off another cluster
> node to this one and it does move forward but just get's stuck on
> another history file.  But same error message as above on psql01.
>
> So I give up and try to recover the secondary node (psql02).  But this
> yields this error:
>
> |PANIC:replicationcheckpointhas wrong magic 0instead of307747550|
>
> and I'm not able to find very much to move me forward on this one either.
>
> So again I give up after a few hours of trying to recover psql02 and
> move on to the third node.  On the third cluster node (pgsql03) I get
> this error which I'm also not able to find much info on:
>
> |FATAL:syntax error inhistory file:f2W|
>
> Hoping I'm able to recover the data from at least one of the nodes, make
> it primary, and get it to replicate to the others.
>
> Thx, TK
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message RAJIN RAJ K 2019-05-28 14:36:16 Alternate methods for multiple rows input/output to a function.
Previous Message Tom K 2019-05-28 04:59:23 psql: FATAL: the database system is starting up