Re: Pgpool can't detect database status properly

From: Adam Blomeke <adam(dot)blomeke(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgpool-general(at)lists(dot)postgresql(dot)org
Subject: Re: Pgpool can't detect database status properly
Date: 2025-12-30 21:59:57
Message-ID: CAG9AmshGJjxEdjRuYJKYt4nJMOuvs2M_mZ1T7cYLwbbrmG7yzw@mail.gmail.com
Lists: pgpool-general

Following up on this: I've got auto_failback set to off, but pgpool still
executes the follow primary script on the node as soon as I attempt to detach
it. Is this expected behavior?
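For context, here is how I understand the relevant parameters to interact; the
comments are my reading of the pgpool-II docs, not confirmed behavior, and the
values are illustrative rather than my full config:

```
# pgpool.conf fragment (illustrative)
auto_failback = off              # governs automatic re-attachment of a down
                                 # standby only; it does not gate failover
follow_primary_command = '...'   # runs after a failover elects a new primary,
                                 # so detaching the current primary can still
                                 # trigger it even with auto_failback off
```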

[postgres(at)awaprodxtrldbpgpool1 ~]$ grep auto_failback /etc/pgpool-II/pgpool.conf
auto_failback = off
auto_failback_interval = 1
# Min interval of executing auto_failback in
[postgres(at)awaprodxtrldbpgpool1 ~]$ # Checkpoint on primary
psql -h 10.6.1.199 -d postgres -c "CHECKPOINT;"
# Detach primary to trigger failover
pcp_detach_node -h 10.6.1.54 -U pgpool -n 0
# Verify node 1 is now primary
pcp_node_info -a -U pgpool -h 10.6.1.54
CHECKPOINT
pcp_detach_node -- Command Successful
10.6.1.199 5432 3 0.500000 down up primary primary 0 none none 2025-12-30 14:22:10
10.6.1.200 5432 2 0.500000 up up standby standby 0 none none 2025-12-30 12:04:09
[postgres(at)awaprodxtrldbpgpool1 ~]$ pcp_node_info -a -U pgpool -h 10.6.1.54
10.6.1.199 5432 3 0.500000 down down standby unknown 0 none none 2025-12-30 14:22:11
10.6.1.200 5432 2 0.500000 up up primary primary 0 none none 2025-12-30 14:22:11
[postgres(at)awaprodxtrldbpgpool1 ~]$ ssh postgres(at)10(dot)6(dot)1(dot)199 "/usr/pgsql-18/bin/pg_ctl -D /opt/data/data18 stop"
Authorized uses only. All activity may be monitored and reported.
pg_ctl: PID file "/opt/data/data18/postmaster.pid" does not exist
Is server running?
[postgres(at)awaprodxtrldbpgpool1 ~]$ ssh postgres(at)10(dot)6(dot)1(dot)199
Authorized uses only. All activity may be monitored and reported.
Last login: Tue Dec 30 11:45:09 2025 from 10.6.1.196
-bash: typeset: TMOUT: readonly variable
[postgres(at)awaproddbvmnasdwusers1 ~]$ cd /opt/data
[postgres(at)awaproddbvmnasdwusers1 data]$ ll
total 1236
drwxr-x---. 2 postgres postgres 6 Dec 30 14:22 archive18
drwx------. 23 postgres postgres 4096 Dec 29 12:46 data15
drwx------. 13 postgres postgres 4096 Dec 30 14:22 data18
drwx------. 2 postgres postgres 74 Dec 1 15:50 dbservercert
-rw-r-----. 1 postgres postgres 1255731 Oct 24 18:04 pg_basebackup.log
[postgres(at)awaproddbvmnasdwusers1 data]$ cd data18/
[postgres(at)awaproddbvmnasdwusers1 data18]$ ll
total 8
-rw-------. 1 postgres postgres 233 Dec 30 14:22 backup_label
drwx------. 6 postgres postgres 66 Dec 30 14:22 base
drwx------. 2 postgres postgres 4096 Dec 30 14:22 global
drwx------. 2 postgres postgres 26 Dec 30 14:22 pg_commit_ts
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_dynshmem
drwx------. 4 postgres postgres 48 Dec 30 14:22 pg_multixact
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_notify
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_serial
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_snapshots
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_subtrans
drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_twophase
drwx------. 4 postgres postgres 121 Dec 30 14:22 pg_wal
[postgres(at)awaproddbvmnasdwusers1 data18]$
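For the record, the maintenance flow I had in mind looks roughly like the
following. This is only a sketch under my assumptions (stop/rewind options,
standby.signal demotion, and that primary_conninfo is already configured on
the node), not a verified runbook:

```shell
# 1. Detach the node from pgpool (if it is the primary, this triggers failover)
pcp_detach_node -h 10.6.1.54 -U pgpool -n 0

# 2. Stop the old primary and rewind it against the new primary
/usr/pgsql-18/bin/pg_ctl -D /opt/data/data18 stop -m fast
/usr/pgsql-18/bin/pg_rewind --target-pgdata=/opt/data/data18 \
    --source-server="host=10.6.1.200 port=5432 user=postgres dbname=postgres"

# 3. Demote it to a standby and start it
touch /opt/data/data18/standby.signal
/usr/pgsql-18/bin/pg_ctl -D /opt/data/data18 start

# 4. Re-attach the node in pgpool
pcp_attach_node -h 10.6.1.54 -U pgpool -n 0
```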

Cheers,
Adam

On Tue, Dec 16, 2025 at 5:29 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:

> > Thanks for the reply.
> >
> > Your response makes sense as I'm still setting this cluster up, so it's
> > just me trying to connect to it.
> >
> > I'm curious then what the right process is for when I need to pull a node
> > out of the cluster for maintenance (e.g. patching). I was under the
> > impression that I should drop the node, do a pg_rewind, manually set it as
> > a standby if it was the primary, and then add the node back in pgpool. I
> > guess I can't do that with auto failback turned on?
>
> Yes, during the maintenance you should turn off auto failback.
> In fact, it's written in the document:
>
> https://www.pgpool.net/docs/47/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS
>
> If you plan to detach standby node for maintenance, set this
> parameter to off beforehand. Otherwise it's possible that standby
> node is reattached against your intention.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
> > Cheers,
> > Adam
> >
> >
> > On Mon, Dec 15, 2025 at 6:43 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> >
> >> > I'm resending this as it's been sitting in the moderation queue for a
> >> > while. Possibly because I didn't have a subject line? Anyways, any help
> >> > would be great. Thanks!
> >>
> >> I received your email this time.
> >>
> >> > I’m setting up a pgpool cluster to replace a single node database in my
> >> > environment. The single node is separate from the cluster at the moment.
> >> > When it’s time to implement the DB I’m going to redo the backup/restore,
> >> > throw in an upgrade from pg15->18, and then bring the cluster online and
> >> > take over the old IP.
> >> >
> >> >
> >> >
> >> > *Environment:*
> >> >
> >> > - pgpool-II version: 4.6.3 (chirikoboshi)
> >> > - PostgreSQL version: 18
> >> > - OS: RHEL9
> >> > - Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197, 10.6.1.198)
> >> >   + 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)
> >> >
> >> >
> >> >
> >> > *Issue:*
> >> >
> >> > I have pgpool configured and I’ve set it up using the scripts and config
> >> > files from a different instance, one which has been running just fine for
> >> > a year and a half or so. The issue I’m experiencing is that when I
> >> > detach/reattach a node, it sits in waiting constantly. It never
> >> > transitions to up.
> >>
> >> If you connect to 10.6.1.196 (or 10.6.1.197, 10.6.1.198) using psql
> >> and issue an SQL command, for example "SELECT 1", does it work? If it
> >> works, it means pgpool works fine.
> >>
> >> > I have to manually change the status file to up to get pgpool to
> >> > agree that it is,
> >>
> >> The pgpool status "waiting" means that the backend node has never
> >> received any query from pgpool clients yet. You can safely assume
> >> that pgpool is up and running. Once pgpool receives queries, the
> >> status should change from "waiting" to "up".
> >>
> >> > and when I try to drop the node it doesn't actually drop
> >> > it. It just goes into waiting again.
> >>
> >> Sounds like an effect of auto failback, because you set:
> >>
> >> auto_failback_interval = 1
> >>
> >> pgpool almost immediately brings the node back online.
> >>
> >> > I also don’t see any connection attempts from the pgpool server to the
> >> > postgres nodes if I look at postgres logs. I've confirmed that it can run
> >> > the postgres commands from the command line. I've tried this both running
> >> > pgpool as a service and running it directly from the command line. No
> >> > difference in behavior.
> >>
> >> Probably there's something wrong in the configuration, or pgpool is
> >> trying to connect to the wrong IP and/or port. Please turn on
> >> log_client_messages and log_per_node_statement, then send an SQL command
> >> to pgpool and examine the pgpool log.
> >>
> >> > Here’s the log output:
> >> >
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: === Starting fail back. reconnect host 10.6.1.200(5432) ===
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:4169
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: Node 0 is not down (status: 2)
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:1524
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG: Do not restart children because we are failing back node id 1 host: 10.6.1.200 port: 5432 and we are in streaming replication mode and not all backends were down
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:4370
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG: find_primary_node_repeatedly: waiting for finding a primary node
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:2896
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: primary node is 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2815
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: standby node is 1
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2821
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new primary node: 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4660
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new main node: 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4667
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: === Failback done. reconnect host 10.6.1.200(5432) ===
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4763
> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG: worker process received restart request
> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION: pool_worker_child.c:182
> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG: restart request received in pcp child process
> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION: pcp_child.c:173
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: PCP child 1085087 exits with status 0 in failover()
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4850
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: fork a new PCP child pid 1085089 in failover()
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4854
> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG: PCP process: 1085089 started
> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION: pcp_child.c:165
> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG: process started
> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION: pgpool_main.c:905
> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG: forked new pcp worker, pid=1085093 socket=7
> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION: pcp_child.c:327
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid: 1085093 exit with SUCCESS.
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:384
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid: 1085093 exits with status 0
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:398
> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOG: failover or failback event detected
> >> > 2025-12-03 14:25:39.480: child pid 1085050: DETAIL: restarting myself
> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOCATION: child.c:1524
> >> > 2025-12-03 14:25:39.480: child pid 1085038: LOG: failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085038: DETAIL: restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085038: LOCATION: child.c:1524
> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOG: failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085035: DETAIL: restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOCATION: child.c:1524
> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOG: failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085061: DETAIL: restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOCATION: child.c:1524
> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOG: failover or failback event detected
> >> > 2025-12-03 14:25:39.483: child pid 1085053: DETAIL: restarting myself
> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOCATION: child.c:1524
> >> > 2025-12-03 14:25:39.483: child pid 1085059: LOG: failover or failback event detected
> >> >
> >> > ......over and over and over again.
> >> >
> >> >
> >> >
> >> >
> >> > pcp_node_info output:
> >> >
> >> > 10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none 2025-12-03 14:04:39
> >> >
> >> > 10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async 2025-12-03 14:04:39
> >> >
> >> > Logs show:
> >> >
> >> > node status[0]: 1
> >> >
> >> > node status[1]: 2
> >> >
> >> > Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets status 2 (up).
> >>
> >> No, this does not show the backend status. Instead, it says
> >>
> >> > node status[0]: 1
> >>
> >> This means backend 0 is primary.
> >>
> >> > node status[1]: 2
> >>
> >> This means backend 1 is standby.
> >>
> >> > *auto_failback behavior:*
> >> >
> >> > - When a node is detached (pcp_detach_node), it goes to status 3 (down)
> >> > - auto_failback triggers and moves it to status 1 (waiting)
> >> > - Node never transitions from waiting to up
> >>
> >> Sounds like pgpool has not received queries.
> >>
> >> > *Key configuration:*
> >> >
> >> > backend_clustering_mode = 'streaming_replication'
> >> >
> >> > backend_hostname0 = '10.6.1.199'
> >> >
> >> > backend_hostname1 = '10.6.1.200'
> >> >
> >> > backend_application_name0 = 'nasdw_users_1'
> >> >
> >> > backend_application_name1 = 'nasdw_users_2'
> >> >
> >> >
> >> >
> >> > use_watchdog = on
> >> >
> >> > # 3 watchdog nodes configured
> >> >
> >> >
> >> >
> >> > auto_failback = on
> >> >
> >> > auto_failback_interval = 1
> >> >
> >> >
> >> >
> >> > sr_check_period = 10
> >> >
> >> > sr_check_user = 'pgpool'
> >> >
> >> > sr_check_database = 'nasdw_users'
> >> >
> >> >
> >> >
> >> > health_check_period = 1
> >> >
> >> > health_check_user = 'pgpool'
> >> >
> >> > health_check_database = 'nasdw_users'
> >> >
> >> >
> >> >
> >> > failover_when_quorum_exists = on (default)
> >> >
> >> > failover_require_consensus = on (default)
> >> > Cheers,
> >> > Adam
> >>
>
