Pgpool can't detect database status properly

From: Adam Blomeke <adam(dot)blomeke(at)gmail(dot)com>
To: pgpool-general(at)lists(dot)postgresql(dot)org
Subject: Pgpool can't detect database status properly
Date: 2025-12-12 21:25:59
Message-ID: CAG9Amsj61Oy7YGJw40uv7fedZwabLu2b5dVmJK4aDrZLDMVj+w@mail.gmail.com
Lists: pgpool-general

I'm resending this as it's been sitting in the moderation queue for a
while. Possibly because I didn't have a subject line? Anyways, any help
would be great. Thanks!

I’m setting up a pgpool cluster to replace a single-node database in my
environment. The single node is separate from the cluster at the moment.
When it’s time to implement the DB I’m going to redo the backup/restore,
throw in an upgrade from pg15->18, and then bring the cluster up and take
over the old IP.

*Environment:*

- pgpool-II version: 4.6.3 (chirikoboshi)
- PostgreSQL version: 18
- OS: RHEL9
- Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197, 10.6.1.198)
  and 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)

*Issue:*

I have pgpool configured and I’ve set it up using the scripts and config
files from a different instance, one which has been running just fine for a
year and a half or so. The issue I’m experiencing is that when I
detach/reattach a node, it stays in waiting indefinitely and never
transitions to up. I have to manually change the status file to up to get
pgpool to agree that the node is up, and when I try to drop the node it
doesn't actually drop it; it just goes into waiting again. I also don’t see
any connection attempts from the pgpool server to the postgres nodes in the
postgres logs. I've confirmed that it can run the postgres commands from the command
line. I've tried this both running pgpool as a service and running it
directly from the command line. No difference in behavior.
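
Since the postgres logs show no connection attempts, one thing worth ruling out is plain TCP reachability from each pgpool node to the backends. The following is only a minimal sketch using the Python standard library; the helper name `can_reach` is made up for illustration, and it checks nothing beyond whether a TCP connection to the port can be opened (no authentication, no pg_hba.conf rules, no query).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only. It does not speak the
    PostgreSQL protocol, so it says nothing about authentication.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on each pgpool node, `can_reach("10.6.1.199", 5432)` and `can_reach("10.6.1.200", 5432)` would show whether the backend port is reachable at all. A True result only narrows the problem down; it does not prove the health check user can log in.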

Here’s the log output:

2025-12-03 14:20:49.037: main pid 1085028: LOG: === Starting fail back.
reconnect host 10.6.1.200(5432) ===

2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:4169

2025-12-03 14:20:49.037: main pid 1085028: LOG: Node 0 is not down
(status: 2)

2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:1524

2025-12-03 14:20:49.038: main pid 1085028: LOG: Do not restart children
because we are failing back node id 1 host: 10.6.1.200 port: 5432 and we
are in streaming replication mode and not all backends were down

2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:4370

2025-12-03 14:20:49.038: main pid 1085028: LOG:
find_primary_node_repeatedly: waiting for finding a primary node

2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:2896

2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: primary
node is 0

2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2815

2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: standby
node is 1

2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2821

2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new primary
node: 0

2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4660

2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new main
node: 0

2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4667

2025-12-03 14:20:49.189: main pid 1085028: LOG: === Failback done.
reconnect host 10.6.1.200(5432) ===

2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4763

2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG: worker process
received restart request

2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION:
pool_worker_child.c:182

2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG: restart request
received in pcp child process

2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION: pcp_child.c:173

2025-12-03 14:20:50.193: main pid 1085028: LOG: PCP child 1085087 exits
with status 0 in failover()

2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4850

2025-12-03 14:20:50.193: main pid 1085028: LOG: fork a new PCP child pid
1085089 in failover()

2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4854

2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG: PCP process: 1085089
started

2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION: pcp_child.c:165

2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG: process started

2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION:
pgpool_main.c:905

2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG: forked new pcp worker,
pid=1085093 socket=7

2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION: pcp_child.c:327

2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid:
1085093 exit with SUCCESS.

2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:384

2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid:
1085093 exits with status 0

2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:398

2025-12-03 14:25:39.480: child pid 1085050: LOG: failover or failback
event detected

2025-12-03 14:25:39.480: child pid 1085050: DETAIL: restarting myself

2025-12-03 14:25:39.480: child pid 1085050: LOCATION: child.c:1524

2025-12-03 14:25:39.480: child pid 1085038: LOG: failover or failback
event detected

2025-12-03 14:25:39.481: child pid 1085038: DETAIL: restarting myself

2025-12-03 14:25:39.481: child pid 1085038: LOCATION: child.c:1524

2025-12-03 14:25:39.481: child pid 1085035: LOG: failover or failback
event detected

2025-12-03 14:25:39.481: child pid 1085035: DETAIL: restarting myself

2025-12-03 14:25:39.481: child pid 1085035: LOCATION: child.c:1524

2025-12-03 14:25:39.481: child pid 1085061: LOG: failover or failback
event detected

2025-12-03 14:25:39.481: child pid 1085061: DETAIL: restarting myself

2025-12-03 14:25:39.481: child pid 1085061: LOCATION: child.c:1524

2025-12-03 14:25:39.483: child pid 1085053: LOG: failover or failback
event detected

2025-12-03 14:25:39.483: child pid 1085053: DETAIL: restarting myself

2025-12-03 14:25:39.483: child pid 1085053: LOCATION: child.c:1524

2025-12-03 14:25:39.483: child pid 1085059: LOG: failover or failback
event detected

......over and over and over again.

pcp_node_info output:

10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none 2025-12-03 14:04:39

10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async 2025-12-03 14:04:39

Logs show:

node status[0]: 1

node status[1]: 2

Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets status 2
(up).
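
To make the pcp_node_info lines above easier to read, here is a small sketch that splits one output line into named fields. The field order is inferred from the sample output in this message, and the status descriptions follow my reading of the pcp_node_info documentation (where status 1 is described as up with no connections pooled yet, and 2 as up with connections pooled); both should be treated as assumptions. The names `NodeInfo`, `parse_node_info`, and `STATUS_MEANING` are made up for illustration.

```python
from dataclasses import dataclass

# Assumed meanings of the numeric status, per the pcp_node_info docs.
STATUS_MEANING = {
    0: "initializing (not normally displayed)",
    1: "up, no connections pooled yet (shown as 'waiting')",
    2: "up, connections pooled (shown as 'up')",
    3: "down",
}


@dataclass
class NodeInfo:
    host: str
    port: int
    status: int
    lb_weight: float
    status_name: str             # pgpool's view, e.g. "waiting"
    pg_status: str               # actual backend status, e.g. "up"
    role: str
    pg_role: str
    replication_delay: str       # kept as text; the unit depends on config
    replication_state: str
    replication_sync_state: str
    last_status_change: str


def parse_node_info(line: str) -> NodeInfo:
    """Parse one non-verbose pcp_node_info line (13 whitespace-separated tokens)."""
    parts = line.split()
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}")
    return NodeInfo(
        host=parts[0],
        port=int(parts[1]),
        status=int(parts[2]),
        lb_weight=float(parts[3]),
        status_name=parts[4],
        pg_status=parts[5],
        role=parts[6],
        pg_role=parts[7],
        replication_delay=parts[8],
        replication_state=parts[9],
        replication_sync_state=parts[10],
        last_status_change=" ".join(parts[11:13]),  # date and time tokens
    )
```

Feeding the first line of the output above to `parse_node_info` gives status 1 with status name "waiting" and backend status "up". If the documented meaning of status 1 holds, that would describe a node pgpool considers usable but has not yet opened pooled connections to.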

*auto_failback behavior:*

- When a node is detached (pcp_detach_node), it goes to status 3 (down)
- auto_failback triggers and moves it to status 1 (waiting)
- Node never transitions from waiting to up

*Key configuration:*

backend_clustering_mode = 'streaming_replication'

backend_hostname0 = '10.6.1.199'

backend_hostname1 = '10.6.1.200'

backend_application_name0 = 'nasdw_users_1'

backend_application_name1 = 'nasdw_users_2'

use_watchdog = on

# 3 watchdog nodes configured

auto_failback = on

auto_failback_interval = 1

sr_check_period = 10

sr_check_user = 'pgpool'

sr_check_database = 'nasdw_users'

health_check_period = 1

health_check_user = 'pgpool'

health_check_database = 'nasdw_users'

failover_when_quorum_exists = on (default)

failover_require_consensus = on (default)

Cheers,
Adam
