| From: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
|---|---|
| To: | adam(dot)blomeke(at)gmail(dot)com |
| Cc: | pgpool-general(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Pgpool can't detect database status properly |
| Date: | 2025-12-17 01:29:35 |
| Message-ID: | 20251217.102935.700318527231632102.ishii@postgresql.org |
| Lists: | pgpool-general |
> Thanks for the reply.
>
> Your response makes sense as I'm still setting this cluster up, so it's
> just me trying to connect to it.
>
> I'm curious then what the right process is for when I need to pull a node
> out of the cluster for maintenance (e.g. patching). I was under the
> impression that I should drop the node, do a pg_rewind, manually set it as
> a standby if it was the primary, and then add the node back in pgpool. I
> guess I can't do that with auto failback turned on?
Yes, during maintenance you should turn off auto failback.
In fact, this is covered in the documentation:
https://www.pgpool.net/docs/47/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS
If you plan to detach standby node for maintenance, set this
parameter to off beforehand. Otherwise it's possible that standby
node is reattached against your intention.
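For example, a typical maintenance cycle looks roughly like this (a
sketch only; the PCP port 9898 and passfile setup are assumptions,
and node id 1 is your standby 10.6.1.200 -- adjust to your
environment):

    # 1. in pgpool.conf set auto_failback = off, then reload
    pgpool reload

    # 2. detach the node you want to work on
    pcp_detach_node -h 10.6.1.196 -p 9898 -n 1

    # 3. patch/reboot the backend; if it was the old primary,
    #    run pg_rewind and reconfigure it as a standby

    # 4. reattach the node
    pcp_attach_node -h 10.6.1.196 -p 9898 -n 1

    # 5. re-enable auto_failback in pgpool.conf and reload again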
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
> Cheers,
> Adam
>
>
> On Mon, Dec 15, 2025 at 6:43 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
>> > I'm resending this as it's been sitting in the moderation queue for a
>> > while. Possibly because I didn't have a subject line? Anyways, any help
>> > would be great. Thanks!
>>
>> I received your email this time.
>>
>> > I’m setting up a pgpool cluster to replace a single node database in my
>> > environment. The single node is separate from the cluster at the moment.
>> > When it’s time to implement the DB I’m going to redo the backup/restore,
>> > run an upgrade from pg15->18, and then bring up the cluster to take over
>> > the old IP.
>> >
>> >
>> >
>> > *Environment:*
>> >
>> > - pgpool-II version: 4.6.3 (chirikoboshi)
>> > - PostgreSQL version: 18
>> > - OS: RHEL9
>> > - Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197,
>> 10.6.1.198)
>> > + 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)
>> >
>> >
>> >
>> > *Issue:*
>> >
>> > I have pgpool configured and I’ve set it up using the scripts and config
>> > files from a different instance, one which has been running just fine
>> for a
>> > year and a half or so. The issue I’m experiencing is that when I
>> > detach/reattach a node, it sits in waiting constantly. It never
>> transitions
>> > to up.
>>
>> If you connect to 10.6.1.196 (or 10.6.1.197, 10.6.1.198) using psql
>> and issue an SQL command, for example "SELECT 1", does it work? If it
>> works, it means pgpool works fine.
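>>
>> For example (assuming the default pgpool port 9999; use whatever port
>> your pgpool.conf listens on):
>>
>>     psql -h 10.6.1.196 -p 9999 -U pgpool -d nasdw_users -c 'SELECT 1'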
>>
>> > I have to manually change the status file to up to get it to agree
>> > that it is,
>>
>> The pgpool status "waiting" means that the backend node has not
>> received any query from pgpool clients yet. You can safely assume
>> that pgpool is up and running. Once pgpool receives queries, the
>> status should change from "waiting" to "up".
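>>
>> You can watch this with pgpool's SHOW POOL_NODES pseudo command, for
>> example (again assuming port 9999):
>>
>>     psql -h 10.6.1.196 -p 9999 -U pgpool -d nasdw_users -c 'SHOW POOL_NODES'
>>
>> The status column should move from "waiting" to "up" once regular
>> queries have been routed to that node.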
>>
>> > and when I try to drop the node it doesn't actually drop
>> > it. It just goes into waiting again.
>>
>> Sounds like an effect of auto failback. Because you set:
>>
>> auto_failback_interval = 1
>>
>> pgpool almost immediately brings the node back online.
>>
>> > I also don’t see any connection
>> > attempts from the pgpool server to the postgres nodes if I look at
>> postgres
>> > logs. I've confirmed that it can run the postgres commands from the
>> command
>> > line. I've tried this both running pgpool as a service and running it
>> > directly from the command line. No difference in behavior.
>>
>> Probably there's something wrong in the configuration, or pgpool is
>> trying to connect to the wrong IP and/or port. Please turn on
>> log_client_messages and log_per_node_statement, then send an SQL
>> command to pgpool, and examine the pgpool log.
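>>
>> Roughly, in pgpool.conf:
>>
>>     log_client_messages = on
>>     log_per_node_statement = on
>>
>> then reload pgpool and send a test query through it; the log should
>> then show which backend each statement is routed to.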
>>
>> > Here’s the log output:
>> >
>> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: === Starting fail back.
>> > reconnect host 10.6.1.200(5432) ===
>> >
>> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:4169
>> >
>> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: Node 0 is not down
>> > (status: 2)
>> >
>> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:1524
>> >
>> > 2025-12-03 14:20:49.038: main pid 1085028: LOG: Do not restart children
>> > because we are failing back node id 1 host: 10.6.1.200 port: 5432 and we
>> > are in streaming replication mode and not all backends were down
>> >
>> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:4370
>> >
>> > 2025-12-03 14:20:49.038: main pid 1085028: LOG:
>> > find_primary_node_repeatedly: waiting for finding a primary node
>> >
>> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:2896
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node:
>> primary
>> > node is 0
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2815
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node:
>> standby
>> > node is 1
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2821
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new
>> primary
>> > node: 0
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4660
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new main
>> > node: 0
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4667
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: === Failback done.
>> > reconnect host 10.6.1.200(5432) ===
>> >
>> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4763
>> >
>> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG: worker
>> process
>> > received restart request
>> >
>> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION:
>> > pool_worker_child.c:182
>> >
>> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG: restart request
>> > received in pcp child process
>> >
>> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION: pcp_child.c:173
>> >
>> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: PCP child 1085087 exits
>> > with status 0 in failover()
>> >
>> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4850
>> >
>> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: fork a new PCP child pid
>> > 1085089 in failover()
>> >
>> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4854
>> >
>> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG: PCP process: 1085089
>> > started
>> >
>> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION: pcp_child.c:165
>> >
>> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG: process
>> started
>> >
>> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION:
>> > pgpool_main.c:905
>> >
>> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG: forked new pcp
>> worker,
>> > pid=1085093 socket=7
>> >
>> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION: pcp_child.c:327
>> >
>> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with
>> pid:
>> > 1085093 exit with SUCCESS.
>> >
>> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:384
>> >
>> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with
>> pid:
>> > 1085093 exits with status 0
>> >
>> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:398
>> >
>> > 2025-12-03 14:25:39.480: child pid 1085050: LOG: failover or failback
>> > event detected
>> >
>> > 2025-12-03 14:25:39.480: child pid 1085050: DETAIL: restarting myself
>> >
>> > 2025-12-03 14:25:39.480: child pid 1085050: LOCATION: child.c:1524
>> >
>> > 2025-12-03 14:25:39.480: child pid 1085038: LOG: failover or failback
>> > event detected
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085038: DETAIL: restarting myself
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085038: LOCATION: child.c:1524
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085035: LOG: failover or failback
>> > event detected
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085035: DETAIL: restarting myself
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085035: LOCATION: child.c:1524
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085061: LOG: failover or failback
>> > event detected
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085061: DETAIL: restarting myself
>> >
>> > 2025-12-03 14:25:39.481: child pid 1085061: LOCATION: child.c:1524
>> >
>> > 2025-12-03 14:25:39.483: child pid 1085053: LOG: failover or failback
>> > event detected
>> >
>> > 2025-12-03 14:25:39.483: child pid 1085053: DETAIL: restarting myself
>> >
>> > 2025-12-03 14:25:39.483: child pid 1085053: LOCATION: child.c:1524
>> >
>> > 2025-12-03 14:25:39.483: child pid 1085059: LOG: failover or failback
>> > event detected
>> >
>> > ......over and over and over again.
>> >
>> >
>> >
>> >
>> > pcp_node_info output:
>> >
>> > 10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none
>> > 2025-12-03 14:04:39
>> >
>> > 10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async
>> > 2025-12-03 14:04:39
>> >
>> > Logs show:
>> >
>> > node status[0]: 1
>> >
>> > node status[1]: 2
>> >
>> > Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets status 2
>> > (up).
>>
>> No, this does not show the backend status. Instead, it says
>>
>> > node status[0]: 1
>>
>> This means backend 0 is primary.
>>
>> > node status[1]: 2
>>
>> This means backend 1 is standby.
>>
>> > *auto_failback behavior:*
>> >
>> > - When a node is detached (pcp_detach_node), it goes to status 3
>> (down)
>> > - auto_failback triggers and moves it to status 1 (waiting)
>> > - Node never transitions from waiting to up
>>
>> Sounds like pgpool has not received queries.
>>
>> > *Key configuration:*
>> >
>> > backend_clustering_mode = 'streaming_replication'
>> >
>> > backend_hostname0 = '10.6.1.199'
>> >
>> > backend_hostname1 = '10.6.1.200'
>> >
>> > backend_application_name0 = 'nasdw_users_1'
>> >
>> > backend_application_name1 = 'nasdw_users_2'
>> >
>> >
>> >
>> > use_watchdog = on
>> >
>> > # 3 watchdog nodes configured
>> >
>> >
>> >
>> > auto_failback = on
>> >
>> > auto_failback_interval = 1
>> >
>> >
>> >
>> > sr_check_period = 10
>> >
>> > sr_check_user = 'pgpool'
>> >
>> > sr_check_database = 'nasdw_users'
>> >
>> >
>> >
>> > health_check_period = 1
>> >
>> > health_check_user = 'pgpool'
>> >
>> > health_check_database = 'nasdw_users'
>> >
>> >
>> >
>> > failover_when_quorum_exists = on (default)
>> >
>> > failover_require_consensus = on (default)
>> > Cheers,
>> > Adam
>>