Re: Pgpool can't detect database status properly

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: adam(dot)blomeke(at)gmail(dot)com
Cc: pgpool-general(at)lists(dot)postgresql(dot)org
Subject: Re: Pgpool can't detect database status properly
Date: 2025-12-31 00:34:40
Message-ID: 20251231.093440.2297676836475238272.ishii@postgresql.org
Lists: pgpool-general

If you do not want the follow_primary script to execute, you can
temporarily disable it (set "follow_primary_command" to an empty string)
and reload pgpool.conf before running pcp_detach_node.
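
Concretely, the sequence could look something like the sketch below. It
edits a throwaway copy of pgpool.conf so the change itself is visible;
the pcp host and options are copied from your earlier commands, and
pcp_reload_config is shown on the assumption you are on pgpool-II 4.4 or
later (on your real system CONF would of course be
/etc/pgpool-II/pgpool.conf).

```shell
#!/bin/sh
# Sketch: blank out follow_primary_command before a manual detach,
# then restore it afterwards. Uses a demo copy of the config file.
CONF=/tmp/pgpool.conf.demo
printf "follow_primary_command = '/etc/pgpool-II/follow_primary.sh %%d %%h %%p'\n" > "$CONF"

# 1. Set follow_primary_command to an empty string (keep a backup of the
#    original line in $CONF.bak)
sed -i.bak "s|^follow_primary_command *=.*|follow_primary_command = ''|" "$CONF"
grep '^follow_primary_command' "$CONF"
# -> follow_primary_command = ''

# 2. Reload pgpool so the change takes effect, then detach the node, e.g.:
#      pcp_reload_config -h 10.6.1.54 -U pgpool
#      pcp_detach_node   -h 10.6.1.54 -U pgpool -n 0

# 3. When maintenance is done, restore the original setting and reload again
mv "$CONF.bak" "$CONF"
```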

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

> Follow up on this. I've got autofailback set to off, but it's executing the
> follow primary script on the node as soon as I attempt to detach it. Is
> this expected behavior?
>
> [postgres(at)awaprodxtrldbpgpool1 ~]$ grep auto_failback
> /etc/pgpool-II/pgpool.conf
> auto_failback = off
> auto_failback_interval = 1 # Min interval of executing auto_failback in
> [postgres(at)awaprodxtrldbpgpool1 ~]$ # Checkpoint on primary
> psql -h 10.6.1.199 -d postgres -c "CHECKPOINT;"
> # Detach primary to trigger failover
> pcp_detach_node -h 10.6.1.54 -U pgpool -n 0
> # Verify node 1 is now primary
> pcp_node_info -a -U pgpool -h 10.6.1.54
> CHECKPOINT
> pcp_detach_node -- Command Successful
> 10.6.1.199 5432 3 0.500000 down up primary primary 0 none none 2025-12-30
> 14:22:10
> 10.6.1.200 5432 2 0.500000 up up standby standby 0 none none 2025-12-30
> 12:04:09
> [postgres(at)awaprodxtrldbpgpool1 ~]$ pcp_node_info -a -U pgpool -h 10.6.1.54
> 10.6.1.199 5432 3 0.500000 down down standby unknown 0 none none 2025-12-30
> 14:22:11
> 10.6.1.200 5432 2 0.500000 up up primary primary 0 none none 2025-12-30
> 14:22:11
> [postgres(at)awaprodxtrldbpgpool1 ~]$ ssh postgres(at)10(dot)6(dot)1(dot)199
> "/usr/pgsql-18/bin/pg_ctl -D /opt/data/data18 stop"
> Authorized uses only. All activity may be monitored and reported.
> pg_ctl: PID file "/opt/data/data18/postmaster.pid" does not exist
> Is server running?
> [postgres(at)awaprodxtrldbpgpool1 ~]$ ssh postgres(at)10(dot)6(dot)1(dot)199
> Authorized uses only. All activity may be monitored and reported.
> Last login: Tue Dec 30 11:45:09 2025 from 10.6.1.196
> -bash: typeset: TMOUT: readonly variable
> [postgres(at)awaproddbvmnasdwusers1 ~]$ cd /opt/data
> [postgres(at)awaproddbvmnasdwusers1 data]$ ll
> total 1236
> drwxr-x---. 2 postgres postgres 6 Dec 30 14:22 archive18
> drwx------. 23 postgres postgres 4096 Dec 29 12:46 data15
> drwx------. 13 postgres postgres 4096 Dec 30 14:22 data18
> drwx------. 2 postgres postgres 74 Dec 1 15:50 dbservercert
> -rw-r-----. 1 postgres postgres 1255731 Oct 24 18:04 pg_basebackup.log
> [postgres(at)awaproddbvmnasdwusers1 data]$ cd data18/
> [postgres(at)awaproddbvmnasdwusers1 data18]$ ll
> total 8
> -rw-------. 1 postgres postgres 233 Dec 30 14:22 backup_label
> drwx------. 6 postgres postgres 66 Dec 30 14:22 base
> drwx------. 2 postgres postgres 4096 Dec 30 14:22 global
> drwx------. 2 postgres postgres 26 Dec 30 14:22 pg_commit_ts
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_dynshmem
> drwx------. 4 postgres postgres 48 Dec 30 14:22 pg_multixact
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_notify
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_serial
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_snapshots
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_subtrans
> drwx------. 2 postgres postgres 10 Dec 30 14:22 pg_twophase
> drwx------. 4 postgres postgres 121 Dec 30 14:22 pg_wal
> [postgres(at)awaproddbvmnasdwusers1 data18]$
>
> Cheers,
> Adam
>
>
> On Tue, Dec 16, 2025 at 5:29 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
>> > Thanks for the reply.
>> >
>> > Your response makes sense as I'm still setting this cluster up, so it's
>> > just me trying to connect to it.
>> >
>> > I'm curious then what the right process is for when I need to pull a node
>> > out of the cluster for maintenance (e.g. patching). I was under the
>> > impression that I should drop the node, do a pg_rewind, manually set it
>> as
>> > a standby if it was the primary, and then add the node back in pgpool. I
>> > guess I can't do that with auto failback turned on?
>>
>> Yes, during the maintenance you should turn off auto_failback.
>> In fact, it's written in the document:
>>
>> https://www.pgpool.net/docs/47/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS
>>
>> If you plan to detach standby node for maintenance, set this
>> parameter to off beforehand. Otherwise it's possible that standby
>> node is reattached against your intention.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese: http://www.sraoss.co.jp
>>
>> > Cheers,
>> > Adam
>> >
>> >
>> > On Mon, Dec 15, 2025 at 6:43 PM Tatsuo Ishii <ishii(at)postgresql(dot)org>
>> wrote:
>> >
>> >> > I'm resending this as it's been sitting in the moderation queue for a
>> >> > while. Possibly because I didn't have a subject line? Anyways, any
>> help
>> >> > would be great. Thanks!
>> >>
>> >> I received your email this time.
>> >>
>> >> > I’m setting up a pgpool cluster to replace a single node database in
>> my
>> >> > environment. The single node is separate from the cluster at the
>> moment.
>> >> > When it’s time to implement the DB I’m going to redo the
>> backup/restore,
>> >> > throw an upgrade from pg15->18, and then bring the cluster and take
>> over
>> >> > the old IP.
>> >> >
>> >> >
>> >> >
>> >> > *Environment:*
>> >> >
>> >> > - pgpool-II version: 4.6.3 (chirikoboshi)
>> >> > - PostgreSQL version: 18
>> >> > - OS: RHEL9
>> >> > - Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197,
>> >> 10.6.1.198)
>> >> > + 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)
>> >> >
>> >> >
>> >> >
>> >> > *Issue:*
>> >> >
>> >> > I have pgpool configured and I’ve set it up using the scripts and
>> config
>> >> > files from a different instance, one which has been running just fine
>> >> for a
>> >> > year and a half or so. The issue I’m experiencing is that when I
>> >> > detach/reattach a node, it sits in waiting constantly. It never
>> >> transitions
>> >> > to up.
>> >>
>> >> If you connect to 10.6.1.196 (or 10.6.1.197, 10.6.1.198) using psql
>> >> and issue an SQL command, for example "SELECT 1", does it work? If it
>> >> works, it means pgpool works fine.
>> >>
>> >> > I have to manually change the status file to up for it to get to
>> >> > agree that it is,
>> >>
>> >> The pgpool status "waiting" means that the backend node has not yet
>> >> received any query from pgpool clients. You can safely assume that
>> >> pgpool is up and running. Once pgpool receives queries, the status
>> >> should change from "waiting" to "up".
>> >>
>> >> > and when I try to drop the node it doesn't actually drop
>> >> > it. It just goes into waiting again.
>> >>
>> >> Sounds like an effect of auto_failback. Because you set:
>> >>
>> >> auto_failback_interval = 1
>> >>
>> >> pgpool almost immediately brings the node back online.
>> >>
>> >> > I also don’t see any connection
>> >> > attempts from the pgpool server to the postgres nodes if I look at
>> >> postgres
>> >> > logs. I've confirmed that it can run the postgres commands from the
>> >> command
>> >> > line. I've tried this both running pgpool as a service and running it
>> >> > directly from the command line. No difference in behavior.
>> >>
>> >> Probably there's something wrong in the configuration, or pgpool is
>> >> trying to connect to the wrong IP and/or port. Please turn on
>> >> log_client_messages and log_per_node_statement, then send an SQL
>> >> command to pgpool and examine the pgpool log.
>> >>
>> >> > Here’s the log output:
>> >> >
>> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: === Starting fail
>> back.
>> >> > reconnect host 10.6.1.200(5432) ===
>> >> >
>> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION:
>> pgpool_main.c:4169
>> >> >
>> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG: Node 0 is not down
>> >> > (status: 2)
>> >> >
>> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION:
>> pgpool_main.c:1524
>> >> >
>> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG: Do not restart
>> children
>> >> > because we are failing back node id 1 host: 10.6.1.200 port: 5432 and
>> we
>> >> > are in streaming replication mode and not all backends were down
>> >> >
>> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION:
>> pgpool_main.c:4370
>> >> >
>> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG:
>> >> > find_primary_node_repeatedly: waiting for finding a primary node
>> >> >
>> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION:
>> pgpool_main.c:2896
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node:
>> >> primary
>> >> > node is 0
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:
>> pgpool_main.c:2815
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node:
>> >> standby
>> >> > node is 1
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:
>> pgpool_main.c:2821
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new
>> >> primary
>> >> > node: 0
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:
>> pgpool_main.c:4660
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new
>> main
>> >> > node: 0
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:
>> pgpool_main.c:4667
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG: === Failback done.
>> >> > reconnect host 10.6.1.200(5432) ===
>> >> >
>> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:
>> pgpool_main.c:4763
>> >> >
>> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG: worker
>> >> process
>> >> > received restart request
>> >> >
>> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION:
>> >> > pool_worker_child.c:182
>> >> >
>> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG: restart request
>> >> > received in pcp child process
>> >> >
>> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION:
>> pcp_child.c:173
>> >> >
>> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: PCP child 1085087
>> exits
>> >> > with status 0 in failover()
>> >> >
>> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION:
>> pgpool_main.c:4850
>> >> >
>> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG: fork a new PCP child
>> pid
>> >> > 1085089 in failover()
>> >> >
>> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION:
>> pgpool_main.c:4854
>> >> >
>> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG: PCP process:
>> 1085089
>> >> > started
>> >> >
>> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION:
>> pcp_child.c:165
>> >> >
>> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG: process
>> >> started
>> >> >
>> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION:
>> >> > pgpool_main.c:905
>> >> >
>> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG: forked new pcp
>> >> worker,
>> >> > pid=1085093 socket=7
>> >> >
>> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION:
>> pcp_child.c:327
>> >> >
>> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with
>> >> pid:
>> >> > 1085093 exit with SUCCESS.
>> >> >
>> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION:
>> pcp_child.c:384
>> >> >
>> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with
>> >> pid:
>> >> > 1085093 exits with status 0
>> >> >
>> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION:
>> pcp_child.c:398
>> >> >
>> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > 2025-12-03 14:25:39.480: child pid 1085050: DETAIL: restarting myself
>> >> >
>> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOCATION: child.c:1524
>> >> >
>> >> > 2025-12-03 14:25:39.480: child pid 1085038: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085038: DETAIL: restarting myself
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085038: LOCATION: child.c:1524
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085035: DETAIL: restarting myself
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOCATION: child.c:1524
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085061: DETAIL: restarting myself
>> >> >
>> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOCATION: child.c:1524
>> >> >
>> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > 2025-12-03 14:25:39.483: child pid 1085053: DETAIL: restarting myself
>> >> >
>> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOCATION: child.c:1524
>> >> >
>> >> > 2025-12-03 14:25:39.483: child pid 1085059: LOG: failover or failback
>> >> > event detected
>> >> >
>> >> > ......over and over and over again.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > pcp_node_info output:
>> >> >
>> >> > 10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none
>> >> > 2025-12-03 14:04:39
>> >> >
>> >> > 10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming
>> async
>> >> > 2025-12-03 14:04:39
>> >> >
>> >> > Logs show:
>> >> >
>> >> > node status[0]: 1
>> >> >
>> >> > node status[1]: 2
>> >> >
>> >> > Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets
>> status 2
>> >> > (up).
>> >>
>> >> No, this does not show the backend status. Instead, it says
>> >>
>> >> > node status[0]: 1
>> >>
>> >> This means backend 0 is primary.
>> >>
>> >> > node status[1]: 2
>> >>
>> >> This means backend 1 is standby.
>> >>
>> >> > *auto_failback behavior:*
>> >> >
>> >> > - When a node is detached (pcp_detach_node), it goes to status 3
>> >> (down)
>> >> > - auto_failback triggers and moves it to status 1 (waiting)
>> >> > - Node never transitions from waiting to up
>> >>
>> >> Sounds like pgpool has not received queries.
>> >>
>> >> > *Key configuration:*
>> >> >
>> >> > backend_clustering_mode = 'streaming_replication'
>> >> >
>> >> > backend_hostname0 = '10.6.1.199'
>> >> >
>> >> > backend_hostname1 = '10.6.1.200'
>> >> >
>> >> > backend_application_name0 = 'nasdw_users_1'
>> >> >
>> >> > backend_application_name1 = 'nasdw_users_2'
>> >> >
>> >> >
>> >> >
>> >> > use_watchdog = on
>> >> >
>> >> > # 3 watchdog nodes configured
>> >> >
>> >> >
>> >> >
>> >> > auto_failback = on
>> >> >
>> >> > auto_failback_interval = 1
>> >> >
>> >> >
>> >> >
>> >> > sr_check_period = 10
>> >> >
>> >> > sr_check_user = 'pgpool'
>> >> >
>> >> > sr_check_database = 'nasdw_users'
>> >> >
>> >> >
>> >> >
>> >> > health_check_period = 1
>> >> >
>> >> > health_check_user = 'pgpool'
>> >> >
>> >> > health_check_database = 'nasdw_users'
>> >> >
>> >> >
>> >> >
>> >> > failover_when_quorum_exists = on (default)
>> >> >
>> >> > failover_require_consensus = on (default)
>> >> > Cheers,
>> >> > Adam
>> >>
>>
