Re: [Ext:] Re: Stream Replication not working

From: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
To: Sushant Postgres <sushant(dot)postgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
Date: 2022-01-19 01:09:53
Message-ID: CY4PR0401MB35565AA29CFF9825A8F4361DD9589@CY4PR0401MB3556.namprd04.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

I was able to figure out what the problem was that was preventing my PostgreSQL Stream replication from replicating the changes even though all the health checks queries were showing that the replication had no problems.
I am a newbie and I am learning PostgreSQL on my own, so it took me a while to figure this out, so I am sharing this with the community in case somebody else in the future runs into the same problem.

1. In reading the postgresql documentation I found a view called pg_stat_activity that shows the session activity of the database showing the status and the wait events associate with the session
2. In checking the session activity in the pg_stat_activity view, I was able to identify the following:

-[ RECORD 2 ]----+--------------------------------

datid |

datname |

pid | 17333

leader_pid |

usesysid |

usename |

application_name |

client_addr |

client_hostname |

client_port |

backend_start | 2022-01-06 17:29:51.503073-07

xact_start |

query_start |

state_change |

wait_event_type | IPC

wait_event | RecoveryPause

state |

backend_xid |

backend_xmin |

query |

backend_type | startup

1. So I started to research the wait event “RecoveryPause” and I found a link to the postgresql documentation that explains all the recovery_target setttings https://www.postgresql.org/docs/9.5/recovery-target-settings.html

2. So I dediced to review all the recovery settings my cluster had in the postgresql.conf file and I found that I had the parameter recovery_target_time configured as follows recovery_target_time='2021-04-20 21:00:00 MST', and that is when I realized that this configuration was preventing me for applying the latest changes to the standby site, because this parameter basically sets the time up to which the recovery will proceed. This is the reason why all the health check queries where not showing any problems, because there was no problem at all, I just had a parameter misconfigured that was stopping the standby from applying any WAL files because the recovery target was set up to April 2021.
3. Once I figured this out, I disabled the recovery_target_time parameter on the standby site and bounce the postgresql cluster. As soon as the standby cluster was up and running again all the changes I did on January 6th (and that had been pending from being applied) were immediately applied and now the standby site is completely in sync with the master, and applying WAL files as they are being shipped to the standby site.

Thank you to all of you that sent me suggestions, even though the suggestions did not resolve the problem they gave me ideas on which direction I needed to go to continue troubleshooting the problem.

Regarding the entries on the pg_locks view (see details below in this email thread), they were showing normal activity but not an actual problem, so it did not point me to the problem but allowed me to learn about the view that shows active locks in the database. Thank you again for sharing this info with me.

postgresql=# select datname from pg_database where oid = 16384;

datname

------------

postgresql

(1 row)

postgresql=# select relname from pg_class where oid = 12141;

relname

----------

pg_locks

(1 row)

postgresql=#

Have a great week everyone.
Regards,
Allie

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Date: Tuesday, January 11, 2022 at 6:18 PM
To: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Cc: sushant(dot)postgres(at)gmail(dot)com <sushant(dot)postgres(at)gmail(dot)com>, amit(dot)kapila16(at)gmail(dot)com <amit(dot)kapila16(at)gmail(dot)com>, satyanarlapuram(at)gmail(dot)com <satyanarlapuram(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
Hi.

At Tue, 11 Jan 2022 15:05:55 +0000, Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org> wrote in
> er it. How do I figure out which database and relation is db:16384
> and relation:12141.?

On any database,

select datname from pg_database where oid = 16384;

Then on the shown database,

select relname from pg_class where oid = 12141;

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Date: Tuesday, January 11, 2022 at 8:05 AM
To: Sushant Postgres <sushant(dot)postgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
Amit,
Thank you for your help in trying to understand the information that the pg_locks table is showing. Regarding your question, I am not sure who to answer it. How do I figure out which database and relation is db:16384
and relation:12141.?

Thanks,
Allie

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Date: Tuesday, January 11, 2022 at 6:28 AM
To: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
It seems both master and standby have an exclusive lock on db:16384
and relation:12141. Which is this database/relation and why is the
app/database holding a lock on it?

--
With Regards,
Amit Kapila.

From: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Date: Tuesday, January 11, 2022 at 7:47 AM
To: Sushant Postgres <sushant(dot)postgres(at)gmail(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
Satya,
I am a newbie on postgresql, I have no previous experience with postgresql and I need to get this replication working. In looking at the data that the pg_lock is showing, I do not know how to interpret it.
I will really appreciate any help you can give me in resolving this issue.
Regards,
Allie

From: Sushant Postgres <sushant(dot)postgres(at)gmail(dot)com>
Date: Tuesday, January 11, 2022 at 12:49 AM
To: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Ext:] Re: Stream Replication not working
Hi All,

for us also, logs are applying at slave server but very very slow. While checking we also have seen same set of locks to Master and Slave servers.
Please suggest the solution for that.
Many Thanks in Advance !!
Thanks

On Tue, Jan 11, 2022 at 2:12 AM Allie Crawford <CrawfordMA(at)churchofjesuschrist(dot)org<mailto:CrawfordMA(at)churchofjesuschrist(dot)org>> wrote:
Thank you so much for your help on this Satya. I have detailed right below the output of the query you asked me to run.

Master

postgresql@<master> ~>psql

psql (13.5)

Type "help" for help.

postgresql=# select * from pg_locks;

locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath

------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------

relation | 16384 | 12141 | | | | | | | | 3/6715 | 2669949 | AccessShareLock | t | t

virtualxid | | | | | 3/6715 | | | | | 3/6715 | 2669949 | ExclusiveLock | t | t

(2 rows)

postgresql=#

Standby
postgresql@<standby> ~>psql
psql (13.5)
Type "help" for help.

postgresql=# select * from pg_locks;
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+--------+-----------------+---------+----------
relation | 16384 | 12141 | | | | | | | | 2/50 | 642064 | AccessShareLock | t | t
virtualxid | | | | | 2/50 | | | | | 2/50 | 642064 | ExclusiveLock | t | t
virtualxid | | | | | 1/1 | | | | | 1/0 | 17333 | ExclusiveLock | t | t
(3 rows)

postgresql=#

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com<mailto:satyanarlapuram(at)gmail(dot)com>>
Date: Monday, January 10, 2022 at 1:06 PM
To: Allie Crawford <CrawfordMA(at)ChurchofJesusChrist(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org<mailto:pgsql-hackers(at)lists(dot)postgresql(dot)org> <pgsql-hackers(at)lists(dot)postgresql(dot)org<mailto:pgsql-hackers(at)lists(dot)postgresql(dot)org>>
Subject: [Ext:] Re: Stream Replication not working
[External Email]
Seems there is a problem with the replay on your standby. Either it is too slow or stuck behind some locks ( replay_lag of 20:38:47.00904 indicates this and the flush_lsn is the same as lsn on primary ) . Run pg_locks to see if the replay is stuck behind a lock.

On Mon, Jan 10, 2022 at 11:53 AM Allie Crawford <CrawfordMA(at)churchofjesuschrist(dot)org<mailto:CrawfordMA(at)churchofjesuschrist(dot)org>> wrote:
Hi All,
I have implemented Stream replication in one of my environments, and for some reason even though all the health checks are showing that the replication is working, when I run manual tests to see if changes are being replicated, the changes are not replicated to the standby postgresql environment. I have been researching for two day and I cannot find any documentation that talks about the case I am running into. I will appreciate if anybody could take a look at the details I have detailed below and give me some guidance on where the problem might be that is preventing my changes for being replicated. Even though I was able to instantiate the standby while firewalld was enabled, I decided to disable it just in case that it was causing any issue to the manual changes, but disabling firewalld has not had any effect, I am still not able to get the manual changes test to be replicated to the standby site. As you will see in the details below, the streaming is working, both sites are in sync to the latest WAL but for some reasons the latest changes are not on the standby site. How is it possible that the standby site is completely in sync but yet does not contain the latest changes?

Thanks in advance for any help you can give me with this problem.

Regards,
Allie

Details:

Master postgresql Environment

postgresql=# select * from pg_stat_replication;

-[ RECORD 1 ]----+------------------------------

pid | 1979089

usesysid | 16404

usename | replacct

application_name | walreceiver

client_addr | <standby server IP>

client_hostname | <standby server name>

client_port | 55096

backend_start | 2022-01-06 17:29:51.542784-07

backend_xmin |

state | streaming

sent_lsn | 0/35000788

write_lsn | 0/35000788

flush_lsn | 0/35000788

replay_lsn | 0/31000500

write_lag | 00:00:00.001611

flush_lag | 00:00:00.001693

replay_lag | 20:38:47.00904

sync_priority | 1

sync_state | sync

reply_time | 2022-01-07 14:11:58.996277-07

postgresql=#

postgresql=# select * from pg_roles;

rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil | rolbypassrls | rolconfig | oid

---------------------------+----------+------------+---------------+-------------+-------------+----------------+--------------+-------------+---------------+--------------+-----------+-------

postgresql | t | t | t | t | t | t | -1 | ******** | | t | | 10

pg_monitor | f | t | f | f | f | f | -1 | ******** | | f | | 3373

pg_read_all_settings | f | t | f | f | f | f | -1 | ******** | | f | | 3374

pg_read_all_stats | f | t | f | f | f | f | -1 | ******** | | f | | 3375

pg_stat_scan_tables | f | t | f | f | f | f | -1 | ******** | | f | | 3377

pg_read_server_files | f | t | f | f | f | f | -1 | ******** | | f | | 4569

pg_write_server_files | f | t | f | f | f | f | -1 | ******** | | f | | 4570

pg_execute_server_program | f | t | f | f | f | f | -1 | ******** | | f | | 4571

pg_signal_backend | f | t | f | f | f | f | -1 | ******** | | f | | 4200

replacct | t | t | t | t | t | t | -1 | ******** | | t | | 16404

(10 rows)

postgresql=#

postgresql=# create database test_replication_3;

CREATE DATABASE

postgresql=#

postgresql=# select datname from pg_database;

datname

--------------------

postgres

postgresql

template1

template0

stream

test_replication

test_replication_2

test_replication_3

(8 rows)

postgresql=#

postgresql=# SELECT pg_current_wal_lsn();

pg_current_wal_lsn

--------------------

0/35000788

(1 row)

postgresql=#

Standby postgresql Environment
postgresql=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 17340
status | streaming
receive_start_lsn | 0/30000000
receive_start_tli | 1
written_lsn | 0/35000788
flushed_lsn | 0/35000788
received_tli | 1
last_msg_send_time | 2022-01-07 14:09:48.766823-07
last_msg_receipt_time | 2022-01-07 14:09:48.767581-07
latest_end_lsn | 0/35000788
latest_end_time | 2022-01-07 14:08:48.663693-07
slot_name | wal_req_x_replica
sender_host | <Master Server IP>
sender_port | <Master server postgresql port#>
conninfo | user=replacct password=******** channel_binding=prefer dbname=replication host=<Master server IP> port=<postgresql port#> fallback_application_name=walreceiver sslmode=prefer sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any

postgresql=#

postgresql=# select datname from pg_database;
datname
------------
postgres
postgresql
template1
template0
stream
(5 rows)

postgresql=# select pg_last_wal_receive_lsn();
pg_last_wal_receive_lsn
-------------------------
0/35000788
(1 row)

postgresql=#

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Michael Paquier 2022-01-19 04:42:00 Re: could not accept SSL connection: Success
Previous Message Stephen Frost 2022-01-18 19:55:08 Re: WAL Archiving and base backup

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-19 01:11:16 Re: tab-complete COMPLETE_WITH_ATTR can become confused by table-lists.
Previous Message Andres Freund 2022-01-19 01:05:46 Re: Add last commit LSN to pg_last_committed_xact()