please help! losing my subscriber node

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: please help! losing my subscriber node
Date: 2020-06-16 16:04:05
Message-ID: 6fa054d8-ad14-42a2-8926-5d79c97ecd65@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hello

my topology/environment is :

test linux , pgsql 10.13 (subscription) <--- prod linux, pgsql 10.13
(publication)

In order to connect to our accounting system (ms sql) , I needed to do
cross-db queries, so I setup freetds (1.1.42) and tds_fdw-2.0.1 and
tested them on the test system (subscription node). This involved the
IMPORT of some 1000 ms sql tables in a dedicated schema, plus a few
others in a second dedicated schema. All went fine, most tests worked
nice, so I thought of installing tds_fdw in the prod host. Needless to
say, the new mssql dedicated schemas did not involve any logical
replication . After importing the 1000 tables in prod with,

IMPORT FOREIGN SCHEMA dbo EXCEPT ("....") FROM SERVER mssql_bdynacom_srv
INTO mssql_bdynacom OPTIONS (import_default 'true');

after 15 mins (wal_sender_timeout / wal_receiver_timeout) I started
getting in the test node :

 [120394] 5ee085dc.1d64a 2020-06-16 17:10:26.450 EEST  @ line:5 ERROR: 
terminating logical replication worker due to timeout
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.467 EEST  @ line:1 LOG: 
logical replication apply worker for subscription
"data_for_testsmadb_pub" has started
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.545 EEST  @ line:2 ERROR: 
could not start WAL streaming: ERROR:  replication slot
"data_for_testsmadb_pub" is active for PID 10991

.....

in the meantime the "wal sender" on the prod node is idle. And seems
stuck. Apparently the addition of those 1000 foreign tables messed up
with the replication slot. I tried to move forward the replication slot,
but even right after restarting the two systems , I can't seem to find
any handle to call : |pg_replication_origin_advance |

||dynacom=# select * from pg_replication_origin;
 roident | roname
---------+--------
(0 rows)

dynacom=# select * from pg_replication_origin_status ;
 local_id | external_id | remote_lsn | local_lsn
----------+-------------+------------+-----------
(0 rows)

dynacom=#
||||||

So, I am pretty much at loss here . Go the guarantied way : drop the two
mssql schemas (i.e. revert to pre-incident state) and re-doing the whole
subscription node initialization and setup from scratch seems like a
possible scenario but I'd like to avoid that, logical replication works
fine for two years, and it will take maybe days to catch up, I am not at
the office, I work from home and this makes things worse. Another option
is to leave the tds_fdw and the foreign table definitions and re-do the
logical replication setup. Please help!

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Wells Oliver 2020-06-16 16:47:47 Re: Deleting more efficiently from large partitions
Previous Message Pepe TD Vo 2020-06-16 15:42:18 Re: create batch script to import into postgres tables