
From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: please help! losing my subscriber node
Date: 2020-06-17 14:24:07
Message-ID: c6cfe653-f4e0-f623-2dd2-25b3c4be3bd5@matrix.gatewaynet.com
Lists: pgsql-admin

Just an update on this.

On 16/6/20 7:04 p.m., Achilleas Mantzios wrote:
>
> Hello
>
> my topology/environment is :
>
> test linux , pgsql 10.13 (subscription) <--- prod linux, pgsql 10.13
> (publication)
>
> In order to connect to our accounting system (MS SQL), I needed to do
> cross-db queries, so I set up FreeTDS (1.1.42) and tds_fdw-2.0.1 and
> tested them on the test system (the subscription node). This involved
> the IMPORT of some 1000 MS SQL tables into a dedicated schema, plus a
> few others into a second dedicated schema. All went fine and most tests
> worked nicely, so I decided to install tds_fdw on the prod host.
> Needless to say, the new dedicated MS SQL schemas were not involved in
> any logical replication. After importing the 1000 tables in prod with:
>
> IMPORT FOREIGN SCHEMA dbo EXCEPT ("....") FROM SERVER
> mssql_bdynacom_srv INTO mssql_bdynacom OPTIONS (import_default 'true');
>
> after 15 minutes (wal_sender_timeout / wal_receiver_timeout) I started
> getting on the test node:
>
>  [120394] 5ee085dc.1d64a 2020-06-16 17:10:26.450 EEST  @ line:5
> ERROR:  terminating logical replication worker due to timeout
>  [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.467 EEST  @ line:1 LOG: 
> logical replication apply worker for subscription
> "data_for_testsmadb_pub" has started
>  [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.545 EEST  @ line:2 ERROR: 
> could not start WAL streaming: ERROR:  replication slot
> "data_for_testsmadb_pub" is active for PID 10991
>
> .....
>
>
> In the meantime, the wal sender on the prod node is idle and seems
> stuck. Apparently the addition of those 1000 foreign tables messed up
> the replication slot. I tried to move the replication slot forward,
> but even right after restarting the two systems, I can't seem to find
> any handle to pass to pg_replication_origin_advance:
>
> dynacom=# select * from pg_replication_origin;
>  roident | roname
> ---------+--------
> (0 rows)
>
> dynacom=# select * from pg_replication_origin_status ;
>  local_id | external_id | remote_lsn | local_lsn
> ----------+-------------+------------+-----------
> (0 rows)
>
> dynacom=#
>
>
Apparently I was looking on the wrong node, argh. It should have been
the subscriber node. Anyway.
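
For reference, on the subscriber the same query does return a row. A
sketch of what to expect (the roname below is my assumption; for a
subscription it normally follows the pattern pg_<subscription oid>):

    dynacom=# select * from pg_replication_origin;
     roident |  roname
    ---------+----------
           1 | pg_16389
    (1 row)

That roname is what would be passed to pg_replication_origin_advance.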

>
> So, I am pretty much at a loss here. Going the guaranteed way (dropping
> the two mssql schemas, i.e. reverting to the pre-incident state, and
> re-doing the whole subscription node initialization and setup from
> scratch) seems like a possible scenario, but I'd like to avoid that:
> logical replication has worked fine for two years, and catching up
> again might take days. I am not at the office, I work from home, and
> this makes things worse. Another option is to leave the tds_fdw and
> foreign table definitions in place and re-do the logical replication
> setup. Please help!
>

I checked the number of tables that IMPORT FOREIGN SCHEMA defined: over
13000. So in one transaction I got 13000+ implicit CREATE FOREIGN
TABLE definitions. I overcame this by altering the subscriber to use a
new replication slot, creating the new replication slot by hand on the
publisher side, and kill -9'ing the old wal sender (otherwise the db
would not restart, and of course the old replication slot could not be
dropped). Then I was able to drop the old replication slot and finally
see the space in data/pg_wal decreasing at last. So the 1000-tables
figure was wrong: that is what \d reported, and that is what I naively
repeated. The true figure was 13000+ tables.
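
For the record, the recovery steps above look roughly like this (the
new slot name is my invention; the subscription name is taken from the
log lines quoted earlier):

    -- on the subscriber: detach from the stuck slot, point to a new one
    ALTER SUBSCRIPTION data_for_testsmadb_pub DISABLE;
    ALTER SUBSCRIPTION data_for_testsmadb_pub
        SET (slot_name = 'data_for_testsmadb_pub_new');

    -- on the publisher: create the new slot by hand (pgoutput plugin)
    SELECT pg_create_logical_replication_slot(
        'data_for_testsmadb_pub_new', 'pgoutput');

    -- after kill -9'ing the stuck wal sender at the OS level,
    -- drop the old slot so pg_wal can be recycled
    SELECT pg_drop_replication_slot('data_for_testsmadb_pub');

    -- on the subscriber: resume
    ALTER SUBSCRIPTION data_for_testsmadb_pub ENABLE;

Note that SET (slot_name = ...) requires the subscription to be
disabled first.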

So, all in all, I lost a little less than a day's worth of data (not
the best experience, I gotta say).
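
In case anyone wants to double-check the real count against the
catalogs instead of trusting \d, something like this does it (schema
name as in the IMPORT above):

    SELECT count(*)
    FROM pg_foreign_table ft
    JOIN pg_class c ON c.oid = ft.ftrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'mssql_bdynacom';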

Bottom line: this is clearly a bug. Also, wal_sender_timeout didn't
seem to work. The wal sender should have exited (after 15 minutes), as
no receiver was connected to it. I didn't have the resources to test
how it would behave with a more generic foreign table configuration
(e.g. postgres_fdw), to see whether it would still cause problems.
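
For anyone hitting the same thing: the state of the stuck wal sender
and its slot can be inspected on the publisher with queries along these
lines (column names per the 10.x catalogs):

    SELECT pid, state, sent_lsn, replay_lsn
    FROM pg_stat_replication;

    SELECT slot_name, active, active_pid
    FROM pg_replication_slots;

    SHOW wal_sender_timeout;

An "active" slot whose active_pid points at an idle wal sender, while
the subscriber keeps erroring with "replication slot ... is active for
PID ...", matches the symptoms above.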
