From: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Issues with logical replication |
Date: | 2017-10-02 16:59:41 |
Message-ID: | d0588e65-289c-9768-7a43-2e8ef4d05620@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 02/10/17 17:49, Konstantin Knizhnik wrote:
> I have faced two issues with logical replication.
> Reproducing scenario:
>
> 1. start two Postgres instances (I start both at local machine).
> 2. Initialize pgbench tables at both instances:
> pgbench -i -s 10 postgres
> 3. Create publication of pgbench_accounts table at one node:
> create publication pub for table pgbench_accounts;
> 4. Create subscription at another node:
> delete from pgbench_accounts;
> CREATE SUBSCRIPTION sub connection 'dbname=postgres host=localhost
> port=5432 sslmode=disable' publication pub;
> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS
> TRIGGER AS $$
> BEGIN
> -- RETURN NEW;
> END $$ LANGUAGE plpgsql;
> CREATE TRIGGER replication_trigger BEFORE INSERT OR UPDATE OR DELETE
> ON pgbench_accounts FOR EACH ROW EXECUTE PROCEDURE
> replication_trigger_proc();
> ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER replication_trigger;
> 5. Start pgbench at primary node:
> pgbench -T 1000 -P 2 -c 10 postgres
>
>
> Please notice commented "return new" statement. Invocation of this
> function cause error and in log we see repeated messages:
>
> 2017-10-02 16:47:46.764 MSK [32129] LOG: logical replication table
> synchronization worker for subscription "sub", table "pgbench_accounts"
> has started
> 2017-10-02 16:47:46.771 MSK [32129] ERROR: control reached end of
> trigger procedure without RETURN
> 2017-10-02 16:47:46.771 MSK [32129] CONTEXT: PL/pgSQL function
> replication_trigger_proc()
> COPY pgbench_accounts, line 1: "1 1 0 "
> 2017-10-02 16:47:46.772 MSK [28020] LOG: worker process: logical
> replication worker for subscription 17264 sync 17251 (PID 32129) exited
> with exit code 1
> ...
>
>
> After few minutes of work primary node (with publication) is crashed
> with the following stack trace:
>
> Program terminated with signal SIGABRT, Aborted.
> #0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1 0x00007f3608f92028 in __GI_abort () at abort.c:89
> #2 0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30
> "!(((xid) != ((TransactionId) 0)))",
> errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
> lineNumber=582) at assert.c:54
> #3 0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
> oper=XLTW_None) at lmgr.c:582
> #4 0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
> cutoff=898498) at snapbuild.c:1400
> #5 0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
> lsn=798477760, running=0x2749198) at snapbuild.c:1311
> #6 0x000000000081c339 in SnapBuildProcessRunningXacts
> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
transaction ids but we specifically skip those in
GetRunningTransactionData(). Can you check pg_waldump output for that
record (lsn is shown in the backtrace)?
>
> Now fix the trigger function:
> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
> BEGIN
> RETURN NEW;
> END $$ LANGUAGE plpgsql;
>
> And manually perform at master two updates inside one transaction:
>
> postgres=# begin;
> BEGIN
> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
> UPDATE 1
> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
> UPDATE 1
> postgres=# commit;
> <hangs>
>
> and in replica log we see:
> 2017-10-02 18:40:26.094 MSK [2954] LOG: logical replication apply
> worker for subscription "sub" has started
> 2017-10-02 18:40:26.101 MSK [2954] ERROR: attempted to lock invisible
> tuple
> 2017-10-02 18:40:26.102 MSK [2882] LOG: worker process: logical
> replication worker for subscription 16403 (PID 2954) exited with exit
> code 1
>
> Error happens in trigger.c:
>
> #3 0x000000000069bddb in GetTupleForTrigger (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
> lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
> trigger.c:3103
> #4 0x000000000069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
> fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
> #5 0x00000000006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
> at execReplication.c:461
> #6 0x0000000000820894 in apply_handle_update (s=0x7ffc442163b0) at
> worker.c:736
We have locked the same tuple in RelationFindReplTupleByIndex() just
before this gets called and didn't get the same error. I guess we do
something wrong with snapshots. Will need to investigate more.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2017-10-02 17:02:26 | Re: [BUG] Cache invalidation for queries that contains const of temporary composite type |
Previous Message | Alexander Korotkov | 2017-10-02 16:51:03 | Re: GSoC 2017: weekly progress reports (week 6) |