Issues with logical replication

From: Konstantin Knizhnik <knizhnik(at)development(dot)mcobject(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Issues with logical replication
Date: 2017-10-02 15:47:48
Message-ID: 26f99fe7-564d-9b2a-cf9e-ccb70ecf0ac2@development.mcobject.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have faced two issues with logical replication.
Repro scenario:

1. start two Postgres instances (I start both at local machine).
2. Initialize pgbench tables at both instances:
pgbench -i -s 10 postgres
3. Create publication of pgbench_accounts table at one node:
create publication pub for table pgbench_accounts;
4. Create subscription at another node:
delete from pgbench_accounts;
CREATE SUBSCRIPTION sub connection 'dbname=postgres host=localhost
port=5432 sslmode=disable' publication pub;
CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS
TRIGGER AS $$
BEGIN
-- RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER replication_trigger BEFORE INSERT OR UPDATE OR DELETE
ON pgbench_accounts FOR EACH ROW EXECUTE PROCEDURE
replication_trigger_proc();
ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER replication_trigger;
5. Start pgbench at primary node:
pgbench -T 1000 -P 2 -c 10 postgres

Please notice commented "return new" statement. Invocation of this
function cause error and in log we see repeated messages:

2017-10-02 16:47:46.764 MSK [32129] LOG: logical replication table
synchronization worker for subscription "sub", table "pgbench_accounts"
has started
2017-10-02 16:47:46.771 MSK [32129] ERROR: control reached end of
trigger procedure without RETURN
2017-10-02 16:47:46.771 MSK [32129] CONTEXT: PL/pgSQL function
replication_trigger_proc()
COPY pgbench_accounts, line 1: "1 1 0 "
2017-10-02 16:47:46.772 MSK [28020] LOG: worker process: logical
replication worker for subscription 17264 sync 17251 (PID 32129) exited
with exit code 1
...

After few minutes of work primary node (with publication) is crashed
with the following stack trace:

Program terminated with signal SIGABRT, Aborted.
#0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f3608f92028 in __GI_abort () at abort.c:89
#2 0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30
"!(((xid) != ((TransactionId) 0)))",
errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
lineNumber=582) at assert.c:54
#3 0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
oper=XLTW_None) at lmgr.c:582
#4 0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
cutoff=898498) at snapbuild.c:1400
#5 0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
lsn=798477760, running=0x2749198) at snapbuild.c:1311
#6 0x000000000081c339 in SnapBuildProcessRunningXacts
(builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
#7 0x000000000080a1c7 in DecodeStandbyOp (ctx=0x27ef870,
buf=0x7ffd301858d0) at decode.c:314
#8 0x0000000000809ce1 in LogicalDecodingProcessRecord (ctx=0x27ef870,
record=0x27efb30) at decode.c:117
#9 0x000000000080ddf9 in DecodingContextFindStartpoint (ctx=0x27ef870)
at logical.c:458
#10 0x0000000000823968 in CreateReplicationSlot (cmd=0x27483a8) at
walsender.c:934
#11 0x00000000008246ee in exec_replication_command (
cmd_string=0x27b9520 "CREATE_REPLICATION_SLOT
\"sub_17264_sync_17251\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at
walsender.c:1511
#12 0x000000000088eb44 in PostgresMain (argc=1, argv=0x275b738,
dbname=0x275b578 "postgres", username=0x272b800 "knizhnik")
at postgres.c:4086
#13 0x00000000007ef9a1 in BackendRun (port=0x27532a0) at postmaster.c:4357
#14 0x00000000007ef0cb in BackendStartup (port=0x27532a0) at
postmaster.c:4029
#15 0x00000000007eb68b in ServerLoop () at postmaster.c:1753
#16 0x00000000007eac77 in PostmasterMain (argc=3, argv=0x2729670) at
postmaster.c:1361
#17 0x0000000000728552 in main (argc=3, argv=0x2729670) at main.c:228

Now fix the trigger function:
CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END $$ LANGUAGE plpgsql;

And manually perform at master two updates inside one transaction:

postgres=# begin;
BEGIN
postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
UPDATE 1
postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
UPDATE 1
postgres=# commit;
<hangs>

and in replcas log we see:
2017-10-02 18:40:26.094 MSK [2954] LOG: logical replication apply
worker for subscription "sub" has started
2017-10-02 18:40:26.101 MSK [2954] ERROR: attempted to lock invisible tuple
2017-10-02 18:40:26.102 MSK [2882] LOG: worker process: logical
replication worker for subscription 16403 (PID 2954) exited with exit code 1

Error happens in trigger.c:
#3 0x000000000069bddb in GetTupleForTrigger (estate=0x2e36b50,
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
trigger.c:3103
#4 0x000000000069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
#5 0x00000000006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
at execReplication.c:461
#6 0x0000000000820894 in apply_handle_update (s=0x7ffc442163b0) at
worker.c:736
#7 0x0000000000820d83 in apply_dispatch (s=0x7ffc442163b0) at worker.c:892
#8 0x000000000082121b in LogicalRepApplyLoop (last_received=2225693736)
at worker.c:1095
#9 0x000000000082219d in ApplyWorkerMain (main_arg=0) at worker.c:1647
#10 0x00000000007dd496 in StartBackgroundWorker () at bgworker.c:835
#11 0x00000000007f0889 in do_start_bgworker (rw=0x2d75f50) at
postmaster.c:5680
#12 0x00000000007f0bc3 in maybe_start_bgworkers () at postmaster.c:5884
#13 0x00000000007efbeb in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5073
#14 <signal handler called>
#15 0x00007fbe83a054a3 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:81
#16 0x00000000007eb517 in ServerLoop () at postmaster.c:1717
#17 0x00000000007eac48 in PostmasterMain (argc=3, argv=0x2d4e660) at
postmaster.c:1361
#18 0x0000000000728523 in main (argc=3, argv=0x2d4e660) at main.c:228

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2017-10-02 15:48:02 Commitfest 201709 is now closed
Previous Message Alexander Korotkov 2017-10-02 15:24:30 Re: [PATCH]make pg_rewind to not copy useless WAL files