Re: Segfault when a replica trigger tries to refresh subscription

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeremy Finzel <finzelj(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Segfault when a replica trigger tries to refresh subscription
Date: 2020-11-11 21:58:20
Message-ID: 20201111215820.qihhrz7fayu6myfi@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2020-11-11 12:23:06 -0600, Jeremy Finzel wrote:
> I have only attempted to reproduce this on pg 11.9.

It's reproducible on HEAD.

> Logs:
> 2020-11-11 18:14:09.535 GMT,,,21636,,5fa58af6.5484,8583,,2020-11-06
> 17:42:14 GMT,,0,LOG,00000,"background worker ""logical replication worker""
> (PID 24492) wa
> s terminated by signal 11: Segmentation fault",,,,,,,,,""

The backtrace is

#0 __GI___libc_free (mem=0x7f7f7f7f7f7f7f7f) at malloc.c:3102
#1 0x00007ffbef93cea8 in PQfreemem (ptr=0x7f7f7f7f7f7f7f7f) at /home/andres/src/postgresql/src/interfaces/libpq/fe-exec.c:3298
#2 0x00007ffbf462e4f7 in libpqrcv_receive (conn=0x5637f8d91740, buffer=0x7ffd7d7fd770, wait_fd=0x7ffd7d7fd778)
at /home/andres/src/postgresql/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:735
#3 0x00005637f773ec76 in LogicalRepApplyLoop (last_received=23489672) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:2173
#4 0x00005637f7740733 in ApplyWorkerMain (main_arg=0) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:3091
#5 0x00005637f76f11dd in StartBackgroundWorker () at /home/andres/src/postgresql/src/backend/postmaster/bgworker.c:820
#6 0x00005637f770672d in do_start_bgworker (rw=0x5637f8caf500) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5837
#7 0x00005637f7706b01 in maybe_start_bgworkers () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:6062
#8 0x00005637f7705a68 in sigusr1_handler (postgres_signal_arg=10) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5223
#9 <signal handler called>
#10 0x00007ffbf395d7c6 in __GI___select (nfds=8, readfds=0x7ffd7d7fe770, writefds=0x0, exceptfds=0x0, timeout=0x7ffd7d7fe7f0)
at ../sysdeps/unix/sysv/linux/select.c:41
#11 0x00005637f77012ca in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1691
#12 0x00005637f7700c67 in PostmasterMain (argc=42, argv=0x5637f8c81e20) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1400
#13 0x00005637f7601934 in main (argc=42, argv=0x5637f8c81e20) at /home/andres/src/postgresql/src/backend/main/main.c:209

i.e. the connection is freed at this point.

The problem seems to be that nobody thought about refresh being called
in an apply worker. Adding a check for refresh overwriting the worker's
connection yields:

#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f2242be5537 in __GI_abort () at abort.c:79
#2 0x000055e28c1a7f80 in errfinish (filename=0x7f2243987065 "libpqwalreceiver.c", lineno=706, funcname=0x7f2243987870 <__func__.5> "libpqrcv_disconnect")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#3 0x00007f224398647e in libpqrcv_disconnect (conn=0x55e28d2c4750)
at /home/andres/src/postgresql/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:706
#4 0x000055e28bd7aa4f in AlterSubscription_refresh (sub=0x55e28d2c3f40, copy_data=true)
at /home/andres/src/postgresql/src/backend/commands/subscriptioncmds.c:583
#5 0x000055e28bd7b4fe in AlterSubscription (stmt=0x55e28d2e0188) at /home/andres/src/postgresql/src/backend/commands/subscriptioncmds.c:880
#6 0x000055e28c021e5e in ProcessUtilitySlow (pstate=0x55e28d2c3e28, pstmt=0x55e28d2e00f0,
queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION", context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0,
dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840) at /home/andres/src/postgresql/src/backend/tcop/utility.c:1828
#7 0x000055e28c0204db in standard_ProcessUtility (pstmt=0x55e28d2e00f0, queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840)
at /home/andres/src/postgresql/src/backend/tcop/utility.c:1071
#8 0x000055e28c01f595 in ProcessUtility (pstmt=0x55e28d2e00f0, queryString=0x55e28d2bc448 "ALTER SUBSCRIPTION foo REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x55e28c4c5980 <spi_printtupDR>, qc=0x7ffe5de60840)
at /home/andres/src/postgresql/src/backend/tcop/utility.c:524
#9 0x000055e28be3833e in _SPI_execute_plan (plan=0x55e28d2dfc50, paramLI=0x0, snapshot=0x0, crosscheck_snapshot=0x0, read_only=false, fire_triggers=true,
tcount=0, caller_dest=0x0) at /home/andres/src/postgresql/src/backend/executor/spi.c:2443
#10 0x000055e28be34d29 in SPI_execute_plan_with_paramlist (plan=0x55e28d2dfc50, params=0x0, read_only=false, tcount=0)
at /home/andres/src/postgresql/src/backend/executor/spi.c:578
#11 0x00007f223eb91611 in exec_stmt_execsql (estate=0x7ffe5de60ca0, stmt=0x55e28d2c7fa0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:4288
#12 0x00007f223eb8ccb6 in exec_stmts (estate=0x7ffe5de60ca0, stmts=0x55e28d2c7ff8) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:2055
#13 0x00007f223eb8c88c in exec_stmt_block (estate=0x7ffe5de60ca0, block=0x55e28d2c80d0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1906
#14 0x00007f223eb8c09c in exec_toplevel_block (estate=0x7ffe5de60ca0, block=0x55e28d2c80d0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1604
#15 0x00007f223eb8abe9 in plpgsql_exec_trigger (func=0x55e28d1e60d8, trigdata=0x7ffe5de610c0) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_exec.c:1019
#16 0x00007f223eba4c46 in plpgsql_call_handler (fcinfo=0x7ffe5de61040) at /home/andres/src/postgresql/src/pl/plpgsql/src/pl_handler.c:256
#17 0x000055e28bda9d6b in ExecCallTriggerFunc (trigdata=0x7ffe5de610c0, tgindx=0, finfo=0x55e28d29cd30, instr=0x0, per_tuple_context=0x55e28d2b4120)
at /home/andres/src/postgresql/src/backend/commands/trigger.c:2077
#18 0x000055e28bdaa2ed in ExecBRInsertTriggers (estate=0x55e28d2b2230, relinfo=0x55e28d29bc60, slot=0x55e28d29b858)
at /home/andres/src/postgresql/src/backend/commands/trigger.c:2206
#19 0x000055e28bde890c in ExecSimpleRelationInsert (resultRelInfo=0x55e28d29bc60, estate=0x55e28d2b2230, slot=0x55e28d29b858)
at /home/andres/src/postgresql/src/backend/executor/execReplication.c:422
#20 0x000055e28bf94f29 in apply_handle_insert_internal (relinfo=0x55e28d29bc60, estate=0x55e28d2b2230, remoteslot=0x55e28d29b858)
at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1214
#21 0x000055e28bf94e9d in apply_handle_insert (s=0x7ffe5de61290) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1190
#22 0x000055e28bf96659 in apply_dispatch (s=0x7ffe5de61290) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:1912
#23 0x000055e28bf96bab in LogicalRepApplyLoop (last_received=23489672) at /home/andres/src/postgresql/src/backend/replication/logical/worker.c:2150

It doesn't at all seem necessary for AlterSubscription_refresh() to use
the global 'wrconn' variable. This code has basically survived
unmodified since
commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut <peter_e(at)gmx(dot)net>
Date: 2017-03-23 08:36:36 -0400

Logical replication support for initial data copy

The crash is trivial to fix - simply use a local variable instead of
wrconn. There's a few more places than what you hit here that need to be
dealt with though. Obviously the code also ought to reset wrconn to NULL
after freeing.

But uh, isn't this code also just plainly in danger of leaking libpq
connections in case something goes wrong?

/* Try to connect to the publisher. */
conn = walrcv_connect(sub->conninfo, true, sub->name, &err);
if (!conn)
ereport(ERROR,
(errmsg("could not connect to the publisher: %s", err)));

/* Get the table list from publisher. */
pubrel_names = fetch_table_list(conn, sub->publications);

/* We are done with the remote side, close connection. */
walrcv_disconnect(conn);

What if fetch_table_list() errors out? Oddly other cases seem to pay
some attention to this, see e.g. DropSubscription().

Greetings,

Andres Freund

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2020-11-11 22:12:25 Re: BUG #16708: Please add some more restrictions to the default systemd unit files
Previous Message Stephen Frost 2020-11-11 21:53:00 Re: BUG #16708: Please add some more restrictions to the default systemd unit files