Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

From: Robins Tharakan <tharakan(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier
Date: 2023-01-31 00:34:58
Message-ID: CAEP4nAyyH-kMfWPLDMwKT6BFv6-dA734221Uywi6DxLmx0L1NQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

On Tue, 31 Jan 2023 at 06:31, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:
> > Why don't the dblink tests catch this? Any chance you or Robins could prepare
> > a patch with fix and test, given that you know how to trigger this?
>
> It's trivially reproducible by calling 1-argument dblink_connect() multiple
> times and then calling dblink_disconnect(). Here's a patch.
>

My test instance has been running Nathan's patch for the past
few hours, and it looks like this should resolve the issue.

A little bit of background, since the past few days I was
noticing frequent crashes on a test instance. They're not simple
to reproduce manually, but if left running I can reliably see
crashes on an ~hourly basis.

In trying to trace the origin, I ran a multiple-hour test for
each commit going back a few commits and noticed that the
crashes stopped prior to commit e4602483e9, which is when the
issue became visible.

The point is now moot, but let me know if full backtraces still
help (I was concerned looking at the excerpts from some
of the crashes):

"double free or corruption (out)"
"munmap_chunk(): invalid pointer"
"free(): invalid pointer"
str->data[0] = '\0';

Some backtraces
###############

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 6c6d6ba3ee(at)master@sqith: u21 postgres
127.0.0.1(45334) SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#0 __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#1 0x00007fc0062dfefd in pqDropConnection (conn=0x564bb3e1a080,
flushInput=true) at fe-connect.c:495
#2 0x00007fc0062e5cb3 in closePGconn (conn=0x564bb3e1a080)
at fe-connect.c:4112
#3 0x00007fc0062e5d55 in PQfinish (conn=0x564bb3e1a080) at fe-connect.c:4134
#4 0x00007fc0061a442b in libpqsrv_disconnect (conn=0x564bb3e1a080)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#5 0x00007fc0061a4df1 in dblink_disconnect (fcinfo=0x564bb3d980f0)
at dblink.c:357
#6 0x0000564bb0e70aa7 in ExecInterpExpr (state=0x564bb3d98018,
econtext=0x564bb3d979a0, isnull=0x7ffd60824b0f) at execExprInterp.c:728
#7 0x0000564bb0e72f36 in ExecInterpExprStillValid (state=0x564bb3d98018,
econtext=0x564bb3d979a0, isNull=0x7ffd60824b0f) at execExprInterp.c:1838

============

Core was generated by `postgres: 6c6d6ba3ee(at)master@sqith: u52 postgres
127.0.0.1(58778) SELECT '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fc021792859 in __GI_abort () at abort.c:79
#2 0x00007fc0217fd26e in __libc_message (action=action(at)entry=do_abort,
fmt=fmt(at)entry=0x7fc021927298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007fc0218052fc in malloc_printerr (
str=str(at)entry=0x7fc021929670 "double free or corruption (out)")
at malloc.c:5347
#4 0x00007fc021806fa0 in _int_free (av=0x7fc02195cb80 <main_arena>,
p=0x7fc02195cbf0 <main_arena+112>, have_lock=<optimized out>)
at malloc.c:4314
#5 0x00007fc0062e16ed in freePGconn (conn=0x564bb6e36b80)
at fe-connect.c:3977
#6 0x00007fc0062e1d61 in PQfinish (conn=0x564bb6e36b80) at fe-connect.c:4135
#7 0x00007fc0062a142b in libpqsrv_disconnect (conn=0x564bb6e36b80)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8 0x00007fc0062a1df1 in dblink_disconnect (fcinfo=0x564bb5647b58)
at dblink.c:357

============

Program terminated with signal SIGSEGV, Segmentation fault.
#0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
153 str->data[0] = '\0';
#0 resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
#1 0x00007f2240b0a876 in PQexecStart (conn=0x559d3af0e410) at fe-exec.c:2320
#2 0x00007f2240b0a688 in PQexec (conn=0x559d3af0e410,
query=0x559d56fb8ee8 "p3") at fe-exec.c:2227
#3 0x00007f223ba8c7e4 in dblink_exec (fcinfo=0x559d3b101f58) at dblink.c:1432
#4 0x0000559d2f003c82 in ExecInterpExpr (state=0x559d3b101ac0,
econtext=0x559d34e76578, isnull=0x7ffe3d590fdf) at execExprInterp.c:752

============

Core was generated by `postgres: 728f86fec6@(HEAD detached at
728f86fec6)@sqith: u19 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f96f5fc6e99 in SSL_shutdown ()
from /lib/x86_64-linux-gnu/libssl.so.1.1
#0 0x00007f96f5fc6e99 in SSL_shutdown ()
from /lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x00007f96da56027a in pgtls_close (conn=0x55d919752fb0)
at fe-secure-openssl.c:1555
#2 0x00007f96da558e41 in pqsecure_close (conn=0x55d919752fb0)
at fe-secure.c:192
#3 0x00007f96da53dd12 in pqDropConnection (conn=0x55d919752fb0,
flushInput=true) at fe-connect.c:449
#4 0x00007f96da543cb3 in closePGconn (conn=0x55d919752fb0)
at fe-connect.c:4112
#5 0x00007f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134
#6 0x00007f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7 0x00007f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8)
at dblink.c:357

============

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2 0x00007f5f6b69d26e in __libc_message (action=action(at)entry=do_abort,
fmt=fmt(at)entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007f5f6b6a52fc in malloc_printerr (
str=str(at)entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer")
at malloc.c:5347
#4 0x00007f5f6b6a554c in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5 0x00007f5f50085efd in pqDropConnection (conn=0x55d12ebcd100,
flushInput=true) at fe-connect.c:495
#6 0x00007f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100)
at fe-connect.c:4112
#7 0x00007f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134
#8 0x00007f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117

============

Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2 0x00007f5f6b69d26e in __libc_message (action=action(at)entry=do_abort,
fmt=fmt(at)entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007f5f6b6a52fc in malloc_printerr (
str=str(at)entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347
#4 0x00007f5f6b6a6b2c in _int_free (av=<optimized out>, p=<optimized out>,
have_lock=0) at malloc.c:4173
#5 0x00007f5f500fe6ed in freePGconn (conn=0x55d142273000)
at fe-connect.c:3977
#6 0x00007f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135
#7 0x00007f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8 0x00007f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8)
at dblink.c:357

============

Core was generated by `postgres: e4602483e9@(HEAD detached at
e4602483e9)@sqith: u73 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#0 __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
at malloc.c:3154
#1 0x00007fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148,
conn=0x55b191aa9380) at fe-misc.c:329
#2 0x00007fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380)
at fe-misc.c:476
#3 0x00007fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380)
at fe-connect.c:4076
#4 0x00007fb7bc097c97 in closePGconn (conn=0x55b191aa9380)
at fe-connect.c:4096
#5 0x00007fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134
#6 0x00007fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380)
at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7 0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00)
at dblink.c:357

============

Thanks to SQLSmith for helping with this find.

-
Robins Tharakan
Amazon Web Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-01-31 01:00:05 Re: pg_upgrade test failure
Previous Message Justin Pryzby 2023-01-31 00:24:58 Re: Introduce "log_connection_stages" setting.