BUG #16199: pg_restore stuck on interrupts

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: git(at)rmr(dot)ninja
Subject: BUG #16199: pg_restore stuck on interrupts
Date: 2020-01-08 16:17:57
Message-ID: 16199-cb2f121146a96f9b@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16199
Logged by: Raúl Marín
Email address: git(at)rmr(dot)ninja
PostgreSQL version: 11.5
Operating system: Linux
Description:

We are seeing stuck pg_restore processes in several of our CI servers, both
with PG10 (10.2) and PG11 (11.5).
These servers test an external application which uses pg_restore quite often
to set up test environments.

A sample call might look like this:
```
/usr/lib/postgresql/10/bin/pg_restore -e --verbose -j4 --disable-triggers
-Fc
/tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/org_cd2b2fb0-96f9-476f-9589-b1d5e0b93957.dump
-U postgres -h localhost -p 5432 -d
cartodb_test_user_684a0497-9358-49ae-89b7-5feea5a6b7d7_db
--section=post-data
--use-list=/tmp/user_imports/1113d139-22c6-4dd9-8ae6-836343ad6854/data/user_org44000054-admin.list
```

With the appropriate symbols I can attach to the process and extract the
callstack (not always fully available):
```
(gdb) bt
#0 __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f4b03e152bd in _int_free (av=0x7f4b04159b20 <main_arena>,
p=0x55d43590e3b0, have_lock=0) at malloc.c:3962
#2 0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#3 0x00007f4b01bbf464 in _gnutls_ext_deinit () at gnutls_extensions.c:405
#4 0x00007f4b01bc6720 in _gnutls_global_deinit (destructor=1) at
gnutls_global.c:389
#5 0x00007f4b047eede7 in _dl_fini () at dl-fini.c:235
#6 0x00007f4b03dceff8 in __run_exit_handlers (status=status(at)entry=1,
listp=0x7f4b041595f8 <__exit_funcs>,
run_list_atexit=run_list_atexit(at)entry=true)
at exit.c:82
#7 0x00007f4b03dcf045 in __GI_exit (status=status(at)entry=1) at exit.c:104
#8 0x000055d4337dbb5e in sigTermHandler (postgres_signal_arg=<optimized
out>)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613
#9 <signal handler called>
#10 _int_free (av=0x7f4b04159b20 <main_arena>, p=<optimized out>,
have_lock=0) at malloc.c:4017
#11 0x00007f4b03e1953c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#12 0x00007f4b03e1054b in __GI__IO_setb (f=f(at)entry=0x55d435914240,
b=b(at)entry=0x0, eb=eb(at)entry=0x0, a=a(at)entry=0) at genops.c:382
#13 0x00007f4b03e0e8ee in _IO_new_file_close_it (fp=fp(at)entry=0x55d435914240)
at fileops.c:185
#14 0x00007f4b03e023ef in _IO_new_fclose (fp=0x55d435914240) at
iofclose.c:58
#15 0x000055d4337d798e in _ReopenArchive (AH=0x55d435aa94e0)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_custom.c:763
#16 0x000055d4337dc3af in RunWorker (slot=0x55d435ada4b8,
AH=0x55d435aa94e0)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:866
#17 ParallelBackupStart (AH=AH(at)entry=0x55d43590fe00) at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:1027
#18 0x000055d4337d4418 in RestoreArchive (AHX=0x55d43590fe00)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_backup_archiver.c:649
#19 0x000055d4337ce9ca in main (argc=17, argv=<optimized out>)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/pg_restore.c:429
```

I have a some extra processes with the same issue (7 full stacks out of 20,
the others are garbage) and, from what I see, they all have in common that
the process has received a signal while it was doing a memory operation,
either a malloc or a free:

* Malloc example:
```
#8 0x00005634a8ebeb5e in sigTermHandler (postgres_signal_arg=<optimized
out>)
at
/build/postgresql-10-bx1JAx/postgresql-10-10.1.2+carto/build/../src/bin/pg_dump/parallel.c:613
#9 <signal handler called>
#10 _int_malloc (av=av(at)entry=0x7fce15142b20 <main_arena>,
bytes=bytes(at)entry=17584) at malloc.c:3644
#11 0x00007fce14e02184 in __GI___libc_malloc (bytes=17584) at
malloc.c:2913
#12 0x00007fce14732e78 in CRYPTO_malloc () from
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x00007fce14b3dafd in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#14 0x00007fce14b3db68 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x00007fce14b3fb6c in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x00007fce155a4fb5 in open_client_SSL (conn=0x5634a9a1a370)
```

* Free example:
```
#8 0x0000563455b60e7e in sigTermHandler (postgres_signal_arg=<optimized
out>)
at
/build/postgresql-11-3orpls/postgresql-11-11.5.2+carto/build/../src/bin/pg_dump/parallel.c:615
#9 <signal handler called>
#10 0x00007fbba21bb8e4 in _int_free (av=0x7fbba24ffb20 <main_arena>,
p=0x5634579993b0, have_lock=0) at malloc.c:3976
#11 0x00007fbba21bf53c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#12 0x0000563455b6140e in set_archive_cancel_info
(AH=AH(at)entry=0x5634579ad350, conn=conn(at)entry=0x0)
```

After that, the process gives control to `sigTermHandler` which calls
`exit(1)` to die, but that gives control to whoever has setup atexit
functions, in our case `gnutls`. Once gnutls tries to call `free` as part of
its cleanup process, it will try to acquire the lock and block, but since
the lock is being held by the function interrupted by the signal the process
self locks and hangs forever.

I think if would be safer to use a similar approach to other processes, that
is use the handler to only enable a global flag and check that in the main
loop, but I'm having a hard time locating what the proper place to check the
flag would be.

Regards,
Raúl Marín.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2020-01-08 17:13:35 Re: BUG #16199: pg_restore stuck on interrupts
Previous Message Dave Cramer 2020-01-08 14:21:32 Re: publication not found, yet it is there.??