Re: stress test for parallel workers

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: stress test for parallel workers
Date: 2019-10-22 22:03:58
Message-ID: 20191022220358.GA29301@telsasoft.com

I want to give some conclusion to our occurrence of this, which I now think was
neither an instance of, nor indicative of, any bug. Summary: postgres was being
killed with kill -9 by a deployment script after it "timed out", hence no log messages.

I initially experienced this while testing migration of a customer's DB using
an ansible script, which did "async: 2222, poll: 0", which I guess ansible takes
to mean "let the process run for 2222 seconds, after which send it SIGKILL". I
probably made multiple attempts to migrate (for example to fix tablespaces or
dependencies on postgis), and the postmaster started during one test was never
stopped nor restarted during the following tests, until it finally hit 2222 sec
and was stopped ungracefully.
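
For illustration only, roughly what that deployment amounts to as an ad-hoc
command (a sketch under my reading of async/poll above; the actual migration
command and playbook are not shown here):

# hypothetical ad-hoc equivalent of the playbook's "async: 2222, poll: 0":
# start the job in the background, never poll it, and (per the behavior
# described above) have it SIGKILLed once the 2222-second limit expires
ansible 'database.customer.*' --background=2222 --poll=0 -m command -a '<migration command>'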

This just happened again, during a different migration, so I reproduced it like this:

ansible --sudo --sudo-user pryzbyj 'database.customer.*' --background=1 --poll=1 -m command -a '/usr/pgsql-12/bin/postmaster -D /home/pryzbyj/pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1'

Connect to the customer's server and verify that it was killed uncleanly:

[pryzbyj(at)database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:57:58.251 EDT [5895] FATAL: lock file "postmaster.pid" already exists
2019-10-22 17:57:58.251 EDT [5895] HINT: Is another postmaster (PID 5608) running in data directory "/home/pryzbyj/./pg12.dat"?
[pryzbyj(at)database ~]$ /usr/pgsql-12/bin/postmaster -D ./pg12.dat -c port=5678 -c unix-socket-directories=/tmp -c client_min_messages=debug -c log_temp_files=0 -c log_lock_waits=1 -c logging_collector=off
2019-10-22 17:58:01.312 EDT [5962] LOG: starting PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv6 address "::1", port 5678
2019-10-22 17:58:01.312 EDT [5962] LOG: listening on IPv4 address "127.0.0.1", port 5678
2019-10-22 17:58:01.328 EDT [5962] LOG: listening on Unix socket "/tmp/.s.PGSQL.5678"
2019-10-22 17:58:01.353 EDT [5963] LOG: database system was interrupted; last known up at 2019-10-22 17:57:48 EDT
2019-10-22 17:58:01.460 EDT [5963] LOG: database system was not properly shut down; automatic recovery in progress
2019-10-22 17:58:01.478 EDT [5963] LOG: invalid record length at 6/E829D128: wanted 24, got 0
2019-10-22 17:58:01.478 EDT [5963] LOG: redo is not required
2019-10-22 17:58:01.526 EDT [5962] LOG: database system is ready to accept connections
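
Another way to confirm the unclean kill (a hypothetical check, not part of the
original report, assuming the data directory used above): watch the PID recorded
in postmaster.pid and note that it vanishes without postgres ever logging a
shutdown.

# the first line of postmaster.pid is the postmaster's PID
pid=$(head -1 /home/pryzbyj/pg12.dat/postmaster.pid)
# wait for the process to disappear, then check its log: a clean shutdown
# would have logged "shutting down" / "database system is shut down"
while kill -0 "$pid" 2>/dev/null; do sleep 1; done
echo "postmaster $pid is gone; its log ends without any shutdown messages"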

On Tue, Jul 23, 2019 at 11:27:03AM -0500, Justin Pryzby wrote:
> Does anyone have a stress test for parallel workers ?
>
> On a customer's new VM, I got this several times while (trying to) migrate their DB:
>
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX unused0_huawei_umts_nodeb_locell_201907_unique_idx ON child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...
>
> There's nothing in dmesg nor in postgres logs.
>
> At first I thought it's maybe because of a network disconnection, then I
> thought it's because we ran out of space (wal), then they had a faulty blade.
> After that, I'd tried and failed to reproduce it a number of times, but it's
> just recurred during what was intended to be their final restore. I've set
> max_parallel_workers_per_gather=0, but I'm planning to try to diagnose an issue
> in another instance. Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.
>
> DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
> CentOS release 6.9 (Final)
> Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
