Skip site navigation (1) Skip section navigation (2)

Instability in select_parallel regression test

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Instability in select_parallel regression test
Date: 2017-02-17 05:52:14
Message-ID: 30673.1487310734@sss.pgh.pa.us (view raw, whole thread or download thread mbox)
Thread:
Lists: pgsql-hackers
Buildfarm members gaur, pademelon, and gharial have all recently shown
failures like this:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/select_parallel.out	Thu Feb 16 20:35:14 2017
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/select_parallel.out	Thu Feb 16 21:17:58 2017
***************
*** 163,167 ****
  -- provoke error in worker
  select stringu1::int2 from tenk1 where unique1 = 1;
  ERROR:  invalid input syntax for integer: "BAAAAA"
- CONTEXT:  parallel worker
  rollback;
--- 163,166 ----

======================================================================

The failure appears intermittent on gharial but is quite reproducible
on gaur/pademelon.  I inserted some debug elog calls and got this trace
of events:

2017-02-17 00:28:32.641 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 1
2017-02-17 00:28:32.641 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.643 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 1 workers
2017-02-17 00:28:32.643 EST [18934] STATEMENT:  select count(*) from a_star;
2017-02-17 00:28:32.731 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 2
2017-02-17 00:28:32.731 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.824 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 3
2017-02-17 00:28:32.824 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.923 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 4
2017-02-17 00:28:32.923 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  LaunchParallelWorkers: registered 3 of 4 workers
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:32.989 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 5
2017-02-17 00:28:32.989 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.040 EST [18934] LOG:  LaunchParallelWorkers: registered 4 of 4 workers
2017-02-17 00:28:33.040 EST [18934] STATEMENT:  select length(stringu1) from tenk1 group by length(stringu1);
2017-02-17 00:28:33.093 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 6
2017-02-17 00:28:33.093 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  LaunchParallelWorkers: registered 1 of 4 workers
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.173 EST [18934] LOG:  RegisterDynamicBackgroundWorker: grabbing slot 7
2017-02-17 00:28:33.173 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  LaunchParallelWorkers: registered 2 of 4 workers
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.253 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.253 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.254 EST [18934] LOG:  LaunchParallelWorkers: failed to register 3 of 4 workers
2017-02-17 00:28:33.254 EST [18934] STATEMENT:  select count(*) from tenk1 where (two, four) not in
		(select hundred, thousand from tenk2 where thousand > 100);
2017-02-17 00:28:33.557 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.557 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 4 workers
2017-02-17 00:28:33.557 EST [18934] STATEMENT:  select  count((unique1)) from tenk1 where hundred > 1;
2017-02-17 00:28:33.703 EST [18934] LOG:  RegisterDynamicBackgroundWorker: no free slots of 8
2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.703 EST [18934] LOG:  LaunchParallelWorkers: failed to register 1 of 1 workers
2017-02-17 00:28:33.703 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.704 EST [18934] ERROR:  invalid input syntax for integer: "BAAAAA"
2017-02-17 00:28:33.704 EST [18934] STATEMENT:  select stringu1::int2 from tenk1 where unique1 = 1;
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 7
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 6
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 5
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 4
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 3
2017-02-17 00:28:33.854 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 2
2017-02-17 00:28:33.855 EST [18336] LOG:  ForgetBackgroundWorker: unregistering background worker "parallel worker for PID 18934" slot 1

In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
wait for parallel workers to finish (as its comment suggests is
necessary), so that on not-too-speedy machines the worker slots may all
still be in use when the next command wants some.  This does not seem
like a good thing from a reproducibility or error-detection standpoint.
It is *certainly* not what you would expect from reading any of the
comments in the vicinity.

I don't know if this is a generic logic failure or if there's something
platform-specific about it --- it's suspicious that it's only showing up
on ia64 and hppa architectures.

I also find it curious that nobody had seen this at all until the past
couple days.  I suspect that 51ee6f316 changed the number of parallel
workers demanded by the select_parallel regression test, which is, shall
we say, not what its commit log message would suggest.

			regards, tom lane


Responses

pgsql-hackers by date

Next:From: Ashutosh BapatDate: 2017-02-17 05:55:47
Subject: Re: Documentation improvements for partitioning
Previous:From: Peter GeogheganDate: 2017-02-17 05:50:45
Subject: Re: Partitioning vs ON CONFLICT

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group