Instability of partition_prune regression test results

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Instability of partition_prune regression test results
Date: 2019-09-26 22:25:25
Message-ID: 11952.1569536725@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Every so often the partition_prune test falls over, for example
here, here, and here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-08-15%2021%3A45%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-21%2022%3A19%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-09-11%2018%3A46%3A47

The reason for the failures is quite apparent: we sometimes don't
get as many workers as we hoped for. The test script is not quite
100% naive about that, but it's only designed to filter out the
"loops" counts of parallel scan nodes. As these examples show,
that's utterly inadequate. The "Workers Launched" field is variable
too, obviously, and so are the rows and loops counts for every plan
node up to the Gather.

I experimented with adjusting explain_parallel_append() to filter
more fields, but soon realized that we'd have to filter out basically
everything that makes it useful to run EXPLAIN ANALYZE at all.

Therefore, I think it's time to give up this testing methodology
as a bad idea, and fall back to the time-honored way of running a
plain EXPLAIN and then the actual query, as per the attached patch.

(Note: there's some roughly similar code in select_parallel.sql,
but as far as I could find it fails seldom if at all. Likely that
is because we don't run select_parallel in parallel with other
test scripts. So perhaps an argument could be made to leave
partition_prune.sql alone and just run it by itself. I do not
care for that answer though, as it will make the regression test
suite slower, plus I do not see any argument that this testing method
actually provides any info we don't get the traditional way.)

BTW, another aspect of this test script that could stand to be
nuked from orbit is this method for getting a custom plan:

-- Execute query 5 times to allow choose_custom_plan
-- to start considering a generic plan.
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);

We should drop that in favor of plan_cache_mode = force_custom_plan,
IMO. But I didn't include that change in this patch.

regards, tom lane

Attachment Content-Type Size
change-parallel-append-testing.patch text/x-diff 30.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-09-26 22:33:33 Re: range test for hash index?
Previous Message Mike Palmiotto 2019-09-26 22:03:24 Re: Auxiliary Processes and MyAuxProc