From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "'amitlangote09(at)gmail(dot)com'" <amitlangote09(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Test of a partition with an incomplete detach has a timing issue |
Date: | 2021-05-25 15:32:38 |
Message-ID: | 20210525153238.GA17744@alvherre.pgsql |
Lists: | pgsql-hackers |
So I had a hard time reproducing the problem, until I realized that I
needed to limit the server to use only one CPU, and in addition run some
other stuff concurrently in the same server in order to keep it busy.
With that, I see about one failure every 10 runs.
So I start the server as "numactl -C0 postmaster", run an infinite loop
doing "make -C src/test/regress installcheck-parallel" in another terminal,
and in a third terminal run this:
while [ $? == 0 ]; do ../../../src/test/isolation/pg_isolation_regress --inputdir=/pgsql/source/master/src/test/isolation --outputdir=output_iso --bindir='/pgsql/install/master/bin' detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 detach-partition-concurrently-3 ; done
With the test unpatched, I get about one failure in the set.
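For anyone wanting to count how many runs it takes to hit the failure, the loop above could be wrapped roughly as follows. This is only a sketch: run_until_failure is a hypothetical helper name, not something in the tree, and the paths in the usage comment are the ones from my own setup above.

```shell
# Hypothetical helper: run a command repeatedly until it fails or a run
# limit is reached. Prints the number of the first failing run, or 0 if
# every run passed. Returns non-zero if a failure was seen.
run_until_failure() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo 0
    return 0
}

# Usage (paths as in the loop above, adjust for your build):
# run_until_failure 100 ../../../src/test/isolation/pg_isolation_regress \
#     --inputdir=/pgsql/source/master/src/test/isolation \
#     --outputdir=output_iso --bindir='/pgsql/install/master/bin' \
#     detach-partition-concurrently-3
```

Unlike the "while [ $? == 0 ]" form, this stops after a bounded number of runs and reports which run failed.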
On 2021-May-24, Noah Misch wrote:
> What if we had a standard that the step after the cancel shall send a query to
> the backend that just received the cancel? Something like:
Hmm ... I don't understand why this helps, but it drastically reduces
the failure probability. Here's a complete patch. I got about one
failure in 1000 runs instead of 1 in 10. The new failure looks
like this:
diff -U3 /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /pgsql/source/master/src/test/isolation/expected/detach-partition-concurrently-3.out 2021-05-25 11:12:42.333987835 -0400
+++ /home/alvherre/Code/pgsql-build/master/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out 2021-05-25 11:19:03.714947775 -0400
@@ -13,7 +13,7 @@
t
step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR: canceling statement due to user request
+ERROR: canceling statement due to user request
step s2noop: UNLISTEN noop;
step s1c: COMMIT;
step s1describe: SELECT 'd3_listp' AS root, * FROM pg_partition_tree('d3_listp')
I find this a bit weird and I'm wondering if it could be an
isolationtester bug -- why is it not attributing the error message to
any steps?
The problem disappears completely if I add a sleep to the cancel query:
step "s1cancel" { SELECT pg_cancel_backend(pid), pg_sleep(0.01) FROM d3_pid; }
I suppose a 0.01 second sleep is not going to be sufficient to eliminate
the problem on slower buildfarm animals, but I hesitate to propose a much
longer sleep because this test has 18 permutations, so even a one second
sleep adds quite a lot of (mostly useless) test runtime.
--
Álvaro Herrera 39°49'30"S 73°17'W