From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append subplan order instability on aye-aye
Date: 2019-07-16 01:12:32
Message-ID: 4174.1563239552@sss.pgh.pa.us
Lists: pgsql-hackers
I wrote:
> So that data-collection patch has been in place for nearly 2 months
> (since 2019-05-21), and in that time we've seen a grand total of
> no repeats of the original problem, as far as I've seen.
Oh ... wait a minute. I decided to go scrape the buildfarm logs to
confirm my impression that there were no matching failures, and darn
if I didn't find one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-06-04%2021%3A00%3A22
For the archives' sake, that's a pg_upgradeCheck failure, and here
are the regression diffs:
=========================== regression.diffs ================
diff -w -U3 c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/select_parallel.out
--- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out 2019-05-21 14:00:23 -0400
+++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/select_parallel.out 2019-06-04 17:42:27 -0400
@@ -21,12 +21,12 @@
Workers Planned: 3
-> Partial Aggregate
-> Parallel Append
+ -> Parallel Seq Scan on a_star
-> Parallel Seq Scan on d_star
-> Parallel Seq Scan on f_star
-> Parallel Seq Scan on e_star
-> Parallel Seq Scan on b_star
-> Parallel Seq Scan on c_star
- -> Parallel Seq Scan on a_star
(11 rows)
select round(avg(aa)), sum(aa) from a_star a1;
@@ -49,10 +49,10 @@
-> Parallel Append
-> Seq Scan on d_star
-> Seq Scan on c_star
+ -> Parallel Seq Scan on a_star
-> Parallel Seq Scan on f_star
-> Parallel Seq Scan on e_star
-> Parallel Seq Scan on b_star
- -> Parallel Seq Scan on a_star
(11 rows)
select round(avg(aa)), sum(aa) from a_star a2;
@@ -75,12 +75,12 @@
Workers Planned: 3
-> Partial Aggregate
-> Parallel Append
+ -> Seq Scan on a_star
-> Seq Scan on d_star
-> Seq Scan on f_star
-> Seq Scan on e_star
-> Seq Scan on b_star
-> Seq Scan on c_star
- -> Seq Scan on a_star
(11 rows)
select round(avg(aa)), sum(aa) from a_star a3;
@@ -95,7 +95,7 @@
where relname like '__star' order by relname;
relname | relpages | reltuples
---------+----------+-----------
- a_star | 1 | 3
+ a_star | 0 | 0
b_star | 1 | 4
c_star | 1 | 4
d_star | 1 | 16
diff -w -U3 c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/stats.out c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/stats.out
--- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/stats.out 2019-05-21 14:00:23 -0400
+++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/stats.out 2019-06-04 17:43:06 -0400
@@ -205,7 +205,7 @@
where relname like '__star' order by relname;
relname | relpages | reltuples
---------+----------+-----------
- a_star | 1 | 3
+ a_star | 0 | 0
b_star | 1 | 4
c_star | 1 | 4
d_star | 1 | 16
This plan shape change matches some, though by no means all, of the
previous failures. And we can now see why the planner did that: a_star
has a smaller recorded size than the other tables in the query.
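For reference, the relpages/reltuples numbers being diffed above come from the regression test's catalog probe, which (judging from the WHERE clause visible in the diff) is essentially:

```sql
-- the catalog query whose output is diffed above; on a healthy run,
-- a freshly-vacuumed a_star reports 1 page / 3 tuples, not 0 / 0
select relname, relpages, reltuples
from pg_class
where relname like '__star'
order by relname;
```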
So what happened there? There's no diff in the pg_stat_all_tables
query, which proves that a vacuum on a_star did happen, since it
transmitted a vacuum_count increment to the stats collector.
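The pg_stat_all_tables check referred to here can be sketched as follows (the exact column list in the test may differ; vacuum_count and autovacuum_count are the standard stats-collector counters):

```sql
-- a VACUUM that ran to completion bumps vacuum_count in the stats
-- collector's view, even if its pg_class update somehow didn't stick
select relname, vacuum_count, autovacuum_count
from pg_stat_all_tables
where relname like '__star'
order by relname;
```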
It seems like there are two possible theories:
(1) The vacuum for some reason saw the table's size as zero
(whereupon it'd read no blocks and count no tuples).
(2) The vacuum's update of the pg_class row failed to "take".
Theory (2) seems a bit more plausible, but still very unsettling.
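Had we caught the failure live, one way to tell the two theories apart would be to compare the on-disk relation size against what VACUUM recorded. This is a hypothetical diagnostic, not part of the test suite:

```sql
-- theory (1) predicts the file itself looks empty; theory (2) predicts
-- pages exist on disk while pg_class still claims relpages = 0
select c.relname,
       c.relpages,
       c.reltuples,
       pg_relation_size(c.oid) / current_setting('block_size')::int
         as pages_on_disk
from pg_class c
where c.relname = 'a_star';
```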
The similar failures that this result doesn't exactly match
all look, in the light of this data, like cases where one of the
"X_star" tables unexpectedly moved to the top of the parallel plan, which
we can now hypothesize means that that table had zero relpages/
reltuples after supposedly being vacuumed. So it's not only
a_star that's got the issue, which lets out my half-formed theory
that being the topmost parent of the inheritance hierarchy has
something to do with it. But I bet that these tables forming
an inheritance hierarchy (with multiple inheritance even) does
have something to do with it somehow, because if this were a
generic VACUUM bug surely we'd be seeing it elsewhere.
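For anyone not fluent in the regression schema: the "_star" tables form an inheritance tree roughly like the simplified sketch below (the real DDL in the regression suite adds more columns and loads data at each level):

```sql
-- simplified sketch of the regression-test hierarchy; note that d_star
-- uses multiple inheritance, which is the wrinkle suspected above
create table a_star (class char, a int4);
create table b_star (b text)    inherits (a_star);
create table c_star (c name)    inherits (a_star);
create table d_star (d float8)  inherits (b_star, c_star);
create table e_star (e int2)    inherits (c_star);
create table f_star (f polygon) inherits (e_star);
```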
regards, tom lane