Re: Parallel Append subplan order instability on aye-aye

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append subplan order instability on aye-aye
Date: 2019-07-16 01:12:32
Message-ID: 4174.1563239552@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> So that data-collection patch has been in place for nearly 2 months
> (since 2019-05-21), and in that time we've seen a grand total of
> no repeats of the original problem, as far as I've seen.

Oh ... wait a minute. I decided to go scrape the buildfarm logs to
confirm my impression that there were no matching failures, and darn
if I didn't find one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-06-04%2021%3A00%3A22

For the archives' sake, that's a pg_upgradeCheck failure, and here
are the regression diffs:

=========================== regression.diffs ================
diff -w -U3 c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/select_parallel.out
--- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out 2019-05-21 14:00:23 -0400
+++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/select_parallel.out 2019-06-04 17:42:27 -0400
@@ -21,12 +21,12 @@
Workers Planned: 3
-> Partial Aggregate
-> Parallel Append
+ -> Parallel Seq Scan on a_star
-> Parallel Seq Scan on d_star
-> Parallel Seq Scan on f_star
-> Parallel Seq Scan on e_star
-> Parallel Seq Scan on b_star
-> Parallel Seq Scan on c_star
- -> Parallel Seq Scan on a_star
(11 rows)

select round(avg(aa)), sum(aa) from a_star a1;
@@ -49,10 +49,10 @@
-> Parallel Append
-> Seq Scan on d_star
-> Seq Scan on c_star
+ -> Parallel Seq Scan on a_star
-> Parallel Seq Scan on f_star
-> Parallel Seq Scan on e_star
-> Parallel Seq Scan on b_star
- -> Parallel Seq Scan on a_star
(11 rows)

select round(avg(aa)), sum(aa) from a_star a2;
@@ -75,12 +75,12 @@
Workers Planned: 3
-> Partial Aggregate
-> Parallel Append
+ -> Seq Scan on a_star
-> Seq Scan on d_star
-> Seq Scan on f_star
-> Seq Scan on e_star
-> Seq Scan on b_star
-> Seq Scan on c_star
- -> Seq Scan on a_star
(11 rows)

select round(avg(aa)), sum(aa) from a_star a3;
@@ -95,7 +95,7 @@
where relname like '__star' order by relname;
relname | relpages | reltuples
---------+----------+-----------
- a_star | 1 | 3
+ a_star | 0 | 0
b_star | 1 | 4
c_star | 1 | 4
d_star | 1 | 16
diff -w -U3 c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/stats.out c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/stats.out
--- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/stats.out 2019-05-21 14:00:23 -0400
+++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/regress/results/stats.out 2019-06-04 17:43:06 -0400
@@ -205,7 +205,7 @@
where relname like '__star' order by relname;
relname | relpages | reltuples
---------+----------+-----------
- a_star | 1 | 3
+ a_star | 0 | 0
b_star | 1 | 4
c_star | 1 | 4
d_star | 1 | 16

This plan shape change matches some, though by no means all, of the
previous failures. And we can now see why the planner did that: a_star
has a smaller recorded size than the other tables in the query.
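
For reference, the pg_class check whose output appears in the diff
above is, as best I can reconstruct it from that output, roughly this
(the column list matches the diff; I haven't copied it verbatim from
the test file):

-- Reconstruction of the check shown in the diff; on jacana, a_star
-- came back with relpages = 0 and reltuples = 0.
select relname, relpages, reltuples
from pg_class
where relname like '__star'
order by relname;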

So what happened there? There's no diff in the pg_stat_all_tables
query, which proves that a vacuum on a_star did happen, since it
transmitted a vacuum_count increment to the stats collector.
It seems like there are two possible theories:

(1) The vacuum for some reason saw the table's size as zero
(whereupon it'd read no blocks and count no tuples).
(2) The vacuum's update of the pg_class row failed to "take".

Theory (2) seems a bit more plausible, but still very unsettling.
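
A sketch of the sort of cross-check that would help tell these apart
(this is illustrative, not the regression test's own query; the
columns are the standard pg_class / pg_stat_all_tables ones): if
vacuum_count has advanced while pg_class still reports zeroes, that
points at theory (2).

-- Illustrative cross-check: vacuum_count advancing alongside
-- relpages = 0 / reltuples = 0 would suggest the pg_class update
-- failed to "take".
select c.relname, c.relpages, c.reltuples,
       s.vacuum_count, s.last_vacuum
from pg_class c
join pg_stat_all_tables s on s.relid = c.oid
where c.relname like '__star'
order by c.relname;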

The earlier failures that this result doesn't exactly match all
look, in light of this data, like cases where some one of the "X_star"
tables unexpectedly moved to the top of the parallel plan, which
we can now hypothesize means that that table had zero relpages/
reltuples after supposedly being vacuumed. So it's not only
a_star that's got the issue, which rules out my half-formed theory
that being the topmost parent of the inheritance hierarchy has
something to do with it. But I bet that the fact that these tables
form an inheritance hierarchy (with multiple inheritance, even) has
something to do with it somehow, because if this were a generic
VACUUM bug we'd surely be seeing it elsewhere.
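
For anyone who wants to poke at the inheritance angle, the hierarchy
among the *_star tables can be listed from pg_inherits with a generic
catalog query along these lines (again, not something taken from the
regression tests):

-- Show which *_star tables inherit from which; inhseqno distinguishes
-- multiple parents of the same child.
select c.relname as child, p.relname as parent, i.inhseqno
from pg_inherits i
join pg_class c on c.oid = i.inhrelid
join pg_class p on p.oid = i.inhparent
where c.relname like '__star'
order by child, i.inhseqno;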

regards, tom lane
