Re: postgres_fdw vs. force_parallel_mode on ppc

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgres_fdw vs. force_parallel_mode on ppc
Date: 2016-03-04 14:02:19
Message-ID: CA+Tgmoa+fjuesD1fC5Q2m_HQ1R_=8Dd7_YyzB8V_L6h=JMWD7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 4, 2016 at 12:46 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> A couple of my colleagues have been looking into this. It's not
>> entirely clear to me what's going on here yet, but it looks like the
>> stats get there if you wait long enough. Rahila Syed was able to
>> reproduce the problem and says that the attached patch fixes it. But
>> I don't quite understand why this should fix it.
>
> I don't like this patch much. While the new function is not bad in
> itself, it looks really weird to call it immediately after the other
> wait function. And the reason for that, AFAICT, is that somebody dropped
> the entire "truncation stats" test sequence into the middle of unrelated
> tests, evidently in the vain hope that that way they could piggyback
> on the existing wait. Which these failures are showing us is wrong.
>
> I think we should move all the inserted logic down so that it's not in the
> middle of unrelated testing.

Sure. If you have an idea what the right thing to do is, please go
ahead. I actually don't have a clear idea what's going on here. I
guess it's that the wait_for_stats() guarantees that the stats message
from the index insertion has been received but the status messages
from the "trunc" tables might not have gotten there yet. I thought
maybe that works without parallelism because all of those messages are
coming from the same backend, and therefore if you have the later one
you must have all of the earlier ones, too. But if you're running
some of the queries in parallel workers then it's possible for a stats
message from a worker run later to arrive later.

But it's not that after all, because when I run the regression tests
with the pg_sleep removed, I get this:

*** /Users/rhaas/pgsql/src/test/regress/expected/stats.out
2016-03-04 08:55:33.000000000 -0500
--- /Users/rhaas/pgsql/src/test/regress/results/stats.out
2016-03-04 09:00:29.000000000 -0500
***************
*** 127,140 ****
1
(1 row)

- -- force the rate-limiting logic in pgstat_report_tabstat() to time out
- -- and send a message
- SELECT pg_sleep(1.0);
- pg_sleep
- ----------
-
- (1 row)
-
-- wait for stats collector to update
SELECT wait_for_stats();
wait_for_stats
--- 127,132 ----
***************
*** 148,158 ****
WHERE relname like 'trunc_stats_test%' order by relname;
relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup
| n_dead_tup
-------------------+-----------+-----------+-----------+------------+------------
! trunc_stats_test | 3 | 0 | 0 | 0
| 0
! trunc_stats_test1 | 4 | 2 | 1 | 1
| 0
! trunc_stats_test2 | 1 | 0 | 0 | 1
| 0
! trunc_stats_test3 | 4 | 0 | 0 | 2
| 2
! trunc_stats_test4 | 2 | 0 | 0 | 0
| 2
(5 rows)

SELECT st.seq_scan >= pr.seq_scan + 1,
--- 140,150 ----
WHERE relname like 'trunc_stats_test%' order by relname;
relname | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup
| n_dead_tup
-------------------+-----------+-----------+-----------+------------+------------
! trunc_stats_test | 0 | 0 | 0 | 0
| 0
! trunc_stats_test1 | 0 | 0 | 0 | 0
| 0
! trunc_stats_test2 | 0 | 0 | 0 | 0
| 0
! trunc_stats_test3 | 0 | 0 | 0 | 0
| 0
! trunc_stats_test4 | 0 | 0 | 0 | 0
| 0
(5 rows)

SELECT st.seq_scan >= pr.seq_scan + 1,
***************
*** 163,169 ****
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
! t | t | t | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
--- 155,161 ----
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
! f | f | f | f
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
***************
*** 172,178 ****
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column?
----------+----------
! t | t
(1 row)

SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
--- 164,170 ----
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column?
----------+----------
! t | f
(1 row)

SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer

That looks suspiciously similar to the failure we're getting with the
force_parallel_mode testing, but I'm still confused.

>> BTW, this comment is obsolete:
>
>> -- force the rate-limiting logic in pgstat_report_tabstat() to time out
>> -- and send a message
>> SELECT pg_sleep(1.0);
>> pg_sleep
>> ----------
>
>> (1 row)
>
>> That function was renamed in commit
>> 93c701edc6c6f065cd25f77f63ab31aff085d6ac, but apparently Tom forgot to
>> grep for other uses of that identifier name.
>
> Duh :-(. Actually, do we need that sleep at all anymore? Seems like
> wait_for_stats ought to cover it.

Yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-03-04 14:23:54 Re: VS 2015 support in src/tools/msvc
Previous Message Greg Stark 2016-03-04 14:01:08 Re: WIP: Upper planner pathification