Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly
Date: 2018-03-05 19:19:12
Message-ID: 25502.1520277552@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> The thing that I find curious, now that we've shut off autovacuum
> altogether on those tables, is that we *still* aren't getting stable
> results. How can that be?

I spent some time trying to figure out what's going on here. I've still
not been able to replicate the failure on any of my machines, but I have
learned a thing or two.

On the first query that's still failing on rhinoceros (and has also been
seen to fail on other machines, before the last round of changes), which
is at line 1142 in postgres_fdw.sql in current HEAD:

EXPLAIN (verbose, costs off)
UPDATE ft2 SET c3 = 'baz'
FROM ft4 INNER JOIN ft5 ON (ft4.c1 = ft5.c1)
WHERE ft2.c1 > 2000 AND ft2.c2 === ft4.c1
RETURNING ft2.*, ft4.*, ft5.*; -- can't be pushed down

(call this Q1), we are expecting to get a plan of the shape of

   ->  Nested Loop
         ->  Foreign Scan on public.ft2
               Remote SQL: SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 2000)) FOR UPDATE
         ->  Foreign Scan
               Relations: (public.ft4) INNER JOIN (public.ft5)

Now, there is no possible way that the planner would pick that plan if its
estimate of the number of rows out of the ft2 scan was more than 1.
Re-executing the foreign join would have high enough overhead to push the
plan to some other shape; in fact, probably to the shape we see actually
chosen on the failing machines:

   ->  Nested Loop
         ->  Foreign Scan
               Relations: (public.ft4) INNER JOIN (public.ft5)
         ->  Materialize
               ->  Foreign Scan on public.ft2
                     Remote SQL: SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 2000)) FOR UPDATE

I instrumented costsize.c, and determined that the estimated size of
"S 1"."T 1" WHERE (("C 1" > 2000)), before the clamp_row_est() rounding,
is only 0.1159 rows on my machine. So there's no way to explain the plan
change as the result of small platform-specific roundoff differences ---
we need something more than a 12X change in the selectivity estimate
before the plan shape would change like this. There are a bunch of things
going on under the hood, such as that the table's raw rowcount estimate
gets scaled up from the original 1000 rows because it's now got more pages
than before, but none come close to explaining 12X.
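
(As a poor man's cross-check that doesn't require instrumenting anything,
one can just EXPLAIN the underlying scan directly; a sketch, and it only
shows the post-clamp value, of course, not the 0.1159 fraction:

EXPLAIN SELECT "C 1" FROM "S 1"."T 1" WHERE "C 1" > 2000;

Given the figure above, that ought to report rows=1 once clamp_row_est()
has rounded the fraction up.)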

However, the planner is working with quite old statistics, dating back to
the manual ANALYZE at postgres_fdw.sql line 93. If I stick in another
manual ANALYZE just before Q1, I get exactly the same plan change reported
by rhinoceros. (And underneath, the rowcount estimate for ft2 has gone
from 1 row to 8 rows, which is much closer to the true value of 10 rows,
so the plan change is not surprising.) What's more, doing this also
reproduces the one other plan change seen later in rhinoceros' output.
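
(In script form the experiment is roughly this; a sketch, assuming the
line-93 ANALYZE is the one on "S 1"."T 1":

ANALYZE "S 1"."T 1";   -- refresh the stats immediately before Q1
EXPLAIN (verbose, costs off)
UPDATE ft2 SET c3 = 'baz'
FROM ft4 INNER JOIN ft5 ON (ft4.c1 = ft5.c1)
WHERE ft2.c1 > 2000 AND ft2.c2 === ft4.c1
RETURNING ft2.*, ft4.*, ft5.*;

With that in place, the plan flips to the Materialize shape shown above.)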

It is, therefore, very hard to avoid the conclusion that something is
causing an ANALYZE to happen while the script runs, despite our fooling
about with the table's reloptions. I'm not sure that that something is
autovacuum. A platform-specific bug in reloptions handling doesn't seem
out of the question, but poking around in the code didn't spot anything
obvious.
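
(One cheap cross-check, sketched here on the assumption that the
reloptions in question are the per-table autovacuum settings on
"S 1"."T 1": just dump what's actually stored, e.g.

SELECT c.relname, c.reloptions
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'S 1' AND c.relname = 'T 1';

If the expected autovacuum settings show up there, a reloptions-storage
bug looks that much less likely.)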

Joe, I wonder if you could add "log_autovacuum_min_duration = 0" to
rhinoceros' extra_config options, temporarily? Correlating that log
output with the log_statement output from the test proper would let
us confirm or deny whether it's autovacuum.

Another thing I'd like to do is temporarily add

select relpages, reltuples from pg_class where relname = 'T 1';

to the test script, both just after the manual ANALYZE and just before Q1.
If we see a change between those reports on any of the affected machines,
we'll know that *something* is changing the stats. Now the problem with
doing that is that the expected value of relpages is platform-dependent
(I see 11 on 64-bit and 10 on 32-bit on my machines). We can work around
that, perhaps by capturing the initial value in a temp table and printing
only the delta, but I'm not sure if it's worth the effort as opposed to
just letting it fail on 32-bit critters for a day or two.
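
(For concreteness, the delta version might look about like this; just a
sketch, with an invented temp-table name:

-- just after the manual ANALYZE:
CREATE TEMP TABLE analyzed_stats AS
SELECT relpages, reltuples FROM pg_class WHERE relname = 'T 1';

-- just before Q1:
SELECT c.relpages - a.relpages AS relpages_delta,
       c.reltuples - a.reltuples AS reltuples_delta
FROM pg_class c, analyzed_stats a
WHERE c.relname = 'T 1';

Both deltas should print as zero on every platform if nothing is touching
the stats behind our backs.)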

regards, tom lane
