Re: Foreign join pushdown vs EvalPlanQual

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>
Subject: Re: Foreign join pushdown vs EvalPlanQual
Date: 2015-10-19 04:17:56
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80115A11D@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

> On Fri, Oct 16, 2015 at 6:12 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > I think, it is right approach to pretend EPQ doesn't exist if scanrelid==0,
> > because what replaced by these ForeignScan/CustomScan node are local join
> > node like NestLoop. They don't have its own EPQ slot, but constructs joined-
> > tuple based on the underlying scan-tuple originally stored within EPQ slots.
>
> I think you've got that backwards. The fact that they don't have
> their own EPQ slot is the problem we need to solve. When an EPQ
> recheck happens, we rescan every relation in the query. Each relation
> needs to return 0 or 1 tuples. If it returns a tuple, the tuple it
> returns must be either the same tuple it previously returned, or an
> updated version of that tuple. But "the tuple it previously returned"
> does not necessarily mean the tuple it returned most recently. It
> means the tuple that it returned which, when passed through the rest
> of the plan, contributed to generate the result tuple that is being
> rechecked.
>
Yes, that is why ctid or a whole-row var (with early row locking), or
some kind of row ID (with late row locking), is required to fill the EPQ
slots of the base relations.
I understand that the most recently returned tuple is not the answer
here (e.g., when the ForeignScan is located under a MergeJoin).
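
To make the per-relation behavior concrete, here is a minimal sketch of
what a single base relation's EPQ slot provides during one recheck: it
holds the tuple and hands it back exactly once, provided the quals still
pass. EpqSlot, qual_passes and epq_next are hypothetical names for
illustration only, not executor structures or functions, and the qual
(value > 0) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one base relation's EPQ slot. */
typedef struct EpqSlot
{
    bool tuple_set;   /* a tuple was stored for this recheck */
    bool scan_done;   /* the stored tuple was already handed back */
    int  value;       /* stand-in for the stored tuple */
} EpqSlot;

/* Stand-in for the scan quals; the condition is an assumption. */
bool
qual_passes(int value)
{
    return value > 0;
}

/*
 * First call: return the stored tuple if it still passes the quals.
 * Every later call in the same recheck: return NULL ("empty slot").
 */
const int *
epq_next(EpqSlot *slot)
{
    assert(slot != NULL);

    if (!slot->tuple_set || slot->scan_done)
        return NULL;

    slot->scan_done = true;
    return qual_passes(slot->value) ? &slot->value : NULL;
}
```

The scan_done flag is what keeps the same tuple from being returned over
and over within one recheck.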

> Now, if you don't have an EPQ slot, how are you going to do this?
> When the EPQ machinery engages, you need to somehow get the tuple you
> previously returned stored someplace. And the first time thereafter
> that you get called by ExecProcNode, you need to return that tuple,
> provided that it still passes the quals. The second time you get
> called, and any subsequent times, you need to return an empty slot.
> The EPQ slot is well-suited to this task. It's got a TupleTableSlot
> to store the tuple you need to return, and it's got a flag indicating
> whether you've already returned that tuple. So you're good.
>
> But with Etsuro Fujita's patch, and I think what you have proposed has
> been similar, how are you going to do it? The proposal is to call the
> recheck method and hope for the best, but what is the recheck method
> going to do? Where is it going to get the previously-returned tuple?
> How will it know if it has already returned it during the lifetime of
> this EPQ check? Offhand, it looks to me like, at least in some
> circumstances, you're probably going to return whatever tuple you
> returned most recently (which has a good chance of being the right
> one, but not necessarily) over and over again. That's not going to
> fly.
>
I think what the recheck method should do, instead of "hope for the
best", is the following.

1. Fetch the EPQ slot of every base relation involved in this join.
In the ForeignScan case, all the required base-relation tuples should
already be filled in, because they are fetched beforehand by a whole-row
var with early row locking, or by RefetchForeignRow with late row locking.
In the CustomScan case, it can call ExecProcNode() to generate the
first tuple if it does not exist yet.
Either way, I assume all the component tuples of this join can be
fetched from the existing EPQ slots, because they are owned by the base
relations.

2. The recheck callback fills ss_ScanTupleSlot according to the
fdw_scan_tlist or custom_scan_tlist. The callback knows best how to
reconstruct the joined tuple from the base relations' tuples fetched
in step 1.
For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
the callback picks 't1.a' and 't1.b' from the tuple fetched from the
EPQ slot of t1 and puts these values into the 1st and 2nd columns.
Likewise, it picks 't2.x' from the tuple fetched from the EPQ slot of
t2 and puts this value into the 3rd column. The same applies to 't3'.
At this point, ss_ScanTupleSlot is filled with the expected fields,
as if the join clauses were satisfied.

3. The recheck callback also checks the pushed-down qualifiers of the
base relations. Because the expression nodes kept in fdw_exprs or
custom_exprs are set up in setrefs.c to reference ss_ScanTupleSlot,
it is more reasonable to run ExecQual after step 2.
If any qualifier of a base relation evaluates to false, the recheck
callback returns an empty slot.

4. The recheck callback also checks the join clauses that join the
underlying base relations. For the same reason as in step 3, it is more
reasonable to run ExecQual after step 2.
If any join clause evaluates to false, the recheck returns an empty
slot.
Otherwise, it returns ss_ScanTupleSlot, and ExecScan handles the
remaining work.
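
Steps 1 to 4 can be sketched as below. This is only a simplified model,
not executor code: EpqSlot, TlistEntry, scan_tlist, base_quals_pass,
join_clauses_pass and recheck_join are hypothetical names, all columns
are plain ints, and the qual (t3.s > 0) and the join clauses
(t1.a = t2.x AND t1.b = t3.s) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BASE_RELS   3   /* t1, t2, t3 */
#define NUM_JOINED_COLS 4   /* (t1.a, t1.b, t2.x, t3.s) */

/* Hypothetical stand-in for the EPQ slot of one base relation. */
typedef struct EpqSlot
{
    bool valid;     /* was a tuple stored for this recheck? */
    int  cols[2];   /* column values of the base-relation tuple */
} EpqSlot;

/* One scan-tlist entry: which base relation and column feed it. */
typedef struct TlistEntry
{
    int rel;        /* index into the EPQ slot array */
    int col;        /* column index within that relation's tuple */
} TlistEntry;

/* Layout of the joined tuple: (t1.a, t1.b, t2.x, t3.s). */
const TlistEntry scan_tlist[NUM_JOINED_COLS] = {
    {0, 0}, {0, 1}, {1, 0}, {2, 0}
};

/* Step 3: pushed-down base-relation qual (assumed: t3.s > 0). */
bool
base_quals_pass(const int *joined)
{
    return joined[3] > 0;
}

/* Step 4: join clauses (assumed: t1.a = t2.x AND t1.b = t3.s). */
bool
join_clauses_pass(const int *joined)
{
    return joined[0] == joined[2] && joined[1] == joined[3];
}

/*
 * Returns true and fills joined[] if the rebuilt tuple survives the
 * recheck; returns false for the "empty slot" case.
 */
bool
recheck_join(const EpqSlot *slots, int *joined)
{
    assert(slots != NULL && joined != NULL);

    /* Step 1: every base relation must have its EPQ tuple. */
    for (int i = 0; i < NUM_BASE_RELS; i++)
        if (!slots[i].valid)
            return false;

    /* Step 2: rebuild the joined tuple according to the scan tlist. */
    for (int i = 0; i < NUM_JOINED_COLS; i++)
        joined[i] = slots[scan_tlist[i].rel].cols[scan_tlist[i].col];

    /* Steps 3 and 4: both run against the rebuilt tuple. */
    return base_quals_pass(joined) && join_clauses_pass(joined);
}
```

Both checks read the rebuilt tuple rather than the base tuples, which
mirrors the point that the expressions reference the scan tuple and so
have to run after step 2.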

Even though Fujita-san's patch implements step 2 through step 4 using
an alternative local plan, with no other option, it rests on a similar
concept:
- The EPQ slots contain the base-relation tuples that contributed to the join.
- The FDW/CSP knows best how to construct the joined tuple.
- The joined tuple is constructed on the fly, not kept in a particular EPQ slot.

> The bottom line is that a foreign scan that is a pushed-down join is
> still a *scan*, and every already-existing scan type has an EPQ slot
> *for a reason*. They *need* it in order to deliver the correct
> behavior. And foreign scans and custom scans need it to, and for the
> same reason.
>
That is probably where our views on the solution diverge.
Even though ForeignScan/CustomScan is categorized as a scan node from
the standpoint of the core backend, it is expected to take
responsibility for the join in addition to the scan of the base
relations.
This dual role gives ForeignScan/CustomScan both the capability and
the responsibility to handle multiple EPQ slots for join rechecks.

Consider that an existing scan node is associated with one particular
EPQ slot because it has only one role: to scan one particular base
relation. What, then, is the natural approach when a scan node actually
has multiple roles?

> On Fri, Oct 16, 2015 at 7:48 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > My opinion is, simply, ForeignScan/CustomScan with scanrelid==0 takes
> > over the responsibility of EPQ recheck of entire join sub-tree that is
> > replaced by the ForeignScan/CustomScan node.
> > If ForeignScan run a remote join on foreign tables: A and B, it shall
> > apply both of scan-quals and join-clause towards the tuples kept in
> > the EPQ slots, in some fashion depending on FDW implementation.
>
> And my opinion, as I said before, is that's completely wrong. The
> ForeignScan which represents a pushed-down join is a *scan*. In
> general, scans have one EPQ slot, and that is the right number. This
> pushed-down join scan, though, is in a state of confusion. The code
> that populates the EPQ slots thinks it's got multiple slots, one per
> underlying relation. Meanwhile, the code that reads data back out of
> those slots thinks it doesn't have any slots at all. Both of those
> pieces of code are wrong. This foreign scan, like any other scan,
> should use ONE slot.
>
> Both you and Etsuro Fujita are proposing to fix this problem by
> somehow making it the FDW's problem to reconstruct the tuple
> previously produced by the join from whole-row images of the baserels.
> But that's not looking back far enough: why are we asking for
> whole-row images of the baserels when what we really want is a
> whole-row image of the output of the join? The output of the join is
> what we need to re-return.
>
Yes, the output of the join is exactly what we need to re-return.
On the other hand, the joined tuple image depends on the latest
images of the base-relation tuples from which it is constructed.

Once one of the base relations' tuples is re-fetched and found to be
updated, that affects both the contents of the joined tuple and its
visibility.
This means that, more or less, we need the capability to reconstruct
the joined tuple from the base relations again, in addition to the
rechecks.
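
A small worked example of this dependency, under assumed names: T1Row,
T2Row, JoinedRow and build_joined_row are hypothetical, as are the join
clause (t1.id = t2.id) and the qual (t2.x > 10). It only illustrates
that a joined image saved before the refetch can differ from the one
rebuilt afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical base-relation rows and joined row. */
typedef struct T1Row { int id; int b; } T1Row;
typedef struct T2Row { int id; int x; } T2Row;
typedef struct JoinedRow { int t1_b; int t2_x; } JoinedRow;

/*
 * Rebuild the joined row from the current base rows.
 * Assumed join clause: t1.id = t2.id.  Assumed qual: t2.x > 10.
 * Returns false when the pair no longer produces a join result.
 */
bool
build_joined_row(const T1Row *t1, const T2Row *t2, JoinedRow *out)
{
    assert(t1 != NULL && t2 != NULL && out != NULL);

    if (t1->id != t2->id)
        return false;   /* join clause fails */
    if (t2->x <= 10)
        return false;   /* base-relation qual fails */

    out->t1_b = t1->b;
    out->t2_x = t2->x;
    return true;
}
```

With t1 = (1, 100) and t2 = (1, 20) the joined row is (100, 20). If the
refetched t2 has x = 30, the rebuilt row becomes (100, 30) while a saved
copy would still say 20; if it has x = 5, the pair no longer joins at all.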

Therefore, I concluded that on-the-fly reconstruction of the joined
tuple by the FDW/CSP is reasonably implementable and less invasive
than the other approaches.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
