Re: Asynchronous Append on postgres_fdw nodes.

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: etsuro(dot)fujita(at)gmail(dot)com
Cc: pryzby(at)telsasoft(dot)com, a(dot)lepikhov(at)postgrespro(dot)ru, movead(dot)li(at)highgo(dot)ca, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Asynchronous Append on postgres_fdw nodes.
Date: 2021-01-15 07:54:33
Message-ID: 20210115.165433.1389870179026901348.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Sat, 19 Dec 2020 17:55:22 +0900, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote in
> On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote in
> > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> > > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > > The reason for
> > > > the early fetching is letting fdw send the next request as early as
> > > > possible. (However, I didn't measure the effect of the
> > > > nodeAppend-level prefetching.)
> > >
> > > I agree that that would lead to an improved efficiency in some cases,
> > > but I still think that that would be useless in some other cases like
> > > SELECT * FROM sharded_table LIMIT 1. Also, I think the situation
> > > would get worse if we support Append on top of joins or aggregates
> > > over ForeignScans, which would be more expensive to perform than these
> > > ForeignScans.
> >
> > I'm not sure which gain we weigh on, but if doing "LIMIT 1" on Append
> > for many times is more common than fetching all or "LIMIT <many
> > multiples of fetch_size>", that discussion would be convincing... Is
> > it really the case?
>
> I don't have a clear answer for that... Performance in the case you
> mentioned would be improved by async execution without prefetching by
> Append, so it seemed reasonable to me to remove that prefetching to
> avoid unnecessary overheads in the case I mentioned. BUT: I started
> to think my proposal, which needs an additional FDW callback routine
> (ie, ForeignAsyncBegin()), might be a bad idea, because it would
> increase the burden on FDW authors.

I agree on the point of developers' burden.

> > > If we do prefetching, I think it would be better that it’s the
> > > responsibility of the FDW to do prefetching, and I think that that
> > > could be done by letting the FDW to start another data fetch,
> > > independently of the core, in the ForeignAsyncNotify callback routine,
> >
> > FDW does prefetching (if it means sending request to remote) in my
> > patch, so I agree to that. It suspect that you were intended to say
> > the opposite. The core (ExecAppendAsyncGetNext()) controls
> > prefetching in your patch.
>
> No. That function just tries to retrieve a tuple from any of the
> ready subplans (ie, subplans marked as as_needrequest).

Mmm. I meant that the function explicitly calls
ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
(if needed). Conversely if the function dosn't call
ExecAppendAsyncRequsest, the next request to remote doesn't
happen. That is, after the tuple buffer of FDW-side is exhausted, the
next request doesn't happen until executor requests for the next
tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
fetch_more_data_begin() following its own decision." but this doesn't
seem to be "prefetching".

> > > which I revived from Robert's original patch. I think that that would
> > > be more efficient, because the FDW would no longer need to wait until
> > > all buffered tuples are returned to the core. In the WIP patch, I
> >
> > I don't understand. My patch sends a prefetch-query as soon as all the
> > tuples of the last remote-request is stored into FDW storage. The
> > reason for removing ExecAsyncNotify() was it is just redundant as far
> > as concerning Append asynchrony. But I particulary oppose to revive
> > the function.
>
> Sorry, my explanation was not good, but what I'm saying here is about
> my patch, not your patch. I think this FDW callback routine would be
> useful; it allows an FDW to perform another asynchronous data fetch
> before delivering a tuple to the core as discussed in [1]. Also, it
> would be useful when extending to the case where we have intermediate
> nodes between an Append and a ForeignScan such as joins or aggregates,
> which I'll explain below.

Yeah. If a not-immediate parent of an async-capable node works as
async-aware, the notify API would have the power. So I don't object to
the API.

> > > only allowed the callback routine to put the corresponding ForeignScan
> > > node into a state where it’s either ready for a new request or needing
> > > a callback for another data fetch, but I think we could probably relax
> > > the restriction so that the ForeignScan node can be put into another
> > > state where it’s ready for a new request while needing a callback for
> > > the prefetch.
> >
> > I don't understand this, too. ExecAsyncNotify() doesn't touch any of
> > the bitmaps, as_needrequest, callback_pending nor as_asyncpending in
> > your patch. Am I looking into something wrong? I'm looking
> > async-wip-2020-11-17.patch.
>
> In the WIP patch I post, these bitmaps are modified in the core side
> based on the callback_pending and request_complete flags in
> AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).

Sorry. I think I misread you here. I agree that, the notify API is not
so useful now but would be useful if we allow notify descendents other
than immediate children. However, I stumbled on the fact that some
kinds of node doesn't return a result when all the underlying nodes
returned *a* tuple. Concretely count(*) doesn't return after *all*
tuple of the counted relation has been returned. I remember that the
fact might be the reason why I removed the API. After all the topmost
async-aware node must ask every immediate child if it can return a
tuple.

> > (By the way, it is one of those that make the code hard to read to me
> > that the "callback" means "calling an API function". I think none of
> > them (ExecAsyncBegin, ExecAsyncRequest, ExecAsyncNotify) are a
> > "callback".)
>
> I thought the word “callback” was OK, because these functions would
> call the corresponding FDW callback routines, but I’ll revise the
> wording.

I'm not confident on the usage of "callback", though:p (Sorry.) I
believe that "callback" is a function a caller tells a callee to call
it. In broader meaning, all FDW APIs are a function that an FDW
extention tells the core to call it (yeah, that's inversed.). However,
we don't call fread a callback of libc. They work based on slightly
different mechanism but substantially the same, I think.

> > > The reason why I disabled async execution when executing EPQ is to
> > > avoid sending asynchronous queries to the remote sides, which would be
> > > useless, because scan tuples for an EPQ recheck are obtained in a
> > > dedicated way.
> >
> > If EPQ is performed onto Append, I think it should gain from
> > asynchronous execution since it is going to fetch *a* tuple from
> > several partitions or children. I believe EPQ doesn't contain Append
> > in major cases, though. (Or I didn't come up with the steps for the
> > case to happen...)
>
> Sorry, I don’t understand this part. Could you elaborate a bit more on it?

EPQ retrieves a specific tuple from a node. If we perform EPQ on an
Append, only one of the children should offer a result tuple. Since
Append has no idea of which of its children will offer a result, it
has no way other than asking all children until it receives a
result. If we do that, asynchronously sending a query to all nodes
would win.

> > > What do you mean by "push-up style executor"?
> >
> > The reverse of the volcano-style executor, which enters from the
> > topmost node and down to the bottom. In the "push-up stule executor",
> > the bottom-most nodes fires by a certain trigger then every
> > intermediate nodes throws up the result to the parent until reaching
> > the topmost node.
>
> That is what I'm thinking to be able to support the case I mentioned
> above. I think that that would allow us to find ready subplans
> efficiently from occurred wait events in ExecAppendAsyncEventWait().
> Consider a plan like this:
>
> Append
> -> Nested Loop
> -> Foreign Scan on a
> -> Foreign Scan on b
> -> ...
>
> I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
> Loop are all async-capable and that we have somewhere in the executor
> an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
> Scan on a", an AsyncRequest with requestor="Nested Loop" and
> requestee="Foreign Scan on b", and an AsyncRequest with
> requestor="Append" and requestee="Nested Loop". In
> ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
> becomes ready, we would call ForeignAsyncNotify() for a, and if it
> returns a tuple back to the requestor node (ie, Nested Loop) (using
> ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
> Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the
> inner requestee node (ie, Foreign Scan on b; I assume here that it is
> a foreign scan parameterized by a). If Foreign Scan on b returns a
> tuple back to the requestor node (ie, Nested Loop) (using
> ExecAsyncResponse()), then Nested Loop would match the tuples from the
> outer and inner sides. If they match, the join result would be
> returned back to the requestor node (ie, Append) (using
> ExecAsyncResponse()), marking the Nested Loop subplan as
> as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest()
> for the inner requestee node for the next tuple, and so on. If
> ExecAsyncRequest() can't return a tuple immediately, we would wait
> until a file descriptor for foreign table b becomes ready; we would
> start from calling ForeignAsyncNotify() for b when the file descriptor
> becomes ready. In this way we could find ready subplans efficiently
> from occurred wait events in ExecAppendAsyncEventWait() when extending
> to the case where subplans are joins or aggregates over Foreign Scans,
> I think. Maybe I’m missing something, though.

Maybe so. As I mentioned above, in the follwoing case..

Join -1
Join -2
ForegnScan -A
ForegnScan -B
ForegnScan -C

Where the Join-1 is the leader of asynchronous fetching. Even if both
of the FS-A,B have returned one tuple each, it's unsure that Join-2
returns a tuple. I'm not sure how to resolve the situation with the
current infrastructure as-is.

So I tried a structure where when a node gets a new tuple, the node
asks the parent whether it is satisfied or not. In that trial I needed
to make every execnodes a state machine and that was pretty messy..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2021-01-15 07:56:27 Re: Is it worth accepting multiple CRLs?
Previous Message japin 2021-01-15 07:49:19 Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION