Re: Asynchronous Append on postgres_fdw nodes.

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Asynchronous Append on postgres_fdw nodes.
Date: 2021-01-18 04:06:23
Message-ID: CAPmGK17uiUOACYwVxre-qmjYeurhEPEEwTd4Rm4v-pXHRL8KvA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 15, 2021 at 4:54 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Mmm. I meant that the function explicitly calls
> ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
> (if needed). Conversely, if the function doesn't call
> ExecAppendAsyncRequest(), the next request to remote doesn't
> happen. That is, after the tuple buffer of FDW-side is exhausted, the
> next request doesn't happen until executor requests for the next
> tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
> fetch_more_data_begin() following its own decision." but this doesn't
> seem to be "prefetching".

Let me explain a bit more. Actually, the new version of the patch
allows prefetching on the FDW side; for such prefetching in
postgres_fdw, I think we could add a fetch_more_data_begin() call in
postgresForeignAsyncNotify(). But I left that for future work,
because we don’t know yet if that’s really useful. (Another reason
why I left that is we have more important issues that should be
addressed [1], and I think addressing those issues is a requirement
for us to commit this patch, but adding such prefetching isn’t, IMO.)
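
To illustrate what such prefetching might look like, here is a toy
Python model. It is not postgres_fdw code: the class, its fields, and
the drain() driver are hypothetical, and only the names
fetch_more_data_begin, request, and notify echo the functions discussed
above. With prefetch enabled, the notify step starts the next fetch as
soon as a batch arrives instead of waiting for the executor to ask.

```python
class ToyForeignScan:
    """Hypothetical model of a foreign scan's tuple buffer."""

    def __init__(self, remote_batches, prefetch=False):
        self.remote = list(remote_batches)  # batches still on the remote side
        self.buffer = []                    # tuples already fetched
        self.in_flight = False              # a fetch was sent but not yet received
        self.prefetch = prefetch
        self.log = []                       # which callback started each fetch

    def fetch_more_data_begin(self, caller):
        # Start the next fetch without waiting for its result.
        if self.remote and not self.in_flight:
            self.in_flight = True
            self.log.append(caller)

    def async_request(self):
        """Return a buffered tuple, or None after starting a fetch if needed."""
        if self.buffer:
            return self.buffer.pop(0)
        self.fetch_more_data_begin("request")
        return None

    def async_notify(self):
        """Called when the connection is readable: absorb the pending batch."""
        if not self.in_flight:
            return
        self.buffer.extend(self.remote.pop(0))
        self.in_flight = False
        if self.prefetch:
            # The possible addition: begin the next fetch right away.
            self.fetch_more_data_begin("notify")


def drain(scan):
    """Toy executor loop: request tuples, and notify when a fetch is pending."""
    rows = []
    while True:
        row = scan.async_request()
        if row is not None:
            rows.append(row)
        elif scan.in_flight:
            scan.async_notify()
        else:
            return rows
```

In this model both variants return the same rows; the only difference
is that with prefetch=True the second fetch is started from the notify
step rather than from the next request.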

> Sorry. I think I misread you here. I agree that the notify API is not
> so useful now but would be useful if we allow notifying descendants
> other than immediate children. However, I stumbled on the fact that some
> kinds of nodes don't return a result when all the underlying nodes
> have returned *a* tuple. Concretely, count(*) doesn't return until *all*
> tuples of the counted relation have been returned. I remember that this
> fact might be the reason why I removed the API. After all, the topmost
> async-aware node must ask every immediate child if it can return a
> tuple.

The patch I posted, which revived Robert’s original patch using stuff
from your patch and Thomas’, provides ExecAsyncRequest() as well as
ExecAsyncNotify(). ExecAsyncRequest() supports pull-based execution
like ExecProcNode(), while ExecAsyncNotify() supports push-based
execution. In the aggregate case you mentioned, I think we could
call ExecAsyncRequest() repeatedly for the underlying subplan to get
all tuples from it, in a similar way to ExecProcNode() in the normal
case.
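
As a rough illustration of that pull loop, here is a toy Python model
(not executor code). ToyAsyncScan and count_star are hypothetical
names, and the sketch assumes the subplan can always answer a request
immediately; the waiting case is left out on purpose.

```python
class ToyAsyncScan:
    """Hypothetical async-capable subplan whose tuples are always ready."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def async_request(self):
        # Stands in for ExecAsyncRequest(): returns (request_complete, tuple).
        # A tuple of None with request_complete=True means end of scan.
        return True, next(self._rows, None)


def count_star(subplan):
    """Request tuples one at a time and emit a single count at the end."""
    count = 0
    while True:
        request_complete, tup = subplan.async_request()
        if not request_complete:
            raise NotImplementedError("waiting is outside this sketch")
        if tup is None:
            return count
        count += 1
```

The aggregate produces nothing until the subplan reports end of scan,
which is the behavior you pointed out for count(*).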

> EPQ retrieves a specific tuple from a node. If we perform EPQ on an
> Append, only one of the children should offer a result tuple. Since
> Append has no idea of which of its children will offer a result, it
> has no way other than asking all children until it receives a
> result. If we do that, asynchronously sending a query to all nodes
> would win.

Thanks for the explanation! But I’m still not sure why we need to
send an asynchronous query to each of the asynchronous nodes in an EPQ
recheck. Is it possible to explain a bit more about that?

I wrote:
> > That is what I'm thinking to be able to support the case I mentioned
> > above. I think that that would allow us to find ready subplans
> > efficiently from occurred wait events in ExecAppendAsyncEventWait().
> > Consider a plan like this:
> >
> > Append
> >   -> Nested Loop
> >     -> Foreign Scan on a
> >     -> Foreign Scan on b
> >   -> ...
> >
> > I assume here that Foreign Scan on a, Foreign Scan on b, and Nested
> > Loop are all async-capable and that we have somewhere in the executor
> > an AsyncRequest with requestor="Nested Loop" and requestee="Foreign
> > Scan on a", an AsyncRequest with requestor="Nested Loop" and
> > requestee="Foreign Scan on b", and an AsyncRequest with
> > requestor="Append" and requestee="Nested Loop". In
> > ExecAppendAsyncEventWait(), if a file descriptor for foreign table a
> > becomes ready, we would call ForeignAsyncNotify() for a, and if it
> > returns a tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then *ForeignAsyncNotify() would be called for
> > Nested Loop*. Nested Loop would then call ExecAsyncRequest() for the
> > inner requestee node (ie, Foreign Scan on b; I assume here that it is
> > a foreign scan parameterized by a). If Foreign Scan on b returns a
> > tuple back to the requestor node (ie, Nested Loop) (using
> > ExecAsyncResponse()), then Nested Loop would match the tuples from the
> > outer and inner sides. If they match, the join result would be
> > returned back to the requestor node (ie, Append) (using
> > ExecAsyncResponse()), marking the Nested Loop subplan as
> > as_needrequest. Otherwise, Nested Loop would call ExecAsyncRequest()
> > for the inner requestee node for the next tuple, and so on. If
> > ExecAsyncRequest() can't return a tuple immediately, we would wait
> > until a file descriptor for foreign table b becomes ready; we would
> > start from calling ForeignAsyncNotify() for b when the file descriptor
> > becomes ready. In this way we could find ready subplans efficiently
> > from occurred wait events in ExecAppendAsyncEventWait() when extending
> > to the case where subplans are joins or aggregates over Foreign Scans,
> > I think. Maybe I’m missing something, though.

> Maybe so. As I mentioned above, in the following case..
>
> Join -1
>   Join -2
>     ForeignScan -A
>     ForeignScan -B
>   ForeignScan -C
>
> Where the Join-1 is the leader of asynchronous fetching. Even if both
> of the FS-A,B have returned one tuple each, it's unsure that Join-2
> returns a tuple. I'm not sure how to resolve the situation with the
> current infrastructure as-is.

Maybe my explanation was not good, so let me explain a bit more.
Assume that Join-2 is a nested loop join as shown above. If the
tuples from the outer/inner sides didn’t match, we could keep
calling *ExecAsyncRequest()* for the inner side until a matching tuple
from it is found. If the inner side wasn’t able to return a tuple
immediately, 1) it would return request_complete=false to Join-2 using
ExecAsyncResponse(), 2) we could wait for a file descriptor for
the inner side to become ready (while processing other parts of the
Append tree), and 3) when the file descriptor becomes ready, recursive
ExecAsyncNotify() calls would restart the Join-2 processing in a
push-based manner as explained above.
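
The three steps above can be sketched with a toy Python model. This is
only an illustration under simplifying assumptions: the classes and
run_join() are hypothetical, the join condition is plain equality
against a single outer tuple, and each "file descriptor ready" event
delivers one batch of inner tuples.

```python
class InnerScan:
    """Hypothetical async inner side; one batch arrives per ready event."""

    def __init__(self, batches):
        self.batches = list(batches)  # batches not yet received
        self.buffer = []              # tuples already received

    def request(self):
        # Models ExecAsyncRequest(): returns (request_complete, tuple).
        if self.buffer:
            return True, self.buffer.pop(0)
        if not self.batches:
            return True, None         # end of scan
        return False, None            # step 1: request_complete=false

    def notify(self):
        # Models ExecAsyncNotify(): the descriptor is ready, absorb a batch.
        self.buffer.extend(self.batches.pop(0))
        return self.request()


class NestedLoop:
    """Toy Join-2 for one outer tuple; matches on equality."""

    def __init__(self, outer_tuple, inner):
        self.outer = outer_tuple
        self.inner = inner

    def _consume(self, complete, tup):
        # Keep requesting inner tuples until a match, end of scan, or a wait.
        while complete:
            if tup is None:
                return True, None
            if tup == self.outer:
                return True, (self.outer, tup)
            complete, tup = self.inner.request()
        return False, None

    def request(self):
        return self._consume(*self.inner.request())

    def notify(self):
        # Step 3: the notification restarts the join where it left off.
        return self._consume(*self.inner.notify())


def run_join(join):
    """Drive the join as a parent node might; count how often it had to wait."""
    waits = 0
    complete, result = join.request()
    while not complete:
        waits += 1                    # step 2: wait for the inner side
        complete, result = join.notify()
    return result, waits
```

For example, run_join(NestedLoop(3, InnerScan([[1, 2], [3, 4]]))) has to
wait twice in this model before the inner side delivers the matching
tuple.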

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK14xrGe%2BXks7%2BfVLBoUUbKwcDkT9km1oFXhdY%2BFFhbMjUg%40mail.gmail.com
