Re: Asynchronous Append on postgres_fdw nodes.

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: thomas(dot)munro(at)gmail(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Asynchronous Append on postgres_fdw nodes.
Date: 2020-03-05 08:44:24
Message-ID: 20200305.174424.937266824649246262.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for the comment.

At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote in
> On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > - v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
> > The async feature uses WaitEvent, and it needs to be released on
> > error. This patch makes it possible to register WaitEvent to
> > resowner to handle that case..
>
> +1
>
> > - v2-0002-infrastructure-for-asynchronous-execution.patch
> > It povides an abstraction layer of asynchronous behavior
> > (execAsync). Then adds ExecAppend, another version of ExecAppend,
> > that handles "async-capable" subnodes asynchronously. Also it
> > contains planner part that makes planner aware of "async-capable"
> > and "async-aware" path nodes.
>
> > This patch add an infrastructure for asynchronous execution. As a PoC
> > this makes only Append capable to handle asynchronously executable
> > subnodes.
>
> What other nodes do you think could be async aware? I suppose you
> would teach joins to pass through the async support of their children,
> and then you could make partition-wise join work like that.

An Append node is fed from many immediate-child async-capable nodes,
so the Apeend node can pick any child node that have fired.

Unfortunately joins are not wide but deep. That means a set of
async-capable nodes have multiple async-aware (and async capable at
the same time for intermediate nodes) parent nodes. So if we want to
be async on that configuration, we need "push-up" executor engine. In
my last trial, ignoring performane, I could turn almost all nodes into
push-up style but a few nodes, like WindowAgg or HashJoin have a quite
low affinity with push-up style since the caller sites to child nodes
are many and scattered. I got through the low-affinity by turning the
nodes into state machines, but I don't think it is good.

> + /* choose appropriate version of Exec function */
> + if (appendstate->as_nasyncplans == 0)
> + appendstate->ps.ExecProcNode = ExecAppend;
> + else
> + appendstate->ps.ExecProcNode = ExecAppendAsync;
>
> Cool. No extra cost for people not using the new feature.

It creates some duplicate code but I agree on the performance
perspective.

> + slot = ExecProcNode(subnode);
> + if (subnode->asyncstate == AS_AVAILABLE)
>
> So, now when you execute a node, you get a result AND you get some
> information that you access by reaching into the child node's
> PlanState. The ExecProcNode() interface is extremely limiting, but
> I'm not sure if this is the right way to extend it. Maybe
> ExecAsyncProcNode() with a wide enough interface to do the job?

Sounds resonable and seems easy to do.

> +Bitmapset *
> +ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
> +{
> + static int *refind = NULL;
> + static int refindsize = 0;
> ...
> + if (refindsize < n)
> ...
> + static ExecAsync_mcbarg mcb_arg =
> + { &refind, &refindsize };
> + static MemoryContextCallback mcb =
> + { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
> ...
> + MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
>
> This seems a bit strange. Why not just put the pointer in the plan
> state? I suppose you want to avoid allocating a new buffer for every
> query. Perhaps you could fix that by having a small fixed-size buffer
> in the PlanState to cover common cases and allocating a larger one in
> a per-query memory context if that one is too small, or just not
> worrying about it and allocating every time since you know the desired
> size.

The most significant factor for the shape would be ExecAsync is not a
kind of ExecNode. So ExecAsyncEventWait doen't have direcgt access to
EState other than one of given mutiple nodes. I consider tryig to use
given ExecNodes as an access path to ESttate.

> + wes = CreateWaitEventSet(TopTransactionContext,
> TopTransactionResourceOwner, n);
> ...
> + FreeWaitEventSet(wes);
>
> BTW, just as an FYI, I am proposing[1] to add support for
> RemoveWaitEvent(), so that you could have a single WaitEventSet for
> the lifetime of the executor node, and then add and remove sockets
> only as needed. I'm hoping to commit that for PG13, if there are no
> objections or better ideas soon, because it's useful for some other
> places where we currently create and destroy WaitEventSets frequently.

Yes! I have wanted that (but haven't done by myself..., and I didn't
understand the details from the title "Reducint WaitEventSet syscall
churn":p)

> One complication when working with long-lived WaitEventSet objects is
> that libpq (or some other thing used by some other hypothetical
> async-capable FDW) is free to close and reopen its socket whenever it
> wants, so you need a way to know when it has done that. In that patch
> set I added pqSocketChangeCount() so that you can see when pgSocket()
> refers to a new socket (even if the file descriptor number is the same
> by coincidence), but that imposes some book-keeping duties on the
> caller, and now I'm wondering how that would look in your patch set.

As for postgres-fdw, unsponaneous disconnection immedately leands to
query ERROR.

> My goal is to generate the minimum number of systems calls. I think
> it would be nice if a 1000-shard query only calls epoll_ctl() when a
> child node needs to be added or removed from the set, not
> epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
> wait. But I suppose there is an argument that it's more complication
> than it's worth.
>
> [1] https://commitfest.postgresql.org/27/2452/

I'm not sure how it gives performance gain, but reducing syscalls
itself is good. I'll look on it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-03-05 08:45:34 Re: logical replication empty transactions
Previous Message Daniel Gustafsson 2020-03-05 08:40:07 Re: PG_COLOR not mentioned in docs of vacuumlo, oid2name and pgbench