From: Stephen Frost <sfrost(at)snowman(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: postgres_fdw & async queries
Date: 2013-11-04 03:26:04
Message-ID: 20131104032604.GB2706@tamriel.snowman.net

Greetings,

I was poking around the Append node and considering my earlier
suggestion to build an Async method of pulling data out of nodes under
Append which support that option. It looks workable, but I was also
considering a simpler approach: changing postgres_fdw to use async
queries which are started at ExecInit time instead.

A few more specifics about what I've been thinking:

#1 - Add an Async mechanism to the executor

- ExecAsyncCapable - Check if the node supports async.
- ExecAsyncStart - Begin an async request from the node
- ExecAsyncGetFD - Return FD to be used with select()
- ExecAsyncConsume - Consume data from FD
- ExecAsyncBusy - Check if getResult would block
- ExecAsyncScan - Get next tuple
- All of the above would be added to the FDW API as well; a rough
  sketch of the declarations is below.
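
For concreteness, here's what those declarations might look like.
None of this exists, obviously- the PlanState-based signatures are
just my guess at the natural shape:

    /* Hypothetical declarations for the proposed executor Async API,
     * modeled on libpq's async functions.  Nothing here exists today. */
    extern bool ExecAsyncCapable(PlanState *node);   /* node supports async? */
    extern void ExecAsyncStart(PlanState *node);     /* kick off the request */
    extern int  ExecAsyncGetFD(PlanState *node);     /* FD for select() */
    extern void ExecAsyncConsume(PlanState *node);   /* pull data off the FD */
    extern bool ExecAsyncBusy(PlanState *node);      /* would scan block? */
    extern TupleTableSlot *ExecAsyncScan(PlanState *node);  /* next tuple */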

This is clearly modeled on the libpq async system and can likely
be simplified, but it feels like we'd want at least
GetFD/Consume/Busy, to allow us to build a select() loop inside
Append which would wait until data is available somewhere and then
work through all of the nodes (dealing with ones which are done and
don't return anything) until it's able to return a row back up. As
an aside- I've always wondered if we should have an explicit bulk
mechanism instead of one-row-at-a-time, and perhaps we could build
that into this API. I was also thinking we would have an identifier
of some kind, provided through the API, to indicate which I/O
channels are shared. Append could then be set up with a two-level
list of paths to walk: issue one request against each distinct I/O
channel, and simply move on to the next entry for the same I/O
channel when the prior one completes.
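
To make the select() idea concrete, here's a hand-wavy sketch of the
loop ExecAppend might grow. This is illustration, not working code;
async_child_active() is a made-up helper meaning "started via
ExecAsyncStart and not yet exhausted":

    for (;;)
    {
        fd_set      rfds;
        int         maxfd = -1;
        int         i;

        /* Collect the FDs of all children which are still running */
        FD_ZERO(&rfds);
        for (i = 0; i < node->as_nplans; i++)
        {
            PlanState  *child = node->appendplans[i];

            if (async_child_active(child))
            {
                int         fd = ExecAsyncGetFD(child);

                FD_SET(fd, &rfds);
                maxfd = Max(maxfd, fd);
            }
        }

        /* every child exhausted- nothing left to return */
        if (maxfd < 0)
            return ExecClearTuple(node->ps.ps_ResultTupleSlot);

        /* sleep until at least one child has data for us */
        select(maxfd + 1, &rfds, NULL, NULL, NULL);

        for (i = 0; i < node->as_nplans; i++)
        {
            PlanState  *child = node->appendplans[i];

            if (async_child_active(child) &&
                FD_ISSET(ExecAsyncGetFD(child), &rfds))
            {
                ExecAsyncConsume(child);
                if (!ExecAsyncBusy(child))
                {
                    TupleTableSlot *slot = ExecAsyncScan(child);

                    if (!TupIsNull(slot))
                        return slot;
                    /* that child is done; loop and wait on the rest */
                }
            }
        }
    }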

#2 - Make postgres_fdw build/send an async query during ExecInitNode.

Currently, we wait to set up the remote cursor and fetch records
until we've actually been asked for a record- but then we go and try
to get 100 of them. It would seem like we could simply call
create_cursor() at the bottom of postgresBeginForeignScan and follow
that up with a PQsendQuery; postgresIterateForeignScan wouldn't
really change, except that it would no longer need to create the
cursor. Sure, we'd end up blocking if there isn't data available
for this node yet, but at least we'd get the query started during
Init across all the remote servers, which would certainly be a
massive improvement, and we wouldn't need to modify Append or the
FDW API at all. Practically speaking, we'd parallelize the initial
request of 100 tuples and then scan through the results in order,
meaning we'd only pull from one machine at a time once the result
set exceeds that initial 100 tuples per system. That said, if most
of the work on the remote systems is in getting the query started,
it would still be a win.
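
Roughly, I'm picturing something like this at the bottom of
postgresBeginForeignScan. create_cursor(), cursor_number, conn, and
pgfdw_report_error() are the existing postgres_fdw pieces; treat the
rest as a sketch, not a patch:

    PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
    char        sql[64];

    /* Open the remote cursor now, rather than on the first
     * IterateForeignScan call as we do today. */
    create_cursor(node);

    /* Fire off the first batch asynchronously and return immediately;
     * this mirrors the FETCH which fetch_more_data() currently runs
     * synchronously on the first iterate. */
    snprintf(sql, sizeof(sql), "FETCH 100 FROM c%u",
             fsstate->cursor_number);
    if (!PQsendQuery(fsstate->conn, sql))
        pgfdw_report_error(ERROR, NULL, fsstate->conn, false, sql);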

This goes against the current documentation, which explicitly states
that the actual scan should not be started until IterateForeignScan,
but it's not clear, to me at least, why that's strictly necessary.

One other thought in favor of #1 is that we could make other nodes,
such as SeqScan, support the new API, which would allow us to
parallelize across, say, multiple tablespaces (on the presumption that
they are independent I/O systems underneath, which may or may not be
true, of course; perhaps we could explicitly ask the user for the I/O
channel relationship during tablespace creation).

With either suggestion, if we're going to keep the one-connection
model, we would need to ensure that postgres_fdw works through the
entire Async query and stores the results before trying to issue
another one, as we can't have two Async queries running on the same
connection at the same time. I don't see that as a terrible issue
though- we're already fetching/caching up to 100 rows of data for
the foreign table anyway, and we'd still have the actual cursor. If
we get a request for a different cursor while we have an Async still
open, we'd just finish out the current Async request of 100-or-fewer
tuples, cache them, and then send a new request for the new relation,
as sketched below.
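
That "finish out the current request" step is plain libpq- block in
PQgetResult() until the in-flight FETCH completes, stash the rows,
and the connection is free again. Something like this, where
drain_pending_fetch() is a name I just made up:

    static void
    drain_pending_fetch(PgFdwScanState *fsstate)
    {
        PGresult   *res;

        /* PQgetResult blocks until the in-flight FETCH completes and
         * returns NULL once the command is fully consumed. */
        while ((res = PQgetResult(fsstate->conn)) != NULL)
        {
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
            {
                /* copy the rows into fsstate's tuple cache here */
            }
            PQclear(res);
        }
        /* conn is now free for the next relation's cursor/query */
    }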

This is all speculation at this point as I've not gotten down into
the details of trying to implement any of it, so please feel free to
point out any big holes. :)

Thoughts?

Thanks,

Stephen
