| From: | Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: CustomScan under the Gather node? | 
| Date: | 2016-01-28 15:50:59 | 
| Message-ID: | 9A28C8860F777E439AA12E8AEA7694F8011A3071@BPXM15GP.gisp.nec.co.jp | 
| Lists: | pgsql-hackers | 
> If I were to make a proof-of-concept patch for the interface itself,
> file_fdw seems to me a good candidate for this enhancement.
> It is not a good fit for postgres_fdw.
>
The attached patch enhances the FDW/CSP interface and adds a PoC feature
to file_fdw that scans the source file partially. The enhancement turned
out to be smaller than I expected.
It works as follows. This query reads 20M rows from a CSV file,
using 3 background worker processes.
postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Gather  (cost=1000.00..194108.60 rows=94056 width=52)
         (actual time=0.570..19268.010 rows=2000000 loops=1)
   Number of Workers: 3
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
                                  (actual time=0.180..12744.655 rows=500000 loops=4)
         Filter: ((id % 20) = 6)
         Rows Removed by Filter: 9500000
         Foreign File: /tmp/testdata.csv
         Foreign File Size: 1504892535
 Planning time: 0.147 ms
 Execution time: 19330.201 ms
(9 rows)
I'm not 100% certain whether this file_fdw implementation is a reasonable
way to do a partial read. However, callbacks invoked from the following
functions were enough to implement parallel-aware custom logic based on
the coordination information.
> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker
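To illustrate how those three hook points could coordinate a partial file scan, here is a minimal standalone sketch. It is not code from the attached patch: the struct, the function names, and the fixed-size chunking are all assumptions made for illustration, and the shared state is a plain struct standing in for a DSM segment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical coordination state; a real FDW would place it in the DSM. */
typedef struct PartialFileScanShared
{
    uint64_t         file_size;    /* total bytes in the source file */
    uint64_t         chunk_size;   /* bytes handed out per claim */
    _Atomic uint64_t next_offset;  /* first byte not yet claimed */
} PartialFileScanShared;

/* Hook point 1 (cf. ExecParallelEstimate): report the shared memory needed. */
static size_t
partial_scan_estimate(void)
{
    return sizeof(PartialFileScanShared);
}

/* Hook point 2 (cf. ExecParallelInitializeDSM): the leader fills the state. */
static void
partial_scan_init_shared(PartialFileScanShared *shared,
                         uint64_t file_size, uint64_t chunk_size)
{
    assert(chunk_size > 0);
    shared->file_size = file_size;
    shared->chunk_size = chunk_size;
    atomic_init(&shared->next_offset, 0);
}

/* Per-process scan state (hypothetical). */
typedef struct PartialFileScanLocal
{
    PartialFileScanShared *shared;
} PartialFileScanLocal;

/* Hook point 3 (cf. ExecParallelInitializeWorker): a worker attaches. */
static void
partial_scan_init_worker(PartialFileScanLocal *local,
                         PartialFileScanShared *shared)
{
    local->shared = shared;
}

/*
 * Called during the scan by the leader and by every worker: atomically
 * claim the next byte range [*start, *end). Returns false once the file
 * has been fully handed out.
 */
static bool
partial_scan_claim_chunk(PartialFileScanLocal *local,
                         uint64_t *start, uint64_t *end)
{
    PartialFileScanShared *shared = local->shared;
    uint64_t off = atomic_fetch_add(&shared->next_offset, shared->chunk_size);

    if (off >= shared->file_size)
        return false;
    *start = off;
    *end = off + shared->chunk_size;
    if (*end > shared->file_size)
        *end = shared->file_size;   /* the last chunk may be short */
    return true;
}
```

Each participant would loop on partial_scan_claim_chunk() and parse only the rows in its own range. A real CSV reader would also have to align chunk boundaries to line ends, which this sketch leaves out.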
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > > What enhancements are needed to implement a feature similar to
> > > partial seq-scan using the custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need an equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics.  Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query.  If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which would not
> > be part of the same transaction and which would therefore not see the
> > update.  That's a problem.
> >
> Ah, yes. As long as the FDW driver ensures the remote session has no
> uncommitted data, pg_export_snapshot() might give us an opportunity.
> However, once a session writes something, the FDW driver has to prohibit it.
> 
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting.  Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader.  Ouch.  I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not.  I think.
> >
> Yes, I have the same opinion. Local parallelism is probably not
> valuable for the class of FDWs that obtain data from a remote
> server (e.g., postgres_fdw, ...), except for the case where packing
> and unpacking cost over the network is the major bottleneck.
> 
> On the other hand, it will be valuable for the class of FDWs that
> act as a wrapper around a local data structure, as the current
> partial seq-scan does (e.g., file_fdw, ...).
> Their data source is not under transaction control, and the 'remote
> execution' of these FDWs is ultimately performed on local
> computing resources.
> 
> If I were to make a proof-of-concept patch for the interface itself,
> file_fdw seems to me a good candidate for this enhancement.
> It is not a good fit for postgres_fdw.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
| Attachment | Content-Type | Size | 
|---|---|---|
| pgsql-v9.6-parallel-cspfdw.v1.patch | application/octet-stream | 25.0 KB | 
|   | image/png | 21.5 KB | 