Re: CustomScan under the Gather node?

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CustomScan under the Gather node?
Date: 2016-01-28 15:50:59
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8011A3071@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
The attached patch is enhancement of FDW/CSP interface and PoC feature
of file_fdw to scan source file partially. It was smaller enhancement
than my expectations.

It works as follows. This query tried to read 20M rows from a CSV file,
using 3 background worker processes.

postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..194108.60 rows=94056 width=52)
(actual time=0.570..19268.010 rows=2000000 loops=1)
Number of Workers: 3
-> Parallel Foreign Scan on test_csv (cost=0.00..183703.00 rows=94056 width=52)
(actual time=0.180..12744.655 rows=500000 loops=4)
Filter: ((id % 20) = 6)
Rows Removed by Filter: 9500000
Foreign File: /tmp/testdata.csv
Foreign File Size: 1504892535
Planning time: 0.147 ms
Execution time: 19330.201 ms
(9 rows)

I'm not 100% certain whether this implementation of file_fdw is reasonable
for partial read, however, the callbacks located on the following functions
enabled to implement a parallel-aware custom logic based on the coordination
information.

> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

> -----Original Message-----
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
>
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > > What enhancement will be necessary to implement similar feature of
> > > partial seq-scan using custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics. Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query. If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which not
> > be part of the same transaction and which would therefore not see the
> > update. That's a problem.
> >
> Ah, yes, as long as FDW driver ensure the remote session has no
> uncommitted data, pg_export_snapshot() might provide us an opportunity,
> however, once a session writes something, FDW driver has to prohibit it.
>
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting. Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader. Ouch. I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not. I think.
> >
> Yes, I also have same opinion. Likely, local parallelism is not
> valuable for the class of FDWs that obtains data from the remote
> server (e.g, postgres_fdw, ...), expect for the case when packing
> and unpacking cost over the network is major bottleneck.
>
> On the other hands, it will be valuable for the class of FDW that
> performs as a wrapper to local data structure, as like current
> partial seq-scan doing. (e.g, file_fdw, ...)
> Its data source is not under the transaction control, and 'remote
> execution' of these FDWs are eventually executed on the local
> computing resources.
>
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

Attachment Content-Type Size
pgsql-v9.6-parallel-cspfdw.v1.patch application/octet-stream 25.0 KB
image/png 21.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2016-01-28 16:04:30 Re: Additional role attributes && superuser review
Previous Message Stephen Frost 2016-01-28 15:40:09 Re: Template for commit messages