Re: Question about optimising (Postgres_)FDW

From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Question about optimising (Postgres_)FDW
Date: 2014-04-16 16:10:02
Message-ID: 534EAB5A.2040708@krosing.net
Lists: pgsql-hackers

On 04/16/2014 03:16 PM, Hannu Krosing wrote:
> On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
>> (2014/04/16 6:55), Hannu Krosing wrote:
> ...
>> Maybe I'm missing something, but I think that you can do what I think
>> you'd like to do by the following procedure:
> No, what I'd like PostgreSQL to do is to
>
> 1. select the id set from the local table
> 2. select the rows from the remote table with WHERE id IN (<set selected in step 1>)
> 3. then join the original set to the selected set, using any suitable join strategy
>
> The things I do not want are
>
> A. selecting all rows from the remote table
>    (this is what your examples below do)
>
> or
>
> B. selecting rows from the remote table via single selects using "ID = $"
>    (this is something I managed to get by tweaking the costs)
>
> as A will always be slow if there are millions of rows in the remote table,
> and B is slow(ish) once the id set grows beyond a few hundred ids.
>
> I hope this is a better explanation than the one I provided before.
>
> Cheers
> Hannu
>
> P.S. I am not sure if this is a limitation of postgres_fdw or postgres
> itself
>
> P.P.S. I tested a little with Multicorn, and PostgreSQL did not request row
> counts for any IN plans, so it may be that the planner does not consider this
> kind of plan at all. (Testing was on PostgreSQL 9.3.4.)
>
> Hannu
Here is also a sample run of the two plans to illustrate my point.
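For context, this is roughly the setup the plans below assume. The DDL is written
from memory as a minimal sketch rather than the exact script I ran; the server name
"loopback" is made up, and the remote side simply holds another copy of onemillion.

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'testdb');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

-- local table with one million rows
CREATE TABLE onemillion (
    id       serial PRIMARY KEY,
    inserted timestamptz DEFAULT now(),
    data     text
);
INSERT INTO onemillion (data)
SELECT random()::text FROM generate_series(1, 1000000);

-- foreign table mapped onto the remote copy of the same table
CREATE FOREIGN TABLE onemillion_pgfdw (
    id       int,
    inserted timestamptz,
    data     text
) SERVER loopback OPTIONS (schema_name 'public', table_name 'onemillion');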

How it is run now:

testdb=# explain analyse verbose
select r.data, l.data
from onemillion_pgfdw r
join onemillion l
on r.id = l.id and l.id between 100000 and 100100;

                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=111.61..198.40 rows=1 width=16) (actual time=7534.360..8731.541 rows=101 loops=1)
   Output: r.data, l.data
   Hash Cond: (r.id = l.id)
   ->  Foreign Scan on public.onemillion_pgfdw r  (cost=100.00..178.25 rows=2275 width=12) (actual time=1.628..8364.688 rows=1000000 loops=1)
         Output: r.id, r.inserted, r.data
         Remote SQL: SELECT id, data FROM public.onemillion
   ->  Hash  (cost=10.39..10.39 rows=98 width=12) (actual time=0.179..0.179 rows=101 loops=1)
         Output: l.data, l.id
         Buckets: 1024  Batches: 1  Memory Usage: 5kB
         ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..10.39 rows=98 width=12) (actual time=0.049..0.124 rows=101 loops=1)
               Output: l.data, l.id
               Index Cond: ((l.id >= 100000) AND (l.id <= 100100))
 Total runtime: 8732.213 ms
(13 rows)

Time: 8733.799 ms

And how the above query should be planned/executed:

testdb=# explain analyse verbose
select r.data, l.data
from (select * from onemillion_pgfdw where id = any (array(select id
from onemillion where id between 100000 and 100100))) r
join onemillion l
on r.id = l.id;

                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=110.81..1104.30 rows=111 width=16) (actual time=2.756..3.738 rows=101 loops=1)
   Output: onemillion_pgfdw.data, l.data
   InitPlan 1 (returns $0)
     ->  Index Only Scan using onemillion_pkey on public.onemillion  (cost=0.42..10.39 rows=98 width=4) (actual time=0.055..0.118 rows=101 loops=1)
           Output: onemillion.id
           Index Cond: ((onemillion.id >= 100000) AND (onemillion.id <= 100100))
           Heap Fetches: 101
   ->  Foreign Scan on public.onemillion_pgfdw  (cost=100.00..163.41 rows=111 width=12) (actual time=2.729..3.012 rows=101 loops=1)
         Output: onemillion_pgfdw.id, onemillion_pgfdw.inserted, onemillion_pgfdw.data
         Remote SQL: SELECT id, data FROM public.onemillion WHERE ((id = ANY ($1::integer[])))
   ->  Index Scan using onemillion_pkey on public.onemillion l  (cost=0.42..8.37 rows=1 width=12) (actual time=0.005..0.006 rows=1 loops=101)
         Output: l.id, l.inserted, l.data
         Index Cond: (l.id = onemillion_pgfdw.id)
 Total runtime: 4.469 ms
(14 rows)

Time: 6.437 ms
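As an aside, the cost tweaking I mentioned under point B above was along these
lines; the option values here are illustrative only (written from memory), and the
server/table names match the sketch earlier in this mail:

ALTER SERVER loopback OPTIONS (ADD fdw_tuple_cost '1.0');
ALTER FOREIGN TABLE onemillion_pgfdw OPTIONS (ADD use_remote_estimate 'true');

Raising fdw_tuple_cost makes the full remote scan look expensive enough that the
planner prefers per-row remote lookups ("ID = $") inside a nested loop, which is
fine for a handful of ids but degrades once the id set reaches a few hundred.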

>> postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
>> ALTER SERVER
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.690 ms
>> (14 rows)
>>
>> or, as Tom mentioned, by disabling the use_remote_estimate option:
>>
>> postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
>> use_remote_estimate 'false');
>> ALTER FOREIGN TABLE
>> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
>> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
>>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>>    ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
>>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>>          Output: onemillion.id
>>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>>                Output: onemillion.id
>>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>>                      Output: onemillion.id
>>                      Filter: (onemillion.data > '0.9'::text)
>>  Planning time: 0.215 ms
>> (14 rows)
>>
>> Thanks,
>>
>> Best regards,
>> Etsuro Fujita
>>
>>
