Enforcing Parameterised Nested Loop Join Order for Foreign Table Joins

From: Adam Zegelin <adam(at)relational(dot)io>
To: pgsql-general(at)postgresql(dot)org
Subject: Enforcing Parameterised Nested Loop Join Order for Foreign Table Joins
Date: 2013-03-18 04:09:22
Message-ID: 0DF4E77A-8276-44DC-84F9-0B588733484D@relational.io
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

I’m in the process of writing a Postgres FDW that can interface with web service endpoints. Certain FDW columns would act as web service parameters, while others would be the output.

For example:

adam=# select * from bing where query = 'xbox';
query | url | description
-------+---------------------------------+---------------------
xbox | http://www.xbox.com/en-AU/index | Xbox Australia is
⋮ | ⋮ | ⋮

For the simple cases, extracting the quals (such as var [query] = const “xbox”) works perfectly.

I’d like to join FDW tables with other tables, possibly local or foreign.
ex: `select * from search_terms, bing where bing.query = search_terms.term`, where `search_terms` is a local table.

Without any parameterised paths, Postgres, as expected, will attempt to perform unfiltered foreign sequence scans followed by a hash join of the two tables. Some service endpoints have no concept of unqualified queries. In the example above, a ‘sequence scan’ of Bing is a not possible.

I generate parameterised paths inside the FDW handler function `GetForeignPaths`. I call `create_foreignscan_path` with a set of req_outer relids found by scanning PlannerInfo’s eq_classes, left_join_clauses and right_join_clauses.

Bitmapset* outer_relids = NULL;

foreach(lc, root->eq_classes) {
EquivalenceClass* ec = (EquivalenceClass *) lfirst(lc);
if (ec->ec_members->length > 1)
outer_relids = bms_union(outer_relids, ec->ec_relids);
}

foreach(lc, list_union(root->left_join_clauses, root->right_join_clauses)) {
RestrictInfo *ri = (RestrictInfo *) lfirst(lc);
outer_relids = bms_union(outer_relids, ri->outer_relids);
}

Bitmapset* req_outer = bms_difference(outer_relids,
bms_make_singleton(baserel->relid));

foreignPath = create_foreignscan_path(root, baserel, nbrows, 0, 0, NIL, req_outer, NULL);

For simple joins this works. `BeginForeignScan` can access a list of quals with params, and the planner generates an appropriate nested loop join over the foreign table.

adam=# explain select * from search_terms, bing where bing.query = search_terms.term;
QUERY PLAN
-----------------------------------------------------------------------
Nested Loop (cost=0.00..49.30 rows=6550 width=96)
-> Seq Scan on search_terms (cost=0.00..23.10 rows=1310 width=32)
-> Foreign Scan on bing (cost=0.00..0.00 rows=2 width=64)
Filter: (search_terms.term = query)

Even a query over two foreign tables works correctly:

adam=# create foreign table ft1 (inp text, out text) server test;
adam=# create foreign table ft2 (inp text, out text) server test;

adam=# explain select * from ft1, ft2 where ft1.inp = ‘hello’ and ft2.inp = ft1.out;
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (cost=0.00..500020.00 rows=5000 width=128)
-> Foreign Scan on ft1 (cost=0.00..500000.00 rows=1000 width=64)
Filter: (inp = 'hello'::text)
-> Foreign Scan on ft2 (cost=0.00..0.00 rows=2 width=64)
Filter: (ft1."out" = inp)

But, on a more complex query consisting of multiple foreign tables the planner generates something a little less desirable:

adam=# create foreign table ft3 (inp text, out text) server test;

adam=# explain select * from ft1, ft2, ft3 where ft1.inp = 'hello' and ft2.inp = ft1.out and ft3.inp = ft2.out;
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop (cost=500012.50..1000290.00 rows=25000 width=192)
-> Hash Join (cost=500012.50..1000190.00 rows=5000 width=128)
Hash Cond: (ft1."out" = ft2.inp)
-> Foreign Scan on ft1 (cost=0.00..500000.00 rows=1000 width=64)
Filter: (inp = 'hello'::text)
-> Hash (cost=500000.00..500000.00 rows=1000 width=64)
-> Foreign Scan on ft2 (cost=0.00..500000.00 rows=1000 width=64)
-> Foreign Scan on ft3 (cost=0.00..0.00 rows=2 width=64)
Filter: (ft2."out" = inp)

The high total costs are the result of my attempts to coerce the planner to select the parameterised paths and generate filtered foreign scans rather than preferring unfiltered foreign scans.

I’ve tried adjusting the query planner tuneables (enable_hashjoin, et al) and the path costs with some degree of success, but often the generated plans will filter the tables in the wrong order -- the output column of table 1 will be filtered by the input column of table 2 -- which is technically correct as the operation should be associative and transitive, but in this case, table 2 must be filtered by the output of table 1, not vice-versa.

Is there a way to convince Postgres to always generate nested loop joins without sub merges, hashes or materialisations and in the correct order?

I’m thinking that the current method I’m using to generate the foreign scan path doesn’t take into account the required order and which fields can and cannot be parameterised. I’m not sure how to proceed with this.

Thanks,
Adam

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2013-03-18 05:07:15 Re: Trust intermediate CA for client certificates
Previous Message Craig Ringer 2013-03-18 01:54:34 Re: [HACKERS] Trust intermediate CA for client certificates