Re: Getting sorted data from foreign server

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Getting sorted data from foreign server
Date: 2015-10-14 11:30:35
Message-ID: CAFjFpRfngjrpCwp7J4=vJqZFKO-vrck+VDea2D3yvWb28m55aw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

PFA the patch with all the comments addressed.

On Tue, Oct 13, 2015 at 10:07 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Oct 13, 2015 at 3:29 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> >> - You consider pushing down ORDER BY if any prefix of the query
> >> pathkeys satisfy is_foreign_expr(), but that doesn't seem right to me.
> >> If the user says SELECT * FROM remotetab ORDER BY a, unsafe(a),
> >> ordering the result set by a doesn't help us much. We've talked a few
> >> times about an incremental sort capability that would take a stream of
> >> tuples already ordered by one or more columns and sort each group by
> >> additional columns, but I don't think we have that currently. Without
> >> that capability, I don't think there's much benefit in sorting by a
> >> prefix of the pathkeys. I suspect that if we can't get them all, it's
> >> not worth doing.
> >
> > I somehow thought, we are using output, which is ordered by prefix of
> > pathkeys in Sort nodes. But as you rightly pointed out that's not the
> case.
> > Only complete pathkeys are useful.
>
> A truncated list of pathkeys is useful for merge joins, but not for
> toplevel ordering.
>

Ok. Taken care in the attached patch.

>
> >> - Right now, you have this code below the point where we bail out if
> >> use_remote_estimate is not set. If we keep it like that, the comment
> >> needs updating. But I suggest that we consider an ordered path even
> >> if we are not using remote estimates. Estimate the sorted path to
> >> cost somewhat more than the unsorted path, so that we only choose that
> >> path if the sort actually benefits us. I don't know exactly how to
> >> come up with a principled estimate given that we don't actually know
> >> whether the remote side will need an extra sort or not, but maybe a
> >> dumb estimate is still better than not trying a sorted path.
> >
> > I like that idea, although there are two questions
> > 1. How can we estimate cost of getting the data sorted? If there is an
> > appropriate index on foreign server we can get the data sorted at no
> extra
> > cost. If there isn't the cost of sorting is proportionate to NlogN where
> N
> > is the size of data. It seems unreasonable to arrive at the cost of
> sorting
> > by multiplying with some constant multiplier. Also, the constant
> multiplier
> > to the NlogN estimate depends heavily upon the properties of foreign
> server
> > e.g. size of memory available for sorting, disk and CPU speed etc. The
> last
> > two might have got factored into fdw_tuple_cost and fdw_startup_cost, so
> > that's probably taken care of. If the estimate we come up turns out to be
> > too pessimistic, we will not get sorted data even if that's the right
> thing
> > to do. If too optimistic, we will incur heavy cost at the time of
> execution.
> > Setting the cost estimate to some constant factor of NlogN would be too
> > pessimistic if there is an appropriate index on foreign server. Otherway
> > round if there isn't an appropriate index on foreign server.
> >
> > Even if we leave these arguments aside for a while, the question remains
> as
> > to what should be the constant factor 10% or 20% or 50% or 100% or
> something
> > else on top of the estimate for simple foreign table scan estimates (or
> > NlogN of that)? I am unable to justify any of these factors myself. What
> do
> > you say?
>
> I think we want to estimate the cost in such a way that we'll tend to
> pick the ordered path if it's useful, but skip it if it's not. So,
> say we pick 10%. That's definitely enough that we won't pick a remote
> sort when it's useless, but it's small enough that if a remote sort is
> useful, we will probably choose to do it. I think that's what we
> want. I believe we should err on the side of a small estimate because
> it's generally better to do as much work as possible on the remote
> side. In some cases the sort may turn out to be free at execution
> time because the remote server was going to generate the results in
> that order anyway, and it may know that because of its own pathkeys,
> and thus be able to skip the explicit ordering step.
>

The patch uses a factor of 1.1 (10% increase) to multiple the startup and
total costs in fpinfo for unsorted data.

This change has caused the plans for few queries in the test postgres_fdw
to change. There is ORDER BY and LIMIT clause in the queries in
postgres_fdw testcase to keep test outputs sane and consistent. These ORDER
BY clauses are not being pushed down the foreign servers. I tried using
values upto 2 for this but still the foreign paths with pathkeys won over
those without pathkeys.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_sort_pd_v2.patch text/x-patch 30.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Иван Фролков 2015-10-14 11:53:00 Re: [HACKERS] Database schema diff
Previous Message Masahiko Sawada 2015-10-14 11:24:01 Re: Support for N synchronous standby servers - take 2