Re: Print correct startup cost for the group aggregate.

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Print correct startup cost for the group aggregate.
Date: 2017-03-06 09:16:59
Message-ID: CAGPqQf2SjttOm=9h1VduDRLicq1=qau1fNbp8rEAZCmvEUYSBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Ashutosh & Robert for the explanation.

On Mon, Mar 6, 2017 at 10:02 AM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> On Sat, Mar 4, 2017 at 2:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Thu, Mar 2, 2017 at 6:48 PM, Ashutosh Bapat
> > <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> >> On Thu, Mar 2, 2017 at 6:06 PM, Rushabh Lathia <
> rushabh(dot)lathia(at)gmail(dot)com> wrote:
> >>> While reading through the cost_agg() I found that startup cost for the
> >>> group aggregate is not correctly assigned. Due to this explain plan is
> >>> not printing the correct startup cost.
> >>>
> >>> Without patch:
> >>>
> >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts
> where
> >>> filler like '%foo%' group by aid;
> >>> QUERY PLAN
> >>> ------------------------------------------------------------
> -------------------------
> >>> GroupAggregate (cost=81634.33..85102.04 rows=198155 width=12)
> >>> Group Key: aid
> >>> -> Sort (cost=81634.33..82129.72 rows=198155 width=8)
> >>> Sort Key: aid
> >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89
> rows=198155
> >>> width=8)
> >>> Filter: (filler ~~ '%foo%'::text)
> >>> (6 rows)
> >>>
> >>> With patch:
> >>>
> >>> postgres=# explain select aid, sum(abalance) from pgbench_accounts
> where
> >>> filler like '%foo%' group by aid;
> >>> QUERY PLAN
> >>> ------------------------------------------------------------
> -------------------------
> >>> GroupAggregate (cost=82129.72..85102.04 rows=198155 width=12)
> >>> Group Key: aid
> >>> -> Sort (cost=81634.33..82129.72 rows=198155 width=8)
> >>> Sort Key: aid
> >>> -> Seq Scan on pgbench_accounts (cost=0.00..61487.89
> rows=198155
> >>> width=8)
> >>> Filter: (filler ~~ '%foo%'::text)
> >>> (6 rows)
> >>>
> >>
> >> The reason the reason why startup_cost = input_startup_cost and not
> >> input_total_cost for aggregation by sorting is we don't need the whole
> >> input before the Group/Agg plan can produce the first row. But I think
> >> setting startup_cost = input_startup_cost is also not exactly correct.
> >> Before the plan can produce one row, it has to transit through all the
> >> rows belonging to the group to which the first row belongs. On an
> >> average it has to scan (total number of rows)/(number of groups)
> >> before producing the first aggregated row. startup_cost will be
> >> input_startup_cost + cost to scan (total number of rows)/(number of
> >> groups) rows + cost of transiting over those many rows. Total cost =
> >> startup_cost + cost of scanning and transiting through the remaining
> >> number of input rows.
> >
> > While that idea has some merit, I think it's inconsistent with current
> > practice. cost_seqscan(), for example, doesn't include the cost of
> > reading the first page in the startup cost, even though that certainly
> > must be done before returning the first row.
>
> OTOH, while costing for merge join, initial_cost_mergejoin() seems to
> consider the cost of scanning outer and inner relations upto the first
> matching tuple on both sides in the startup_cost. Different policies
> at different places.
>
> > I think there have been
> > previous discussions of switching over to the practice for which you
> > are advocating here, but my impression (without researching) is that
> > the current practice is more like what Rushabh did.
> >
>
> I am not sure Rushabh's approach is correct. Here's the excerpt from my
> mail.
>
> >> The reason the reason why startup_cost = input_startup_cost and not
> >> input_total_cost for aggregation by sorting is we don't need the whole
> >> input before the Group/Agg plan can produce the first row.
>
>

> With Rushabh's approach the startup cost of aggregation by sorting
> would be way higher that it's right now. Secondly, it would match that
> of hash aggregation and thus forcing hash aggregation to be chosen in
> almost all the cases.
>

I understood you reasoning of why startup_cost = input_startup_cost and not
input_total_cost for aggregation by sorting. But what I didn't understand is
how come higher startup cost for aggregation by sorting would force hash
aggregation to be chosen? I am not clear about this part.

--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

Regards.
Rushabh Lathia
www.EnterpriseDB.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-03-06 09:19:57 Re: dropping partitioned tables without CASCADE
Previous Message Amit Langote 2017-03-06 09:09:32 Re: Logical replication and inheritance