Re: Risk Estimation WAS: Planner hints in Postgresql

From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Risk Estimation WAS: Planner hints in Postgresql
Date: 2014-03-20 15:07:11
Message-ID: CAOeZVifb5TEKiKjxJDsb_LZTk=ufwjuVK99x4qetx5XyLfSVcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
> >> One of the factors that leads to bad estimates is that the histogram of
> the
> >> values of a column maintained by the planner gets old by time and the
> data
> >> in the column changes. So, the histogram is no longer a quite accurate
> view
> >> of the data and it leads to bad selectivity.
> >
> > TBH, this is so far down the list of problems that it'll be a long time
> > before we need to worry about it. It's certainly not the number one
> > priority for any project to model risk in the planner.
> >
> > The thing that I think is probably the number one problem is estimates
> > that depend on an assumption of uniform distribution of sought-after rows
> > among those encountered by a scan. This is usually where bad plans for
> > LIMIT queries are coming from. We could certainly add some sort of fudge
> > factor to those costs, but I'd like to have a more-or-less principled
> > framework for doing so.
>
> I think the problem is, in some sense, more basic than that. I think
> the kind of query we're talking about here is:
>
> SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1
>
> Assume for the sake of argument that there are 100 rows that would be
> returned in the absence of the limit. Let SC and TC be the startup
> cost and total cost of the index scan. As a matter of general policy,
> we're going to say that the cost of this is SC + 0.01 * (TC - SC).
> What makes this path look appealing to the planner is that SC is small
> relative to TC. If we knew, for example, that we weren't going to
> find the first match until 90% of the way through the index scan, then
> we could set SC = 90% * TC and, all else being equal, the planner
> would make the right decision.
>
> So you might think that the problem here is that we're assuming
> uniform density. Let's say there are a million rows in the table, and
> there are 100 that match our criteria, so the first one is going to
> happen 1/10,000'th of the way through the table. Thus we set SC =
> 0.0001 * TC, and that turns out to be an underestimate if the
> distribution isn't as favorable as we're hoping. However, that is NOT
> what we are doing. What we are doing is setting SC = 0. I mean, not
> quite 0, but yeah, effectively 0. Essentially we're assuming that no
> matter how selective the filter condition may be, we assume that it
> will match *the very first row*.
>
>

Cannot we reuse the same histogram we have in the planner right now for
this? I mean, AFAIK, the heuristic we have is that we divide the histogram
into equal size buckets and then find the bucket in which our predicate
value lies, then take some part of that bucket and the rest of the buckets
before that bucket,right?

So, suppose a query is SELECT * FROM table WHERE a > 10, we shall find the
bucket that 10 lies in, right?

Now, why cannot we take the estimate of all the buckets behind the bucket
in which our value is present? Will that estimate not give us the fraction
of tuples that are expected to be before the first matching row?

Its pretty wild, but I wanted to know if my understanding of this scenario
is correct or not.

Regards,

Atri

--
Regards,

Atri
*l'apprenant*

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-03-20 15:21:10 Re: Risk Estimation WAS: Planner hints in Postgresql
Previous Message Tom Lane 2014-03-20 15:02:36 Re: Review: plpgsql.extra_warnings, plpgsql.extra_errors