Re: A wrong index choose issue because of inaccurate statistics

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: A wrong index choose issue because of inaccurate statistics
Date: 2020-06-05 06:30:05
Message-ID: CAFj8pRCmSyq53YoB_1sn+eD6VPq2hpE10A-mHg2osbNCuGvznQ@mail.gmail.com
Lists: pgsql-hackers

pá 5. 6. 2020 v 8:19 odesílatel David Rowley <dgrowleyml(at)gmail(dot)com> napsal:

> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > The one-line fix describes the exact idea in my mind:
> >
> > +++ b/src/backend/optimizer/path/costsize.c
> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
> >
> >      cpu_run_cost += cpu_per_tuple * tuples_fetched;
> >
> > +    /*
> > +     * To make the planner more robust against inaccurate statistics,
> > +     * add a small extra cost per qpqual, so that a path with fewer
> > +     * qpquals gets a lower cost.
> > +     */
> > +    cpu_run_cost += 0.01 * list_length(qpquals);
> > +
> >
> > This change does fix the issue above, but will it make some other cases
> > worse? My answer is no, based on my current knowledge, and this is the
> > most important point I want advice on. The main impact I can see is: it
> > not only changes the cost difference between (a, b) and (a, c), it also
> > changes the cost difference between an index scan on (a, c) and a seq
> > scan. However, the cost difference between an index scan and a seq scan
> > is huge in practice, so I don't think this impact is harmful.
>
> Didn't that idea already get shot down in the final paragraph on [1]?
>
> I understand that you wish to increase the cost by some seemingly
> innocent constant to fix your problem case. Here are my thoughts
> about that: Telling lies is not really that easy to pull off. Bad
> liars think it's easy and good ones know it's hard. The problem is
> that the lies can start small, but then at some point the future you
> must fashion some more lies to account for your initial lie. Rinse and
> repeat that a few times and before you know it, your course is set
> well away from the point of truth. I feel the part about "rinse and
> repeat" applies reasonably well to how join costing works. The lie is
> likely to be amplified as the join level gets deeper.
>
> I think you need to think of a more generic solution and propose that
> instead. There are plenty of other quirks in the planner that can
> cause suffering due to inaccurate or non-existing statistics. For
> example, due to how we multiply individual selectivity estimates,
> having a few correlated columns in a join condition can cause the
> number of join rows to be underestimated. Subsequent joins can then
> end up making bad choices on which join operator to use based on those
> inaccurate row estimates. There's also a problem with WHERE <x> ORDER
> BY col LIMIT n; sometimes choosing an index that provides pre-sorted
> input to the ORDER BY but cannot use <x> as an indexable condition.
> We don't record any stats to make better choices there, maybe we
> should, but either way, we're taking a big risk there as all the rows
> matching <x> might be right at the end of the index and we might need
> to scan the entire thing before hitting the LIMIT. For now, we just
> assume completely even distribution of rows. i.e. If there are 50 rows
> estimated in the path and the limit is for 5 rows, then we'll assume
> we need to read 10% of those before finding all the ones we need. In
> reality, everything matching <x> might be 95% through the index and we
> could end up reading 100% of rows. That particular problem is not just
> caused by the uneven distribution of rows in the index, but also from
> selectivity underestimation.
>
> I'd more recently wondered if we shouldn't have some sort of "risk"
> factor to the cost model. I really don't have ideas on how exactly we
> would calculate the risk factor in all cases, but in this case, say
> the risk factor always starts out as 1. We could multiply that risk
> factor by some >1 const each time we find another index filter qual.
> add_path() can prefer lower risk paths when the costs are similar.
> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there. Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.
>
>
I thought about these ideas too. And I am not alone.

https://hal.archives-ouvertes.fr/hal-01316823/document

Regards

Pavel

> Anyway, it's pretty much a large research subject which would take a
> lot of work to iron out even just the design. It's likely not a
> perfect idea, but I think it has a bit more merit than trying to
> introduce lies to the cost model to account for a single case where
> there is a problem.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
>
>
>
