Re: A wrong index choose issue because of inaccurate statistics

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: A wrong index choose issue because of inaccurate statistics
Date: 2020-06-09 03:33:03
Message-ID: CAKU4AWrRA4oXrORsCvXgy1r3oC_eGZ5DZBR_TBc0W+O9z_fcaw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 5, 2020 at 2:19 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > The one-line fix describes the exact idea in my mind:
> >
> > +++ b/src/backend/optimizer/path/costsize.c
> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
> >
> > cpu_run_cost += cpu_per_tuple * tuples_fetched;
> >
> > + /*
> > +  * To make the planner more robust against inaccurate statistics,
> > +  * we add an extra cost per qpqual so that the fewer qpquals a path
> > +  * has, the lower its cost.
> > +  */
> > + cpu_run_cost += 0.01 * list_length(qpquals);
> > +
> >
> > This change does fix the issue above, but will it make some other cases
> > worse? My answer is no based on my current knowledge, and this is the
> > main point I want advice on. The main impact I can see is that it not
> > only changes the cost difference between (a, b) and (a, c), it also
> > changes the cost difference between an index scan on (a, c) and a seq
> > scan. However, the cost difference between an index scan and a seq scan
> > is huge in practice, so I don't think this impact is harmful.
>
> Didn't that idea already get shot down in the final paragraph on [1]?
>
>
Thanks for chiming in. I take it that I didn't describe my idea clearly
enough, so neither Tomas nor Tom spent much time reading it (no offense
meant, and I understand they have lots of things to do every day), so I
re-summarized the issue to make it easier to read.

In Tomas's reply, he raised concerns about how to fix the estimate itself,
since we don't know how far off it is (1%, 10%, and so on), so I emphasized
that I don't touch that part at all. The wrong estimate still plays a bad
role in later joins, but that is not the issue I am trying to fix here.

> I understand that you wish to increase the cost by some seemingly
> innocent constant to fix your problem case. Here are my thoughts
> about that: Telling lies is not really that easy to pull off. Bad
> liars think it's easy and good ones know it's hard. The problem is
> that the lies can start small, but then at some point the future you
> must fashion some more lies to account for your initial lie. Rinse and
> repeat that a few times and before you know it, your course is set
> well away from the point of truth. I feel the part about "rinse and
> repeat" applies reasonably well to how join costing works. The lie is
> likely to be amplified as the join level gets deeper.
>

I agree with that to some extent. However, we could just provide more
options to users. At the same time, I still believe we should add such
options very carefully.

> Unsure what the exact add_path logic would be. Perhaps a GUC would
> need to assist with the decision there. Likewise, with
> NestedLoopPaths which have a large number of join quals, the risk
> factor could go up a bit with those so that we take a stronger
> consideration for hash or merge joins instead.

I probably underestimated the impact of a large number of join quals.
This looks like a weakness we can't ignore comfortably, so I checked
the code further; maybe we don't need a risk factor for this case.

For the query WHERE a = 1 AND b = 2, both Selectivity(a = 1) and
Selectivity(b = 2) are greater than 0 even when the statistics are quite
stale, so the IndexSelectivity is greater than 0 as well. Based on this,
IndexSelectivity(a, b) should be less than IndexSelectivity(a, c) for the
above query. However, both indexes still end up with the same cost because
of the code below (in genericcostestimate and btcostestimate):

    numIndexTuples = indexSelectivity * index->rel->tuples;
    numIndexTuples = rint(numIndexTuples / num_sa_scans);

    if (numIndexTuples < 1.0)
        numIndexTuples = 1.0;

Later, numIndexTuples is used to calculate the cost. The above code eats
the difference in indexSelectivity.
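
To make that concrete, here is a minimal standalone sketch of the clamping
with made-up numbers (the table size, selectivities, and variable names are
assumptions for illustration, not taken from my test case):

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical numbers, only to show how the clamp hides the difference. */
    int
    main(void)
    {
        double  tuples = 100000.0;      /* stands in for index->rel->tuples */
        double  sel_ab = 0.000002;      /* assumed indexSelectivity for (a, b) */
        double  sel_ac = 0.000008;      /* assumed indexSelectivity for (a, c) */
        double  n_ab = rint(sel_ab * tuples);   /* 0.2 -> 0 */
        double  n_ac = rint(sel_ac * tuples);   /* 0.8 -> 1 */

        if (n_ab < 1.0)
            n_ab = 1.0;
        if (n_ac < 1.0)
            n_ac = 1.0;

        /* Both print 1.0, so the (a, b) index loses its selectivity advantage. */
        printf("numIndexTuples: (a,b)=%.1f  (a,c)=%.1f\n", n_ab, n_ac);
        return 0;
    }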

Many places say we need to "round to integer", but I still haven't figured
out why that is a must. So if we "round to integer" only after the index
cost is calculated, the issue can be fixed as well.
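
A rough sketch of that direction, with invented names and a deliberately
over-simplified cost formula (the real change is in the attached patch
against genericcostestimate/btcostestimate): keep the fractional estimate
while costing, and round only afterwards where an integer count is needed.

    #include <math.h>

    /* Over-simplified: pretend the index CPU cost is just numIndexTuples * per_tuple. */

    /* today: round and clamp first, then cost */
    static double
    index_cost_round_first(double indexSelectivity, double reltuples, double per_tuple)
    {
        double  numIndexTuples = rint(indexSelectivity * reltuples);

        if (numIndexTuples < 1.0)
            numIndexTuples = 1.0;   /* the selectivity difference is lost here */
        return numIndexTuples * per_tuple;
    }

    /*
     * proposed direction: cost the fractional estimate, round only later
     * where an integer tuple count is really needed (not shown here)
     */
    static double
    index_cost_round_last(double indexSelectivity, double reltuples, double per_tuple)
    {
        double  fracIndexTuples = indexSelectivity * reltuples;

        /* a smaller indexSelectivity now always gives a smaller cost */
        return fracIndexTuples * per_tuple;
    }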

Attached is the patch I have in mind. Since this patch may lower the index
cost when numIndexTuples < 1.0, it makes the optimizer prefer a nested loop
over a hash join when the loop count is small, which is a common case in our
regression tests where we don't have much data, so there are several plan
changes like that.

Another impact I found in the regression tests is that the optimizer
chooses an IndexScan on a partial index rather than an IndexOnlyScan on a
normal index. I checked the code and didn't find anything wrong, so I'd say
that is because the data set is too small.

I also tested the TPC-H workload: Query 12 changed from a hash join to a
nested loop, and the execution time went from 1605.118 ms to 699.032 ms.

> the root cause of this situation is because IndexSelectivity = 0.

This was wrong. I got the 0 from elog(INFO, "%f", IndexSelectivity), which
reported 0 even though the value is actually nonzero, just very small.

--
Best Regards
Andy Fan

Attachment Content-Type Size
patched.log application/octet-stream 77.3 KB
normal.log application/octet-stream 75.9 KB
v1-0001-Delay-the-round-to-integer-of-numIndexTuples-so-t.patch application/octet-stream 19.2 KB
