Re: a wrong index choose when statistics is out of date

From: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a wrong index choose when statistics is out of date
Date: 2024-03-07 08:17:10
Message-ID: 701d2097-2c5b-41e2-8629-734e3c8ba613@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/3/2024 19:56, Andy Fan wrote:
> I think it is OK for a design review, for the implementaion side, the
> known issue includes:
>
> 1. Support grap such infromation from its parent for partitioned table
> if the child doesn't have such information.
> 2. builtin document and testing.
>
> Any feedback is welcome.
Thanks for your efforts.
I was confused when you showed the problem connected to clauses like
"Var op Const" and "Var op Param".
As far as I know, the estimation logic of such clauses uses MCV and
number-distinct statistics. So, being out of MCV values, it becomes
totally insensitive to any internal skew in data and any data outside
the statistics boundaries.
Having studied the example you provided with the patch, I think it is
not a correct example:
Difference between var_eq_const and var_eq_non_const quite obvious:
In the second routine, you don't have information about the const value
and can't use MCV for estimation. Also, you can't exclude MCV values
from the estimation. And it is just luck that you've got the right
answer. I think if you increased the weight of the unknown part, you
would get a bad result, too.
I would like to ask David why the var_eq_const estimator doesn't have an
option for estimation with a histogram. Having that would relieve a
problem with skewed data. Detecting the situation with incoming const
that is out of the covered area would allow us to fall back to ndistinct
estimation or something else. At least, histogram usage can be
restricted by the reltuples value and ratio between the total number of
MCV values and the total number of distinct values in the table.

Just for demo: demonstration of data skew issue:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
SELECT abs(r::integer) AS val
FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;

-- First query is estimated with MCV quite precisely:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100;
-- result: planned rows=25669, actual rows=25139

-- Here we have numdistinct estimation, mostly arbitrary:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200;
-- result: planned rows=8604, actual rows=21239
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500;
-- result: planned rows=8604, actual rows=6748
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600;
-- result: planned rows=8604, actual rows=3501
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700;
-- result: planned rows=8604, actual rows=1705
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000;
-- result: planned rows=8604, actual rows=91

--
regards,
Andrei Lepikhov
Postgres Professional

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-03-07 08:18:11 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Ronan Dunklau 2024-03-07 07:59:02 Re: Provide a pg_truncate_freespacemap function