Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-08-10 23:04:56
Message-ID: 5e97a976-c8a1-f420-4088-58df178505dc@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/9/21 9:19 PM, Mark Dilger wrote:
>
>
>> On Jul 20, 2021, at 11:28 AM, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>> <0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch>
>
> Hi Tomas,
>
> I tested this patch against master looking for types of clauses that uniformly get worse with the patch applied. I found some.
>
> The tests are too large to attach, but the scripts that generate them are not. To perform the tests:
>
> git checkout master
> perl ./gentest.pl > src/test/regress/sql/gentest.sql
> cat /dev/null > src/test/regress/expected/gentest.out
> echo "test: gentest" >> src/test/regress/parallel_schedule
> ./configure && make && make check
> cp src/test/regress/results/gentest.out src/test/regress/expected/gentest.out
> patch -p 1 < 0001-Handling-Expr-op-Expr-clauses-in-extended-stats-20210720.patch
> make check
> cat src/test/regress/regression.diffs | perl ./check.pl
>
> This shows patterns of conditions that get worse, such as:
>
> better:0, worse:80: A < B and A <> A or not A < A
> better:0, worse:80: A < B and not A <= A or A <= A
> better:0, worse:80: A < B or A = A
> better:0, worse:80: A < B or A = A or not A >= A
> better:0, worse:80: A < B or A >= A
> better:0, worse:80: A < B or A >= A and not A <> A
> better:0, worse:80: A < B or not A < A
> better:0, worse:80: A < B or not A <> A
> better:0, worse:80: A < B or not A <> A or A <= A
> better:0, worse:80: A < B or not A >= A or not A < A
>
> It seems things get worse when the conditions contain a column compared against itself. I suspect that is being handled incorrectly.
>

Thanks for this testing!

I took a quick look, and I think this is mostly due to luck in how the
(default) range estimates combine without and with extended statistics.
Consider for example this simple example:

create table t (a int, b int);

insert into t select mod(i,10), mod(i,20)
from generate_series(1,1000000) s(i);

Without stats, the first clauses example is estimated like this:

explain (timing off, analyze) select * from t
where (A < B and A <> A) or not A < A;

QUERY PLAN
----------------------------------------------------------
Seq Scan on t (cost=0.00..21925.00 rows=554444 width=8)
(actual rows=1000000 loops=1)
Filter: (((a < b) AND (a <> a)) OR (a >= a))
Planning Time: 0.054 ms
Execution Time: 80.485 ms
(4 rows)

and with MCV on (a,b) it gets estimates like this:

QUERY PLAN

----------------------------------------------------------
Seq Scan on t (cost=0.00..21925.00 rows=333333 width=8)
(actual rows=1000000 loops=1)
Filter: (((a < b) AND (a <> a)) OR (a >= a))
Planning Time: 0.152 ms
Execution Time: 79.917 ms
(4 rows)

So with the statistics, the estimate gets a bit worse. The reason is
fairly simple - if you look at the two parts of the OR clause, we get this:

clause actual no stats with stats
---------------------------------------------------------------
(A < B and A <> A) 0 331667 1
not (A < A) 1000000 333333 333333

This clearly shows that the first clause is clearly improved, while the
(A < A) is estimated the same way, because the clause has a single Var
so it's considered to be "simple" so we ignore the MCV selectivity and
just use the simple_sel calculated by clause_selectivity_ext.

And the 333333 and 331667 just happen to be closer to the actual row
count. But that's mostly by luck, clearly.

But now that I think about it, maybe the problem really is in how
statext_mcv_clauselist_selectivity treats this clause - the definition
of "simple" clauses as "has one attnum" was appropriate when only
clauses (Var op Const) were supported. But with (Var op Var) that's
probably not correct anymore.

And indeed, commenting out the if condition on line 1933 (so ignoring
simple_sel) and that does improve the estimates for this query. But
perhaps I'm missing something, this needs more thought.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-08-10 23:59:02 Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?
Previous Message Alvaro Herrera 2021-08-10 23:00:09 Re: Autovacuum on partitioned table (autoanalyze)