Expr. extended stats are skipped with equality operator

From: Danny Shemesh <dany74q(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Expr. extended stats are skipped with equality operator
Date: 2022-08-05 13:43:36
Message-ID: CAFZC=QqD6=27wQPOW1pbRa98KPyuyn+7cL_Ay_Ck-roZV84vHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey all !

I'm on a quest to help the planner (on pg14) use the best of several
partial, expressional indices we have on some large tables (few TBs in
size, billions of records).

As we know, stats for expressions in partial indices aren't gathered by
default - so I'm tinkering with expressional extended stats to cover for
those.

I've tackled two interesting points there:
1. Seems like expressional stats involving the equality operator are
skipped or mismatched (fiddle
<https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/5379>)
Let's take the following naive example:

*create table t1 (x integer[]);insert into t1 select array[1]::integer[]
from generate_series(1, 100000, 1);create statistics s1 on (x[1] = 1) from
t1;analyze t1;*
*explain analyze select * from t1 where x[1] = 1;*
*> Seq Scan on t1 (cost=0.00..1986.00 rows=500 width=29) (actual
time=0.009..36.035 rows=100000 loops=1)*

Now, of course one can just create the stat on x[1] directly in this case,
but I have a more complex use case where an equality operator is
beneficial;

After debugging it a bit - it seems that the root cause here is that we go
through a flow where we only ever
consider statistics for the lhs of the expression, and not the entire
expression:
clause_selectivity_ext -> restriction_selectivity -> eqsel_internal ->
var_eq_const, where the vardata holds info about x[1].

The case expression goes through a slightly different flow
(clause_selectivity_ext -> boolvarsel -> ...) and is matched on the entire
expression.

I wonder if it would make sense to first check for if there's a valid stat
data on the expression in its entirety before
jumping to the restriction selectivity on the variable itself, as there's
nothing preventing users from defining such an extended statistic.

The below naive implementation works, for instance (clearly, I'm not versed
in the source code, this is for demonstration purposes only):

---
src/backend/optimizer/path/clausesel.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/path/clausesel.c
b/src/backend/optimizer/path/clausesel.c
index 06f836308d..5e03d21dc0 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -871,11 +871,21 @@ clause_selectivity_ext(PlannerInfo *root,
}
else
{
- /* Estimate selectivity for a restriction clause. */
- s1 = restriction_selectivity(root, opno,
- opclause->args,
- opclause->inputcollid,
- varRelid);
+ VariableStatData vardata;
+
+ examine_variable(root, clause, varRelid, &vardata);
+ if (HeapTupleIsValid(vardata.statsTuple))
+ {
+ /* Try estimating selectivity based on the entire
expression first */
+ s1 = boolvarsel(root, clause, varRelid);
+ } else {
+ /* There's no expressional statistic on the restriction
clause - fallback to estimating restriction selectivity for the given node
*/
+ s1 = restriction_selectivity(root, opno,
+ opclause->args,
+ opclause->inputcollid,
+ varRelid);
+ }
+ ReleaseVariableStats(vardata);
}

/*
--

2. Less important, just a minor note - feel free to ignore - although the
eq. operator above seems to be skipped when matching the ext. stats, I can
work around this by using a CASE expression (fiddle
<https://www.db-fiddle.com/f/wJZNH1rNwJSo3D5aByQiWX/1>);
Building on the above example, we can:
*create statistics s2 on (case x[1] when 1 then true else false end) from
t1;*
*explain analyze select * from t1 where (case x[1] when 1 then true else
false end*
*> Seq Scan on t1 (cost=0.00..1986.00 rows=100000 width=25) (actual
time=0.011..33.721 rows=100000 loops=1)*

What's a bit problematic here, though, is that if we mix other dependent
columns to the extended stat, and specifically if we create an mcv,
queries involving the CASE expression throw with `error: unknown clause
type 130`, where clause type == T_CaseExpr.

The second point for me would be that I've found it a bit non intuitive
that creating an extended statistic can fail queries at query time; it
makes sense that the mcv wouldn't work for case expressions, but it
might've been a bit clearer to:

a. Fail this at statistic creation time, potentially, or
b. Convert the type numeric in the above error to its text representation,
if we can extract it out at runtime somehow -
I couldn't find a mapping of clause type numerics to their names, and as
the node tags are generated at compile time, it could be build-dependent
and a bit hard to track down if one doesn't control the build flags

Thanks a ton for your help - appreciate your time,
Danny

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melih Mutlu 2022-08-05 13:55:09 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Previous Message Bharath Rupireddy 2022-08-05 13:02:39 Re: Use fadvise in wal replay