Re: multivariate statistics / patch v7

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-13 08:51:44
Message-ID: 20150713.175144.45075432.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Thanks for the detailed explaination. I misunderstood the
code (more honest speaking, din't look so close there). Then I
looked it closer.

At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <559C76D4(dot)2030805(at)2ndquadrant(dot)com>
> FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> initially handled only AND clauses, and thus assumed the "match" of a
> bucket can only decrease. But for OR clauses this is exactly the
> opposite (we assume no buckets match and add buckets matching at least
> one of the clauses).
>
> With this fixed, the estimates look like this:
>

> IMHO pretty accurate estimates - no issue with OR clauses.

Ok, I understood the diferrence between what I thought and what
you say. The code is actually concious of OR clause but is looks
somewhat confused.

Currently choosing mv stats in clauselist_selectivity can be
outlined as following,

1. find_stats finds candidate mv stats containing *all*
attributes appeared in the whole clauses regardless of and/or
exprs by walking whole the clause tree.

Perhaps this is the measure to early bailout.

2.1. Within every disjunction elements, collect mv-related
attributes while checking whether the all leaf nodes (binop or
ifnull) are compatible by (eventually) walking whole the
clause tree.

2.2. Check if all the collected attribute are contained in
mv-stats columns.

3. Finally, clauseset_mv_selectivity_histogram() (and others).

This funciton applies every ExprOp onto every attribute in
every histogram backes and (tries to) make the boolean
operation of the result bitmaps.

I have some comments on the implement and I also try to find the
solution for them.

1. The flow above looks doing very similiar thins repeatedly.

2. I believe what the current code does can be simplified.

3. As you mentioned in comments, some additional infrastructure
needed.

After all, I think what we should do after this are as follows,
as the first step.

- Add the means to judge the selectivity operator(?) by other
than oprrest of the op of ExprOp. (You missed neqsel already)

I suppose one solution for this is adding oprmvstats taking
'm', 'h' and 'f' and their combinations. Or for the
convenience, it would be a fixed-length string like this.

oprname | oprmvstats
= | 'mhf'
<> | 'mhf'
< | 'mh-'
> | 'mh-'
>= | 'mh-'
<= | 'mh-'

This would make the code in clause_is_mv_compatible like this.

> oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
> if (oprmvstats & types)
> {
> *attnums = bms_add_member(*attnums, var->varattno);
> return true;
> }
> return false;

- Current design just manage to work but it is too complicated
and hardly have affinity with the existing estimation
framework. I proposed separation of finding stats phase and
calculation phase, but I would like to propose transforming
RestrictInfo(and finding mvstat) phase and running the
transformed RestrictInfo phase after looking close to the
patch.

I think transforing RestrictInfo makes the situnation
better. Since it nedds different information, maybe it is
better to have new struct, say, RestrictInfoForEstimate
(boo!). Then provide mvstatssel() to use in the new struct.
The rough looking of the code would be like below.

clauselist_selectivity()
{
...
RestrictInfoForEstmate *esclause =
transformClauseListForEstimation(root, clauses, varRelid);
...

return clause_selectivity(esclause):
}

clause_selectivity(RestrictInfoForEstmate *esclause)
{
if (IsA(clause, RestrictInfo))...
if (IsA(clause, RestrictInfoForEstimate))
{
RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
if (ecl->selfunc)
{
sx = ecl->selfunc(root, ecl);
}
}
if (IsA(clause, Var))...
}


transformClauseListForEstimation(...)
{
...

relid = collect_mvstats_info(root, clause, &attlist);
if (!relid) return;
if (get_mvstats_hook)
mvstats = (*get_mvstats_hoook) (root, relid, attset);
else
mvstats = find_mv_stats(root, relid, attset))
}
...

> I've pushed this to github [1] but I need to do some additional
> fixes. I also had to remove some optimizations while fixing this, and
> will have to reimplement those.
>
> That's not to say that the handling of OR-clauses is perfectly
> correct. After looking at clauselist_selectivity_or(), I believe it's
> a bit broken and will need a bunch of fixes, as explained in the
> FIXMEs I pushed to github.
>
> [1] https://github.com/tvondra/postgres/tree/mvstats

I don't see whether it is doable or not, and I suppose you're
unwilling to change the big picture, so I will consider the idea
and will show you the result, if it turns out to be possible and
promising.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shulgin, Oleksandr 2015-07-13 09:41:04 Re: [PATCH] Generalized JSON output functions
Previous Message Michael Paquier 2015-07-13 08:46:08 Re: TABLESAMPLE patch is really in pretty sad shape