Re: multivariate statistics / patch v7

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-16 11:51:57
Message-ID: 20150716.205157.170797028.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.

The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,

- clause to be estimated by current logic (MV is not applicable)
- clause to be estimated by MV-staistics.
- list of child RestrictStatDatas, which are to be run
recursively.

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying mvstats part. Although I haven't
measured but I'm positive the code is far reduced from yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded asis except for fdep, which is not implememted in this
patch.

What do you think about this?

regards,

> Hi, Thanks for the detailed explaination. I misunderstood the
> code (more honest speaking, din't look so close there). Then I
> looked it closer.
>
>
> At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <559C76D4(dot)2030805(at)2ndquadrant(dot)com>
> > FWIW this was a stupid bug in update_match_bitmap_histogram(), which
> > initially handled only AND clauses, and thus assumed the "match" of a
> > bucket can only decrease. But for OR clauses this is exactly the
> > opposite (we assume no buckets match and add buckets matching at least
> > one of the clauses).
> >
> > With this fixed, the estimates look like this:
> >
>
> > IMHO pretty accurate estimates - no issue with OR clauses.
>
> Ok, I understood the diferrence between what I thought and what
> you say. The code is actually concious of OR clause but is looks
> somewhat confused.
>
> Currently choosing mv stats in clauselist_selectivity can be
> outlined as following,
>
> 1. find_stats finds candidate mv stats containing *all*
> attributes appeared in the whole clauses regardless of and/or
> exprs by walking whole the clause tree.
>
> Perhaps this is the measure to early bailout.
>
> 2.1. Within every disjunction elements, collect mv-related
> attributes while checking whether the all leaf nodes (binop or
> ifnull) are compatible by (eventually) walking whole the
> clause tree.
>
> 2.2. Check if all the collected attribute are contained in
> mv-stats columns.
>
> 3. Finally, clauseset_mv_selectivity_histogram() (and others).
>
> This funciton applies every ExprOp onto every attribute in
> every histogram backes and (tries to) make the boolean
> operation of the result bitmaps.
>
> I have some comments on the implement and I also try to find the
> solution for them.
>
>
> 1. The flow above looks doing very similiar thins repeatedly.
>
> 2. I believe what the current code does can be simplified.
>
> 3. As you mentioned in comments, some additional infrastructure
> needed.
>
> After all, I think what we should do after this are as follows,
> as the first step.
>
> - Add the means to judge the selectivity operator(?) by other
> than oprrest of the op of ExprOp. (You missed neqsel already)
>
> I suppose one solution for this is adding oprmvstats taking
> 'm', 'h' and 'f' and their combinations. Or for the
> convenience, it would be a fixed-length string like this.
>
> oprname | oprmvstats
> = | 'mhf'
> <> | 'mhf'
> < | 'mh-'
> > | 'mh-'
> >= | 'mh-'
> <= | 'mh-'
>
> This would make the code in clause_is_mv_compatible like this.
>
> > oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
> > if (oprmvstats & types)
> > {
> > *attnums = bms_add_member(*attnums, var->varattno);
> > return true;
> > }
> > return false;
>
> - Current design just manage to work but it is too complicated
> and hardly have affinity with the existing estimation
> framework. I proposed separation of finding stats phase and
> calculation phase, but I would like to propose transforming
> RestrictInfo(and finding mvstat) phase and running the
> transformed RestrictInfo phase after looking close to the
> patch.
>
> I think transforing RestrictInfo makes the situnation
> better. Since it nedds different information, maybe it is
> better to have new struct, say, RestrictInfoForEstimate
> (boo!). Then provide mvstatssel() to use in the new struct.
> The rough looking of the code would be like below.
>
> clauselist_selectivity()
> {
> ...
> RestrictInfoForEstmate *esclause =
> transformClauseListForEstimation(root, clauses, varRelid);
> ...
>
> return clause_selectivity(esclause):
> }
>
> clause_selectivity(RestrictInfoForEstmate *esclause)
> {
> if (IsA(clause, RestrictInfo))...
> if (IsA(clause, RestrictInfoForEstimate))
> {
> RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate*) clause;
> if (ecl->selfunc)
> {
> sx = ecl->selfunc(root, ecl);
> }
> }
> if (IsA(clause, Var))...
> }
>
>
> transformClauseListForEstimation(...)
> {
> ...
>
> relid = collect_mvstats_info(root, clause, &attlist);
> if (!relid) return;
> if (get_mvstats_hook)
> mvstats = (*get_mvstats_hoook) (root, relid, attset);
> else
> mvstats = find_mv_stats(root, relid, attset))
> }
> ...
>
> > I've pushed this to github [1] but I need to do some additional
> > fixes. I also had to remove some optimizations while fixing this, and
> > will have to reimplement those.
> >
> > That's not to say that the handling of OR-clauses is perfectly
> > correct. After looking at clauselist_selectivity_or(), I believe it's
> > a bit broken and will need a bunch of fixes, as explained in the
> > FIXMEs I pushed to github.
> >
> > [1] https://github.com/tvondra/postgres/tree/mvstats
>
> I don't see whether it is doable or not, and I suppose you're
> unwilling to change the big picture, so I will consider the idea
> and will show you the result, if it turns out to be possible and
> promising.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-rebase-v7-patch-to-current-master.patch text/x-patch 2.5 KB
0002-PoC-Planner-part-refactoring-of-mv-stats-facility.patch text/x-patch 326.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2015-07-16 11:55:47 Re: TABLESAMPLE patch is really in pretty sad shape
Previous Message Sawada Masahiko 2015-07-16 11:51:09 Re: Freeze avoidance of very large table.