Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-25 21:09:31
Message-ID: 55B3FB0B.7000201@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
> Hi, I'd like to show you the modified constitution of
> multivariate statistics application logic. Please find the
> attached. They apply on your v7 patch.

Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on
github as usual:

https://github.com/tvondra/postgres/commits/mvstats

> The code to find mv-applicable clause is moved out of the main
> flow of clauselist_selectivity. As I said in the previous mail,
> the new function transformRestrictInfoForEstimate (too bad name
> but just for PoC:) scans clauselist and generates
> RestrictStatsData struct which drives mv-aware selectivity
> calculation. This struct isolates MV and non-MV estimation.
>
> The struct RestrictStatData mainly consists of the following
> three parts,
>
> - clause to be estimated by current logic (MV is not applicable)
> - clause to be estimated by MV-staistics.
> - list of child RestrictStatDatas, which are to be run
> recursively.
>
> mvclause_selectivty() is the topmost function where mv stats
> works. This structure effectively prevents main estimation flow
> from being broken by modifying mvstats part. Although I haven't
> measured but I'm positive the code is far reduced from yours.
>
> I attached two patches to this message. The first one is to
> rebase v7 patch to current(maybe) master and the second applies
> the refactoring.
>
> I'm a little anxious about performance but I think this makes the
> process to apply mv-stats far clearer. Regtests for mvstats
> succeeded asis except for fdep, which is not implememted in this
> patch.
>
> What do you think about this?

I'm not sure, at this point. I'm having a hard time understanding how
exactly the code works - there are pretty much no comments explaining
the implementation, so it takes time to understand the code. This is
especially true about transformRestrictInfoForEstimate which is also
quite long. I understand it's a PoC, but comments would really help.

On a conceptual level, I think the idea to split the estimation into two
phases - enrich the expression tree with nodes with details about stats
etc, and then actually do the estimation in the second phase might be
interesting. Not because it's somehow clearer, but because it gives us a
chance to see the expression tree as a whole, with details about all the
stats (with the current code we process/estimate the tree
incrementally). But I don't really know how useful that would be.

I don't think the proposed change makes the process somehow clearer. I
know it's a PoC at this point, so I don't expect it to be perfect, but
for me the original code is certainly clearer. Of course, I'm biased as
I wrote the current code, and I (naturally) shaped it to match my ideas
during the development process, and I'm much more familiar with it.

Omitting the support for functional dependencies is a bit unfortunate, I
think. Is that merely to make the PoC simpler, or is there something
that makes it impossible to support that kind of stats?

Another thing that I noticed is that you completely removed the code
that combined multiple stats (and selected the best combination of
stats). In other words, you've reverted to the intermediate single
statistics approach, including removing the improved handling of OR
clauses and conditions. It's a bit difficult to judge the proposed
approach not knowing how well it supports those (quite crucial)
features. What if it can't support some them., or what if it makes the
code much more complicated (thus defeating the goal of making it more
clear)?

I share your concern about the performance impact - one thing is that
this new code might be slower than the original one, but a more serious
issue IMHO is that the performance impact will happen even for relations
with no multivariate stats at all. The original patch was very careful
about getting ~0% overhead in such cases, and if the new code does not
allow that, I don't see this approach as acceptable. We must not put
additional overhead on people not using multivariate stats.

But I think it's worth exploring this idea a bit more - can you rebase
it to the current patch version (as on github) and adding the missing
pieces (functional dependencies, multi-statistics estimation and passing
conditions)?

One more thing - I noticed you extended the pg_operator catalog with a
oprmvstat attribute, used to flag operators that are compatible with
multivariate stats. I'm not happy with the current approach (using
oprrest to do this decision), but I'm not really sure this is a good
solution either. The culprit is that it only answers one of the two
important questions - Is it compatible? How to perform the estimation?

So we'd have to rely on oprrest anyway, when actually performing the
estimation of a clause with "compatible" operator. And we'd have to keep
in sync two places (catalog and checks in file), and we'd have to update
the catalog after improving the implementation (adding support for
another operator).

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2015-07-25 22:26:55 markup problems in row_security GUC docs
Previous Message Andrew Dunstan 2015-07-25 20:55:32 Re: Debugging buildfarm pg_upgrade check failures