Re: multivariate statistics (v19)

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-01-26 09:43:57
Message-ID: CAFiTN-sdJmUYiARd0qBQKBXqs73096qD0Hd1QrNs=JQ3F_0QGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 5, 2017 at 3:27 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Thanks. Those plans match my experiments with the TPC-H data set, although
> I've been playing with the smallest scale (1GB).
>
> It's not very difficult to make the estimation error arbitrary large, e.g.
> by using perfectly correlated (identical) columns.

I have done an initial review for ndistint and histogram patches,
there are few review comments.

ndistinct
---------
1. Duplicate statistics:
postgres=# create statistics s with (ndistinct) on (a,c) from t;
2017-01-07 16:21:54.575 IST [63817] ERROR: duplicate key value
violates unique constraint "pg_mv_statistic_name_index"
2017-01-07 16:21:54.575 IST [63817] DETAIL: Key (staname,
stanamespace)=(s, 2200) already exists.
2017-01-07 16:21:54.575 IST [63817] STATEMENT: create statistics s
with (ndistinct) on (a,c) from t;
ERROR: duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL: Key (staname, stanamespace)=(s, 2200) already exists.

For duplicate statistics, I think we can check the existence of the
statistics and give more meaningful error code something statistics
"s" already exist.

2. Typo
+ /*
+ * Sort the attnums, which makes detecting duplicies somewhat
+ * easier, and it does not hurt (it does not affect the efficiency,
+ * onlike for indexes, for example).
+ */
/onlike/unlike

3. Typo
/*
* Find attnims of MV stats using the mvoid.
*/
int2vector *
find_mv_attnums(Oid mvoid, Oid *relid)

/attnims/attnums

histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.

+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+ int2vector *stakeys,
+ MVSerializedHistogram mvhist,
+ int nmatches, char *matches,
+ bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-01-26 09:49:51 Re: Transactions involving multiple postgres foreign servers
Previous Message Kyotaro HORIGUCHI 2017-01-26 09:12:30 Re: Floating point comparison inconsistencies of the geometric types