Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com
Cc: dilipbalaut(at)gmail(dot)com, Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, dean(dot)a(dot)rasheed(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, ishii(at)postgresql(dot)org, david(at)pgmasters(dot)net, michael(dot)paquier(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-01-30 19:12:35
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Hi everyone,

thanks for the reviews! Attached is v23 of the patch series, addressing
most of the points raised in the reviews.

A quick summary of the changes (I'll respond to the other threads for
points that deserve a bit more detailed discussion):

0) Rebase to current master. The main culprit was the pesky logical
replication patch committed a week ago, because SUBSCRIPTION and
STATISTICS are right next to each other in gram.y, various switches etc.

1) Many typos, mentioned by all the reviewers.

2) I've added a short explanation (in alter_table.sgml) of how ALTER
TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those
are only dropped if there would be a single remaining column.

3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to
make Petr happy ;-)

4) Added missing comments to get_statistics_oid, RelationGetMVStatList,
update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was
not used outside common.c, so I've made it static and removed the
prototype from mvstats.h.

5) I've changed 'statistics does not exist' to 'statistics do not exist'
on a number of places.

6) Removed XXX about checking for duplicates in CreateStatistics. I
agree with Petr that we shouldn't do such checks, as we're not doing
that for other objects (e.g. indexes).

7) I've moved moved the code loading statistics from get_relation_info
into a new function get_relation_statistics, to get rid of the

if (true)

block, which was there due to mimicking how index details are loaded
without having hasindex-like flag. I like this better than merging the
block into get_relation_info directly.

8) I've changed 'a statistics' to 'multivariate statistics' on a few
places in sgml docs, to make it clear it's not referring to the
'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by
Ideriha Takeshi).

9) I've changed the link in README.dependencies to as proposed by
Ideriha Takeshi. I'm pretty sure the wiki page about database
normalization, referenced by the original link, included a nice
functional dependency example some time ago, but it seems to have
changed and the new link is better.

But perhaps it's not a good idea to link to wikipedia, as the pages
clearly change quite significantly?

10) The CREATE STATISTICS now reports a nice 'already exists' message,
instead of the 'duplicate key', pointed out by Dilip.

11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for
the array, just like the other structs.

On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote:
> dependencies.c:
> dependency_dgree():
> - The k is assumed larger than 1. I think assertion is required.
> - "/* end of the preceding group */" seems to be better if it
> is just after the "if (multi_sort.." currently just after it.
> - The following comment seems mis-edited.
> > * If there is a single are no contradicting rows, count the group
> > * as supporting, otherwise contradicting.
> maybe this would be like the following? The varialbe counting
> the first "contradiction" is named "n_violations". This seems
> somewhat confusing.
> > * If there are no violating rows up to here, count the group
> > * as supporting, otherwise contradicting.
> - "/* first columns match, but the last one does not"
> else if (multi_sort_compare_dims((k - 1), (k - 1), ...
> The above comparison should use multi_sort_compare_dim, not
> dims
> - This function counts "n_contradicting_rows" but it is not
> referenced. Anyway n_contradicting_rows = numrows -
> n_supporing_rows so it and n_contradicting seem
> unncecessary.

Yes, absolutely. This was clearly unnecessary remainder of the original
implementation, and I failed to clean it up after adopting Dean's idea
of continuous dependency degree.

I've also reworked the method a bit, moving handling of the last group
into the main loop (instead of doing that separately right after the
loop, which I think was a bit ugly anyway). Can you check if you're
happy with the code & comments now?

> mvstats.h:
> - struct MVDependencyData/ MVDependenciesData
> The varialbe length member at the last of the structs should
> be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.

Yes, fixed. The other structures already used that macro, but I failed
to notice MVDependencyData/ MVDependenciesData need that fix too.

> - I'm not sure how much it impacts performance, but some
> struct members seems to have a bit too wide types. For
> example, MVDepedenciesData.type is of int32 but it can have
> only '1' for now and it won't be two-digits. Also ndeps
> cannot be so large.

I doubt the impact on performance is measurable, particularly for the
global fields (e.g. nbuckets is tiny compared to the space needed for
the buckets themselves).

But I think you're right we shouldn't use fields wider than actually
needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would
be just fine). It's not just a matter of performance, but also a way to
document expected values etc.

I'll go through the fields and use smaller data types where appropriate.

> general:
> This patch uses int16 as the type of attrubute number but it
> might be better to use AttrNumber for the purpose.
> (Specifically it seems defined as the type for an attribute
> index but also used as the varialbe for number of attributes)

Agreed. Will check with the struct members.

> Sorry for the random comment in advance. I'll learn this further.

Thanks for the review!


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-teach-pull_-varno-varattno-_walker-about-Restric-v23.patch binary/octet-stream 1.4 KB
0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v23.patch binary/octet-stream 141.9 KB
0003-PATCH-functional-dependencies-only-the-ANALYZE-p-v23.patch binary/octet-stream 67.0 KB
0004-PATCH-selectivity-estimation-using-functional-de-v23.patch binary/octet-stream 46.5 KB
0005-PATCH-multivariate-MCV-lists-v23.patch binary/octet-stream 122.3 KB
0006-PATCH-multivariate-histograms-v23.patch binary/octet-stream 149.7 KB
0007-WIP-use-ndistinct-for-selectivity-estimation-in--v23.patch binary/octet-stream 14.6 KB
0008-WIP-allow-using-multiple-statistics-in-clauselis-v23.patch binary/octet-stream 9.3 KB
0009-WIP-psql-tab-completion-basics-v23.patch binary/octet-stream 3.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-01-30 19:16:40 Re: VACUUM's ancillary tasks
Previous Message Nikhil Sontakke 2017-01-30 18:45:33 Re: Speedup twophase transactions