Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com
Cc: dilipbalaut(at)gmail(dot)com, Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp, dean(dot)a(dot)rasheed(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, ishii(at)postgresql(dot)org, david(at)pgmasters(dot)net, michael(dot)paquier(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, petr(at)2ndquadrant(dot)com, jeff(dot)janes(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-01-30 19:12:35
Message-ID: 1fa982dc-ffb4-eef7-eb55-d8cf7d84f1c4@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi everyone,

thanks for the reviews! Attached is v23 of the patch series, addressing
most of the points raised in the reviews.

A quick summary of the changes (I'll respond to the other threads for
points that deserve a bit more detailed discussion):

0) Rebase to current master. The main culprit was the pesky logical
replication patch committed a week ago, because SUBSCRIPTION and
STATISTICS are right next to each other in gram.y, various switches etc.

1) Many typos, mentioned by all the reviewers.

2) I've added a short explanation (in alter_table.sgml) of how ALTER
TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those
are only dropped if there would be a single remaining column.

3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to
make Petr happy ;-)

4) Added missing comments to get_statistics_oid, RelationGetMVStatList,
update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was
not used outside common.c, so I've made it static and removed the
prototype from mvstats.h.

5) I've changed 'statistics does not exist' to 'statistics do not exist'
on a number of places.

6) Removed XXX about checking for duplicates in CreateStatistics. I
agree with Petr that we shouldn't do such checks, as we're not doing
that for other objects (e.g. indexes).

7) I've moved moved the code loading statistics from get_relation_info
into a new function get_relation_statistics, to get rid of the

if (true)
{
...
}

block, which was there due to mimicking how index details are loaded
without having hasindex-like flag. I like this better than merging the
block into get_relation_info directly.

8) I've changed 'a statistics' to 'multivariate statistics' on a few
places in sgml docs, to make it clear it's not referring to the
'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by
Ideriha Takeshi).

9) I've changed the link in README.dependencies to
https://en.wikipedia.org/wiki/Functional_dependency as proposed by
Ideriha Takeshi. I'm pretty sure the wiki page about database
normalization, referenced by the original link, included a nice
functional dependency example some time ago, but it seems to have
changed and the new link is better.

But perhaps it's not a good idea to link to wikipedia, as the pages
clearly change quite significantly?

10) The CREATE STATISTICS now reports a nice 'already exists' message,
instead of the 'duplicate key', pointed out by Dilip.

11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for
the array, just like the other structs.

On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote:
> dependencies.c:
>
> dependency_dgree():
>
> - The k is assumed larger than 1. I think assertion is required.
>
> - "/* end of the preceding group */" seems to be better if it
> is just after the "if (multi_sort.." currently just after it.
>
> - The following comment seems mis-edited.
> > * If there is a single are no contradicting rows, count the group
> > * as supporting, otherwise contradicting.
>
> maybe this would be like the following? The varialbe counting
> the first "contradiction" is named "n_violations". This seems
> somewhat confusing.
>
> > * If there are no violating rows up to here, count the group
> > * as supporting, otherwise contradicting.
>
> - "/* first columns match, but the last one does not"
> else if (multi_sort_compare_dims((k - 1), (k - 1), ...
>
> The above comparison should use multi_sort_compare_dim, not
> dims
>
> - This function counts "n_contradicting_rows" but it is not
> referenced. Anyway n_contradicting_rows = numrows -
> n_supporing_rows so it and n_contradicting seem
> unncecessary.
>

Yes, absolutely. This was clearly unnecessary remainder of the original
implementation, and I failed to clean it up after adopting Dean's idea
of continuous dependency degree.

I've also reworked the method a bit, moving handling of the last group
into the main loop (instead of doing that separately right after the
loop, which I think was a bit ugly anyway). Can you check if you're
happy with the code & comments now?

>
> mvstats.h:
>
> - struct MVDependencyData/ MVDependenciesData
>
> The varialbe length member at the last of the structs should
> be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.
>

Yes, fixed. The other structures already used that macro, but I failed
to notice MVDependencyData/ MVDependenciesData need that fix too.

>
> - I'm not sure how much it impacts performance, but some
> struct members seems to have a bit too wide types. For
> example, MVDepedenciesData.type is of int32 but it can have
> only '1' for now and it won't be two-digits. Also ndeps
> cannot be so large.
>

I doubt the impact on performance is measurable, particularly for the
global fields (e.g. nbuckets is tiny compared to the space needed for
the buckets themselves).

But I think you're right we shouldn't use fields wider than actually
needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would
be just fine). It's not just a matter of performance, but also a way to
document expected values etc.

I'll go through the fields and use smaller data types where appropriate.

>
> general:
> This patch uses int16 as the type of attrubute number but it
> might be better to use AttrNumber for the purpose.
> (Specifically it seems defined as the type for an attribute
> index but also used as the varialbe for number of attributes)
>

Agreed. Will check with the struct members.

>
> Sorry for the random comment in advance. I'll learn this further.
>

Thanks for the review!

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-teach-pull_-varno-varattno-_walker-about-Restric-v23.patch binary/octet-stream 1.4 KB
0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v23.patch binary/octet-stream 141.9 KB
0003-PATCH-functional-dependencies-only-the-ANALYZE-p-v23.patch binary/octet-stream 67.0 KB
0004-PATCH-selectivity-estimation-using-functional-de-v23.patch binary/octet-stream 46.5 KB
0005-PATCH-multivariate-MCV-lists-v23.patch binary/octet-stream 122.3 KB
0006-PATCH-multivariate-histograms-v23.patch binary/octet-stream 149.7 KB
0007-WIP-use-ndistinct-for-selectivity-estimation-in--v23.patch binary/octet-stream 14.6 KB
0008-WIP-allow-using-multiple-statistics-in-clauselis-v23.patch binary/octet-stream 9.3 KB
0009-WIP-psql-tab-completion-basics-v23.patch binary/octet-stream 3.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-01-30 19:16:40 Re: VACUUM's ancillary tasks
Previous Message Nikhil Sontakke 2017-01-30 18:45:33 Re: Speedup twophase transactions