Re: multivariate statistics (v19)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-02-06 21:26:16
Message-ID: 20170206212616.itctsg6x7u6atbtp@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tomas Vondra wrote:
> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:

> > Nearby, some auxiliary functions such as n_choose_k and
> > num_combinations are not documented. What it is that they do? I'd
> > move these at the end of the file, keeping the important entry points
> > at the top of the file.
>
> I'd say n-choose-k is pretty widely known term from combinatorics. The
> comment would essentially say just 'this is n-choose-k' which seems rather
> pointless. So as much as I dislike the self-documenting code, this actually
> seems like a good case of that.

Actually, we do have such comments all over the place. I knew this as
"n sobre k", so the english name doesn't immediately ring a bell with me
until I look it up; I think the function comment could just say
"n_choose_k -- this function returns the binomial coefficient".

> > I see this patch has a estimate_ndistinct() which claims to be a re-
> > implementation of code already in analyze.c, but it is actually a lot
> > simpler than what analyze.c does. I've been wondering if it'd be a good
> > idea to use some of this code so that some routines are moved out of
> > analyze.c; good implementations of statistics-related functions would
> > live in src/backend/statistics/ where they can be used both by analyze.c
> > and your new mvstats stuff. (More generally I am beginning to wonder if
> > the new directory should be just src/backend/statistics.)
>
> I'll look into that. I have to check if I ignored some assumptions or corner
> cases the analyze.c deals with.

Maybe it's not terribly important to refactor analyze.c from the get go,
but let's give the subdir a more general name. Hence my vote for having
the subdir be "statistics" instead of "mvstats".

> > In another subthread you seem to have surrendered to the opinion that
> > the new catalog should be called pg_statistics_ext, just in case in the
> > future we come up with additional things to put on it. However, given
> > its schema, with a "starelid / stakeys", is it sensible to think that
> > we're going to get anything other than something that involves multiple
> > variables? Maybe it should just be "pg_statistics_multivar" and if
> > something else comes along we create another catalog with an appropriate
> > schema. Heck, how does this catalog serve the purpose of cross-table
> > statistics in the first place, given that it has room to record a single
> > relid only? Are you thinking that in the future you'd change starelid
> > into an oidvector column?
>
> Yes, I think the starelid will turn into OID vector. The reason why I
> haven't done that in the current version of the catalog is to keep it
> simple.

OK -- as long as we know what the way forward is, I'm good. Still, my
main point was that even if we have multiple rels, this catalog will be
about having multivariate statistics, and not different kinds of
statistical data. I would keep pg_mv_statistics, really.

> > The comment in gram.y about the CREATE STATISTICS is at odds with what
> > is actually allowed by the grammar.
>
> Which comment?

This one:
* CREATE STATISTICS stats_name ON relname (columns) WITH (options)
the production actually says:
CREATE STATISTICS any_name ON '(' columnList ')' FROM qualified_name

> > I think the name of a statistics is only useful to DROP/ALTER it, right?
> > I wonder why it's useful that statistics belongs in a schema. Perhaps
> > it should be a global object? I suppose the name collisions would
> > become bothersome if you have many mvstats.
>
> I think it shouldn't be a global object. I consider them to be a part of a
> schema (just like indexes, for example). Imagine you have a multi-tenant
> database, with using exactly the same (tables/indexes) schema, but keept in
> different schemas. Why shouldn't it be possible to also use the same set of
> statistics for each tenant?

True. Suggestion withdrawn.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-02-06 21:29:59 Re: Enabling replication connections by default in pg_hba.conf
Previous Message Andres Freund 2017-02-06 21:22:51 Re: Active zombies at AIX