Re: multivariate statistics (v19)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-02-01 22:52:48
Message-ID: 20170201225248.aajzts2on4ka4ky5@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Still looking at 0002.

pg_ndistinct_in disallows input, claiming that pg_node_tree does the
same thing. But pg_node_tree does it for security reasons: you could
crash the backend if you supplied a malicious value. I don't think that
applies to pg_ndistinct_in. Perhaps it will be useful to inject fake
stats at some point, so why not allow it? It shouldn't be complicated
(though it does require writing some additional code, so perhaps that's
one reason we don't want to allow input of these values).

The comment on top of pg_ndistinct_out is missing the "_out"; also it
talks about histograms, which is not what this is about.

In the same function, a trivial point you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context
-- just PG_RETURN_CSTRING(str.data) and forget about "ret". Saves you
one line.

Nearby, some auxiliary functions such as n_choose_k and num_combinations
are not documented. What it is that they do? I'd move these at the end
of the file, keeping the important entry points at the top of the file.

I see this patch has a estimate_ndistinct() which claims to be a re-
implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does. I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff. (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h. Also, it must not include
postgres.h, and it probably doesn't need most of the #includes it has;
those are better put into whatever include it. It definitely needs a
guarding #ifdef MVSTATS_H around its whole content too. An include file
is not just a way to avoid #includes in other files; it is supposed to
be a minimally invasive way of exporting the structs and functions
implemented in some file into other files. So it must be kept minimal.

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea? I think
it's odd that outputting it as bytea yields something completely
different than as text. (The bytea is not human readable and cannot be
used for future input, so what is the point?)

In another subthread you seem to have surrendered to the opinion that
the new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put on it. However, given
its schema, with a "starelid / stakeys", is it sensible to think that
we're going to get anything other than something that involves multiple
variables? Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema. Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
relid only? Are you thinking that in the future you'd change starelid
into an oidvector column?

The comment in gram.y about the CREATE STATISTICS is at odds with what
is actually allowed by the grammar.

I think the name of a statistics is only useful to DROP/ALTER it, right?
I wonder why it's useful that statistics belongs in a schema. Perhaps
it should be a global object? I suppose the name collisions would
become bothersome if you have many mvstats.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2017-02-01 22:55:16 Re: Vacuum: allow usage of more than 1GB of work mem
Previous Message Tom Lane 2017-02-01 22:19:24 Re: Patch: Write Amplification Reduction Method (WARM)