Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v19)
Date: 2017-02-02 08:59:13
Message-ID: 90fd1100-1886-cb4b-10ee-c556ac6c3d12@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> Still looking at 0002.
>
> pg_ndistinct_in disallows input, claiming that pg_node_tree does the
> same thing. But pg_node_tree does it for security reasons: you could
> crash the backend if you supplied a malicious value. I don't think
> that applies to pg_ndistinct_in. Perhaps it will be useful to inject
> fake stats at some point, so why not allow it? It shouldn't be
> complicated (though it does require writing some additional code, so
> perhaps that's one reason we don't want to allow input of these
> values).
>

Yes, I haven't written the code, and I'm not sure it's a very practical
way to inject custom statistics. But if we decide to allow that in the
future, we can probably add the code.

There's a subtle difference between pg_node_tree and the data types for
statistics - pg_node_tree stores the value as a string (matching the
nodeToString output), so the _in function is fairly simple. Of course,
stringToNode() assumes safe input, which is why the input is disabled.

OTOH the statistics are stored in an optimized binary format, allowing
to use the value directly (without having to do expensive parsing etc).

I was thinking that the easiest way to add support for _in would be to
add a bunch of Nodes for the statistics, along with in/out functions,
but keeping the internal binary representation. But that'll be tricky to
do in a safe way - even if those nodes are coded in a very defensive
ways, I'd bet there'll be ways to inject unsafe nodes.

So I'm OK with not having the _in for now. If needed, it's possible to
construct the statistics as a bytea using a bit of C code. That's at
least obviously unsafe, as anything written in C, touching the memory.

> The comment on top of pg_ndistinct_out is missing the "_out"; also it
> talks about histograms, which is not what this is about.
>

OK, will fix.

> In the same function, a trivial point you don't need to pstrdup() the
> .data out of a stringinfo; it's already palloc'ed in the right context
> -- just PG_RETURN_CSTRING(str.data) and forget about "ret". Saves you
> one line.
>

Will fix too.

> Nearby, some auxiliary functions such as n_choose_k and
> num_combinations are not documented. What it is that they do? I'd
> move these at the end of the file, keeping the important entry points
> at the top of the file.

I'd say n-choose-k is pretty widely known term from combinatorics. The
comment would essentially say just 'this is n-choose-k' which seems
rather pointless. So as much as I dislike the self-documenting code,
this actually seems like a good case of that.

> I see this patch has a estimate_ndistinct() which claims to be a re-
> implementation of code already in analyze.c, but it is actually a lot
> simpler than what analyze.c does. I've been wondering if it'd be a good
> idea to use some of this code so that some routines are moved out of
> analyze.c; good implementations of statistics-related functions would
> live in src/backend/statistics/ where they can be used both by analyze.c
> and your new mvstats stuff. (More generally I am beginning to wonder if
> the new directory should be just src/backend/statistics.)
>

I'll look into that. I have to check if I ignored some assumptions or
corner cases the analyze.c deals with.

> common.h does not belong in src/backend/utils/mvstats; IMO it should be
> called src/include/utils/mvstat.h. Also, it must not include
> postgres.h, and it probably doesn't need most of the #includes it has;
> those are better put into whatever include it. It definitely needs a
> guarding #ifdef MVSTATS_H around its whole content too. An include file
> is not just a way to avoid #includes in other files; it is supposed to
> be a minimally invasive way of exporting the structs and functions
> implemented in some file into other files. So it must be kept minimal.
>

Will do.

> psql/tab-complete.c compares the wrong version number (9.6 instead of
> 10).
>
> Is it important to have a cast from pg_ndistinct to bytea? I think
> it's odd that outputting it as bytea yields something completely
> different than as text. (The bytea is not human readable and cannot be
> used for future input, so what is the point?)
>

Because it internally is a bytea, and it seems useful to have the
ability to inspect the bytea value directly (e.g. to see the length of
the bytea and not the string output).

>
> In another subthread you seem to have surrendered to the opinion that
> the new catalog should be called pg_statistics_ext, just in case in the
> future we come up with additional things to put on it. However, given
> its schema, with a "starelid / stakeys", is it sensible to think that
> we're going to get anything other than something that involves multiple
> variables? Maybe it should just be "pg_statistics_multivar" and if
> something else comes along we create another catalog with an appropriate
> schema. Heck, how does this catalog serve the purpose of cross-table
> statistics in the first place, given that it has room to record a single
> relid only? Are you thinking that in the future you'd change starelid
> into an oidvector column?
>

Yes, I think the starelid will turn into OID vector. The reason why I
haven't done that in the current version of the catalog is to keep it
simple. Supporting join statistics will require tracking OID for each
attribute, because those will be from multiple relations. It'll also
require tracking "join condition" and so on.

We've designed the CREATED STATISTICS syntax to support this extension,
but I'm strongly against complicating the catalogs at this point.

> The comment in gram.y about the CREATE STATISTICS is at odds with what
> is actually allowed by the grammar.
>

Which comment?

> I think the name of a statistics is only useful to DROP/ALTER it, right?
> I wonder why it's useful that statistics belongs in a schema. Perhaps
> it should be a global object? I suppose the name collisions would
> become bothersome if you have many mvstats.
>

I think it shouldn't be a global object. I consider them to be a part of
a schema (just like indexes, for example). Imagine you have a
multi-tenant database, with using exactly the same (tables/indexes)
schema, but keept in different schemas. Why shouldn't it be possible to
also use the same set of statistics for each tenant?

T.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2017-02-02 10:17:21 Re: Superowners
Previous Message Heikki Linnakangas 2017-02-02 08:45:17 Re: pg_authid.rolpassword format (was Re: Password identifiers, protocol aging and SCRAM protocol)