Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-01-30 20:00:25
Message-ID: 92306c1a-6a81-a5e9-6283-e5438f455f16@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/30/2017 05:12 PM, Alvaro Herrera wrote:
>
> Hmm. So we have a catalog pg_mv_statistics which stores two things:
> 1. the configuration regarding mvstats that have been requested by user
> via CREATE/ALTER STATISTICS
> 2. the actual values captured from the above, via ANALYZE
>
> I think this conflates two things that really are separate, given their
> different timings and usage patterns. This decision is causing the
> catalog to have columns enabled/built flags for each set of stats
> requested, which looks a bit odd. In particular, the fact that you have
> to heap_update the catalog in order to add more stuff as it's built
> looks inconvenient.
>
> Have you thought about having the "requested" bits be separate from the
> actual computed values? Something like
>
> pg_mv_statistics
> starelid
> staname
> stanamespace
> staowner -- all the above as currently
> staenabled array of "char" {d,f,s}
> stakeys
> // no CATALOG_VARLEN here
>
> where each char in the staenabled array has a #define and indicates one
> type, "ndistinct", "functional dep", "selectivity" etc.
>
> The actual values computed by ANALYZE would live in a catalog like:
>
> pg_mv_statistics_values
> stvstaid -- OID of the corresponding pg_mv_statistics row. Needed?

Definitely needed. How else would you know which MCV list and histogram
belong together? This works just like in pg_statistic - when both MCV
and histograms are enabled for the statistic, we first build MCV list,
then histogram on remaining rows. So we need to pair them.

> stvrelid -- same as starelid
> stvkeys -- same as stakeys
> #ifdef CATALOG_VARLEN
> stvkind 'd' or 'f' or 's', etc
> stvvalue the bytea blob
> #endif
>
> I think that would be simpler, both conceptually and in terms of code.

I think the main issue here is that it throws away the special data
types (pg_histogram, pg_mcv, pg_ndistinct, pg_dependencies), which I
think is a neat idea and would like to keep it. This would throw that
away, making everything bytea again. I don't like that.

>
> The other angle to consider is planner-side: how does the planner gets
> to the values? I think as far as the planner goes, the first catalog
> doesn't matter at all, because a statistics type that has been enabled
> but not computed is not interesting at all; planner only cares about the
> values in the second catalog (this is why I added stvkeys). Currently
> you're just caching a single pg_mv_statistics row in get_relation_info
> (and only if any of the "built" flags is set), which is simple. With my
> proposed change, you'd need to keep multiple pg_mv_statistics_values
> rows.
>
> But maybe you already tried something like what I propose and there's a
> reason not to do it?
>

Honestly, I don't see how this improves the situation. We still need to
cache data for exactly one catalog, so how is that simpler?

The way I see it, it actually makes things more complicated, because now
we have two catalogs to manage instead of one (e.g. when doing DROP
STATISTICS, or after ALTER TABLE ... DROP COLUMN).

The 'built' flags may be easily replaced with a check if the bytea-like
columns are NULL, and the 'enabled' columns may be replaced by the array
of char, just like you proposed.

That'd give us a single catalog looking like this:

pg_mv_statistics
starelid
staname
stanamespace
staowner -- all the above as currently
staenabled array of "char" {d,f,s}
stakeys
stadeps (dependencies)
standist (ndistinct coefficients)
stamcv (MCV list)
stahist (histogram)

Which is probably a better / simpler structure than the current one.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-01-30 20:08:29 Re: [COMMITTERS] pgsql: test_pg_dump TAP test whitespace cleanup
Previous Message David Steele 2017-01-30 19:49:49 Re: patch proposal