Re: pg_upgrade and statistics

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade and statistics
Date: 2012-03-15 20:34:45
Message-ID: CA+CSw_tEYivSXMLqr7j0S298Mgg1FciLVfnSd=DgTEHt-UnDNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 8:48 PM, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
> What Peter proposed seems to me pretty reasonable, in the sense that it
> should be possible to come up with a function that creates some text
> representation of whatever is in pg_statistic, and another function to
> load that data into the new catalog(s).  There's no need to keep
> pg_statistic binary-compatible, or even continue to have only
> pg_statistic (IIRC Zoltan/Hans-Jurgen patch for cross-column stats adds
> a new catalog, pg_statistic2 or similar).  If the upgrade target release
> has room for more/improved stats, that's fine -- they'll be unused after
> loading the stats from the dump.  And the old stats can be
> reacommodated, if necessary.

I have been reading up on selectivity estimation research for the last
few days. I must say that I also think that having a text
representation as an intermediate won't create a huge maintenance
burden. The basic concepts that are there are pretty solid.
Conceptually MCV's and histograms continue to be essential even with
the more complex approaches. Trying to maintain binary compatibility
is probably a bad idea, as Tom noted with the array selectivity patch
- encoding of the information could be better. But given a textual
format it won't be too hard to just massage the data to the new
format. Making it possible to dump and load stats has the additional
bonus of enabling more experimentation with custom stats collectors.
One could easily prototype the stats collection with R, scipy, etc. Of
course the proof will be in the pudding.

Re, the patch, current posted WIP cross-col patch doesn't create a new
catalog,. It repurposes the stat slots mechanism to store multiple
dimensions. But I'll most likely rewrite it to use a separate catalog
because the storage requirements are rather different. I'll post a
proposal in the appropriate thread when I have decently clear idea how
this should work. One thing that seems clear is that multi-dimensional
histograms will want this mechanism even more, optimal histogram
construction is NP-hard in the multi-dimensional case and so people
will want to try different algorithms, or make different tradeoffs on
effort spent on constructing the histogram. Or even build one by hand.

Cheers,
Ants Aasma

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-15 20:45:58 Re: [v9.2] Add GUC sepgsql.client_label
Previous Message Bruce Momjian 2012-03-15 20:14:05 Re: Keystone auth in PostgreSQL