Re: Collect frequency statistics for arrays

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-02 16:54:28
Message-ID: 6098.1330707268@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Still working through this patch ... there are some things that bother
me about the entries being made in pg_statistic:

1. You re-used STATISTIC_KIND_MCELEM for something that, while similar
to tsvector's usage, is not the same. In particular, tsvector adds two
extra elements to the stanumbers array, but this patch adds four (and
doesn't bother documenting that in pg_statistic.h). I don't think it's
acceptable to re-use the same stakind value for something that's not
following the same specification. I see Nathan complained of this way,
way upthread, but nothing was done about it.

I think we should either assign a different stakind code for this
definition, or change things around so that tsvector actually is using
the same stats kind definition as arrays are. (We could get away with
redefining the tsvector stats format, because pg_upgrade doesn't try to
copy pg_statistic rows to the new database.) Now, of the two new
elements added by the patch, it seems to me to be perfectly reasonable
to add a null-element frequency to the kind specification; the fact that
it wasn't there to start with is kind of an oversight born of the fact
that tsvectors don't contain any null lexemes. But the other new
element is the average distinct element count, which really does not
belong here at all, as it is *entirely* unrelated to element
frequencies. It seems to me that that more nearly belongs in the
element-count histogram slot. So my preference is to align the two
definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency
to tsvector's usage (where it'll always be zero) and getting rid of the
average distinct element count here.

2. I think STATISTIC_KIND_LENGTH_HISTOGRAM is badly named and
confusingly documented. The stats are not about anything I would call a
"length" --- rather we're considering the counts of numbers of distinct
element values present in each array value. An ideal name perhaps would
be STATISTIC_KIND_DISTINCT_ELEMENTS_COUNT_HISTOGRAM, but of course
that's unreasonably long. Considering the way that the existing stats
kind names are abbreviated, maybe STATISTIC_KIND_DECHIST would do.
Anybody have a better idea?

3. I also find it a bit odd that you chose to store the length (count)
histogram as an integer array in stavalues. Usually we've put such data
in stanumbers. That would make the entries float4 not integer, but that
doesn't seem unreasonable to me --- values would still be exact up to
2^24 or so on typical machines, and if we ever do have values larger
than that, it seems to me that having headroom to go above 2^32 would
be a good thing. In any case, if we're going to move the average
distinct-element count over here, that would have to go into stanumbers.

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-03-02 16:58:01 Re: 16-bit page checksums for 9.2
Previous Message Noah Misch 2012-03-02 16:41:05 Re: ECPG FETCH readahead