Re: Abbreviated keys for Numeric

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: Abbreviated keys for Numeric
Date: 2015-03-23 17:01:27
Message-ID: CAM3SWZS2gfxcct7oyEWcNsjXzEeCkKH6rATtUPv6fb_9tFmyRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 22, 2015 at 1:01 PM, Andrew Gierth
<andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:
> The substance of the code is unchanged from my original patch. I didn't
> add diagnostic output to numeric_abbrev_abort, see my separate post
> about the suggestion of a GUC for that.

I don't think that V2 really changed the substance, which seems to be
the implication of your remarks here. You disagreed with my decision
on NULL values - causing me to reconsider my position (so that's now
irrelevant) - and you disagreed with not including support for 32-bit
platforms. Those were the only non-stylistic changes, though. I
certainly didn't change any details of the algorithm that you
proposed, which, FWIW, I think is rather clever. I added a few
defensive assertions to the encoding/conversion routine (which I see
you've removed in V3, a long with a couple of other helpful
assertions), and restructured and expanded upon the comments, but
that's all.

You haven't really taken into my account my V2 feedback with this V3
revision. Even after you yourself specifically called out your
non-explanation of excess-44 as a possible point of confusion for
readers of your patch, you didn't change or expand upon your remarks
on that one iota.

I see that code like this (from your V1) appears in your V3, as if V2
never happened:

+/*
+ * non-fmgr interface to the comparison routine to allow sortsupport to elide
+ * the fmgr call. The saving here is small given how slow numeric
+ * comparisons are, but there's no reason not to implement it.
+ */

This comment is totally redundant at best, and misleading at worst. Of
course we have a non-fmgr interface - it's needed to make abbreviation
work. And of course *any* opclass that performs abbreviation won't get
any great saving from cases where only the fmgr interface is elided
(e.g. numeric is not the leading sorttuple attribute). Text is unusual
in that there is a small additional saving over fmgr-elision for
obscure reasons.

Ditto for comments like this, which re-appear in V3:

+/*
+ * Ordinary (non-sortsupport) comparisons follow.
+ */

Your V3 has obsolete comments here:

+ nss = palloc(sizeof(NumericSortSupport));
+
+ /*
+ * palloc a buffer for handling unaligned packed values in addition to
+ * the support struct
+ */
+ nss->buf = palloc(VARATT_SHORT_MAX + VARHDRSZ + 1);

I still don't think you should be referencing the text opclass behavior here:

+ * number. We make no attempt to estimate the cardinality of the real values,
+ * since it plays no part in the cost model here (if the abbreviation is equal,
+ * the cost of comparing equal and unequal underlying values is comparable).
+ * We discontinue even checking for abort (saving us the hashing overhead) if
+ * the estimated cardinality gets to 100k; that would be enough to support many
+ * billions of rows while doing no worse than breaking even.

This is dubious:

+#if DEC_DIGITS != 4
+#error "Numeric bases other than 10000 are no longer supported"
+#endif

Because there is a bunch of code within numeric.c that deals with the
DEC_DIGITS != 4 case. For example, this code has been within numeric.c
forever:

#if DEC_DIGITS == 4 || DEC_DIGITS == 2
static NumericDigit const_ten_data[1] = {10};
static NumericVar const_ten =
{1, 0, NUMERIC_POS, 0, NULL, const_ten_data};
#elif DEC_DIGITS == 1
static NumericDigit const_ten_data[1] = {1};
static NumericVar const_ten =
{1, 1, NUMERIC_POS, 0, NULL, const_ten_data};
#endif

As has this:

while (ndigits-- > 0)
{
#if DEC_DIGITS == 4
*digits++ = ((decdigits[i] * 10 + decdigits[i + 1]) * 10 +
decdigits[i + 2]) * 10 + decdigits[i + 3];
#elif DEC_DIGITS == 2
*digits++ = decdigits[i] * 10 + decdigits[i + 1];
#elif DEC_DIGITS == 1
*digits++ = decdigits[i];
#else
#error unsupported NBASE
#endif
i += DEC_DIGITS;
}

I tend to think that when Tom wrote this code back in 2003, he thought
it might be useful to change DEC_DIGITS on certain builds. And so, we
ought to continue to support it to the extent that we already do,
allowing these cases to opt out of abbreviation in an informed manner
(since it seems hard to make abbreviation work with DEC_DIGITS != 4
builds). In any case, you should have deleted all this code (which
there is rather a lot of) in proposing to not support DEC_DIGITS != 4
builds generally.

Attached revision, V4, incorporates some of your V3 changes. As I
mentioned, I changed my mind on the counting of non-NULL values, which
V4 reflects...but there are also comments that now make it clear why
that might be useful.

I've retained your new allocate once approach to buffer sizing, which
seems sound if a little obscure.

This abort function code (from your V1 + V3) seems misleading:

+ if (memtupcount < 10000 || nss->input_count < 10000 || !nss->estimating)
+ return false;

It's impossible for "memtupcount < 10000" while "nss->input_count <
10000" as well, since the latter counts a subset of what the former
counts. So I only check the latter now.

I have not added back 32-bit support, which, IMV isn't worth it at
this time - it is, after all, almost a fully independent abbreviation
scheme to the 64-bit scheme. Right now we ought to be cutting scope,
or 9.5 will never reach feature freeze. I have already spent
considerably more time on this patch than I had intended to. I need to
catch a flight to New York at the crack of dawn tomorrow, to get to
pgConf US, and I have much work to do on my UPSERT patch, so I've
simply run out of time. If the committer that picks this up wants to
look at 32-bit support, that may make sense. I think that given the
current lack of cycles from everyone, we'll be doing well to even get
64-bit numeric abbreviation support in 9.5. I ask that you have a some
perspective on cutting 32-bit support, Andrew. In any case, I've
personally run out of time for this for 9.5.

We may add a patch to change things so that a GUC controls
abbreviation debug output, and we may add a patch that standardizes
that INT64_MIN and INT64_MAX are available everywhere. But unless and
until those other patches are committed, I see no reason to assume
that they will be, and so those aspects remain unchanged from my V2.
Unless there is strong opposition, or bugs come to light, or the
GUC/macros are added by independent commits to the master branch, I
expect to mark this revision "Ready for Committer" shortly.

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
numeric_sortsup_v4.patch text/x-patch 15.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-03-23 17:01:48 Re: logical column ordering
Previous Message David Steele 2015-03-23 16:40:36 Re: Auditing extension for PostgreSQL (Take 2)