Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Reusing abbreviated keys during second pass of ordered [set] aggregates
Date: 2015-12-22 17:31:31
Message-ID: CA+TgmoY7LdpFSanui2PawqZyCe6TxCL47Mb3yOtnSE=gfXyohw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 21, 2015 at 2:55 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Mon, Dec 21, 2015 at 10:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> PFA my proposal for comment changes for 9.5 and master. This is based
>> on your 0001, but I edited somewhat. Please let me know your
>> thoughts. I am not willing to go further and rearrange actual code in
>> 9.5 at this point; it just isn't necessary.
>
> Fine by me. But this revision hasn't made the important point at all
> -- which is that 0002 is safe. That's a stronger guarantee than the
> abbreviated key representation being pass-by-value.

Right. I don't think that we should back-patch that stuff into 9.5.

>> Looking at this whole system again, I wonder if we're missing a trick
>> here. How about if we decree from on high that the abbreviated-key
>> comparator is always just the application of an integer comparison
>> operator? The only abbreviation function that we have right now that
>> wouldn't be happy with that is the one for numeric, and that looks
>> like it could be fixed.
>
> I gather you mean that we'd decree that they were always compared
> using a text or uuid style 3-way unsigned integer comparison, allowing
> us to hardcode that.

Right.

>> Then we could hard-code knowledge of this
>> representation in tuplesort.c in such a way that it wouldn't need to
>> call a comparator function at all - instead of doing
>> ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
>> it would do something like:
>>
>> if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
>> return compare;
>>
>> I'm not sure if that would save enough cycles vs. the status quo to be
>> worth bothering with, but it seems like it might. You may have a
>> better feeling for that than I do.
>
> I like this idea. I thought of it myself already, actually.
>
> It has some nice properties, because unsigned integers are so simple
> and flexible. For example, we can do it with int4 sometimes, and then
> compare two attributes at once on 64-bit platforms. Maybe the second
> attribute (the second set of 4 bytes) isn't an int4, though -- it's
> any other type's abbreviated key (which can be assumed to have a
> compatible representation). That's one more advanced possibility.

Yikes.

> It seems worthwhile because numeric is the odd one out. Once I or
> someone else gets around to adding network types, which could happen
> in 9.6, then we are done (because there is already a pending patch
> that adds support for remaining character-like types and alternative
> opclasses). I really don't foresee much additional use of abbreviated
> keys beyond these cases. There'd be a small but nice boost by not
> doing pointer chasing for the abbreviated key comparison, I imagine,
> but that benefit would be felt everywhere. Numeric is the odd one out
> currently, but as you say it can be fixed, and I foresee no other
> trouble from any other opclass/type that cares about abbreviation.
>
> Another issue is that abbreviated keys in indexes are probably not
> going to take 8 bytes, because they'll go in the ItemId array. It's
> likely to be very useful to be able to take the first two bytes, and
> know that a uint16 comparison is all that is needed, and have a basis
> to perform an interpolation search rather than a binary search
> (although that needs to be adaptive to different distributions, I
> think it could be an effective technique -- there are many cache
> misses for binary searches on internal B-Tree pages).

Hmm, that seems iffy to me. There are plenty of data sets where 8
bytes is enough to get some meaningful information about what part of
the keyspace you're in, but 2 bytes isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-12-22 17:34:26 Re: Possible marginally-incompatible change to array subscripting
Previous Message Robert Haas 2015-12-22 17:13:32 Re: Speed up Clog Access by increasing CLOG buffers