Re: B-Tree support function number 3 (strxfrm() optimization)

From: Wim Lewis <wiml(at)omnigroup(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-07-29 21:42:24
Message-ID: 62B542AE-C43B-4B48-8989-E3999F326534@omnigroup.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 28 Jul 2014, at 5:23 PM, Peter Geoghegan wrote:
> On Mon, Jul 28, 2014 at 5:14 PM, Wim Lewis <wiml(at)omnigroup(dot)com> wrote:
>> A quick glance at OSX's strxfrm() suggests they're using an implementation of strxfrm() from FreeBSD. You can find the source here:
>>
>> http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c
>>
>> (and a really quick glance at the contents of libc on OSX 10.9 reinforces this--- I don't see any calls into their CoreFoundation unicode string APIs.)
>
> Something isn't quite accounted for, then. The FreeBSD behavior is to
> append the primary weights only. That makes their returned blobs
> smaller than those you'll see on Linux, but also appears to imply that
> their implementation is substandard (The PostgreSQL port uses ICU on
> FreeBSD for a reason, I suppose). But FreeBSD did not add extra,
> redundant "header bytes" right in the primary level when I tested it,
> but I'm told Mac OS X does.

I don't think OSX actually does. From a look at the source and a simple test program, OSX's strxfrm represents its output as a series of 24-bit weights, each one encoded into four bytes. Multiple "levels" of weights are separated by a weight of 0 (which is encoded as the four bytes "0000").

Robert Haas' message in this thread on 7 April is the first mention of a header, but his examples from 12 June don't really demonstrate a header--- they're completely consistent with the description above and the published source code.

OSX's strxfrm output is very space-inefficient, especially for Latin text, where none of the weights seem to be greater than 2^12, meaning that half of the bytes in any output are always going to be 0x30. (Testing with, e.g., Hangul characters I find some weights that use more of the space.) But as far as I can tell it's not completely crazy. :)

What this means is that on OSX, comparing the first 8 bytes of strxfrm output will end up comparing the primary weights of the first two characters in the original string. Which is the same conclusion Robert Haas came to earlier, modulo the "header bytes" interpretation.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2014-07-29 21:59:20 Re: Reminder: time to stand down from 8.4 maintenance
Previous Message Robert Haas 2014-07-29 20:17:50 Re: Proposal to add a QNX 6.5 port to PostgreSQL