Re: B-Tree support function number 3 (strxfrm() optimization)

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-06-12 16:42:05
Message-ID: CAGTBQpaq3gSbwtFsocZDRToBhoepWMuGnC6k2DV6aHcuD537mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T0000001S001T"
> "ac" -> "001S001U0000001S001U"
> "ad" -> "001S001V0000001S001V"
> "ae" -> "001S001W0000001S001W"
> "af" -> "001S001X0000001S001X"
> "a%" -> "001S000W0000001S000W"
> "a0" -> "001S000b0000001S000b"
> "a " -> "001S000R0000001S000R"

...

> [rhaas(at)hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.

If you knew mac's output character set with some certainty, you could
compress it rather efficiently by applying a tabulated decode back
into non-printable bytes. Say, like base64 decoding (the set appears
to be a subset of base64, but it's hard to be sure).

Ie,
x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)

This can be made rather efficient. But the first step is defining with
some certainty the output character set.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Sawada Masahiko 2014-06-12 17:16:52 add line number as prompt option to psql
Previous Message Tom Lane 2014-06-12 16:37:48 refactoring allpaths.c (was Re: Suppressing unused subquery output columns)