Quick Links

Re: B-Tree support function number 3 (strxfrm() optimization)

From:	Claudio Freire <klaussfreire(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: B-Tree support function number 3 (strxfrm() optimization)
Date:	2014-06-12 16:42:05
Message-ID:	CAGTBQpaq3gSbwtFsocZDRToBhoepWMuGnC6k2DV6aHcuD537mg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T0000001S001T"
> "ac" -> "001S001U0000001S001U"
> "ad" -> "001S001V0000001S001V"
> "ae" -> "001S001W0000001S001W"
> "af" -> "001S001X0000001S001X"
> "a%" -> "001S000W0000001S000W"
> "a0" -> "001S000b0000001S000b"
> "a " -> "001S000R0000001S000R"

...

> [rhaas(at)hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.

If you knew mac's output character set with some certainty, you could
compress it rather efficiently by applying a tabulated decode back
into non-printable bytes. Say, like base64 decoding (the set appears
to be a subset of base64, but it's hard to be sure).

Ie,
x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)

This can be made rather efficient. But the first step is defining with
some certainty the output character set.

In response to

Re: B-Tree support function number 3 (strxfrm() optimization) at 2014-06-12 16:25:09 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Sawada Masahiko	2014-06-12 17:16:52	add line number as prompt option to psql
Previous Message	Tom Lane	2014-06-12 16:37:48	refactoring allpaths.c (was Re: Suppressing unused subquery output columns)