From: | Claudio Freire <klaussfreire(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Peter Geoghegan <pg(at)heroku(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: B-Tree support function number 3 (strxfrm() optimization) |
Date: | 2014-06-12 16:42:05 |
Message-ID: | CAGTBQpaq3gSbwtFsocZDRToBhoepWMuGnC6k2DV6aHcuD537mg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T0000001S001T"
> "ac" -> "001S001U0000001S001U"
> "ad" -> "001S001V0000001S001V"
> "ae" -> "001S001W0000001S001W"
> "af" -> "001S001X0000001S001X"
> "a%" -> "001S000W0000001S000W"
> "a0" -> "001S000b0000001S000b"
> "a " -> "001S000R0000001S000R"
...
> [rhaas(at)hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.
If you knew mac's output character set with some certainty, you could
compress it rather efficiently by applying a tabulated decode back
into non-printable bytes. Say, like base64 decoding (the set appears
to be a subset of base64, but it's hard to be sure).
Ie,
x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)
This can be made rather efficient. But the first step is defining with
some certainty the output character set.
From | Date | Subject | |
---|---|---|---|
Next Message | Sawada Masahiko | 2014-06-12 17:16:52 | add line number as prompt option to psql |
Previous Message | Tom Lane | 2014-06-12 16:37:48 | refactoring allpaths.c (was Re: Suppressing unused subquery output columns) |