Quick Links

Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-11 11:22:09
Message-ID:	789A2F56-0E42-409D-A840-6AF5110D6085@pingpong.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> 11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <girgen(at)pingpong(dot)net>:
>
>>
>> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg(at)heroku(dot)com>:
>>
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.
>>
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
>
> My plan was to get it working without TRUST_STRXFRM first, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code base.

Here are some preliminary results running 10000 times comparing the same two strings in a tight loop.

ucol_strcollUTF8: -1 0.002448
strcoll: 1 0.060711
ucol_strcollIter: -1 0.009221
direct memcmp: 1 0.000457
memcpy memcmp: 1 0.001706
memcpy strcoll: 1 0.068425
nextSortKeyPart: -1 0.041011
ucnv_toUChars + getSortKey: -1 0.050379

correct answer is -1, but since we compare åasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times faster than ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart, is way slower.

startTime = getRealTime();
for ( int i = 0; i < loop; i++) {
result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);

int sortkeysize=8;

startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter;

for ( int i = 0; i < loop; i++) {
uiter_setUTF8(&sIter, arg1, len1);
uiter_setUTF8(&tIter, arg2, len2);
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
result = memcmp (key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);

But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remember correctly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess...

Palle

In response to

Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) at 2016-08-11 09:15:36 from Palle Girgensohn

Responses

Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) at 2016-08-11 16:56:25 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stas Kelvich	2016-08-11 11:34:30	Re: Logical Replication WIP
Previous Message	Shay Rojansky	2016-08-11 10:07:52	Re: Slowness of extended protocol