Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 04:03:58
Message-ID: 45FCBA2E.7010303@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char
>> data. It doesn't look to me like the saving you're suggesting is worth
>> the trouble.
>>
>
> Hmm ... not sure I believe your numbers. Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>

Here are some test results that are closer to yours. I used a temp table
and had cassert off and fsync off, and tried with several encodings.
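Neither message shows how the input files were produced. As a rough sketch only, a file like the one described in the quoted text (100 random Latin-1 characters per line, written out as UTF8) could be generated along these lines. The file name, the seed, and the exact character set are assumptions made for illustration; backslash is left out because COPY's text format treats it as an escape character.

```python
import random

# Assumed alphabet: printable ASCII without backslash, plus the printable
# upper half of Latin-1 (U+00A0..U+00FF). The upper-half characters each
# take two bytes in UTF-8, so lines come out roughly half 1-byte, half 2-byte.
ALPHABET = [chr(c) for c in range(0x20, 0x7F) if chr(c) != "\\"] + \
           [chr(c) for c in range(0xA0, 0x100)]


def random_line(rng, length=100):
    """Return one line of `length` random characters drawn from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def write_test_file(path, rows=1_000_000, length=100, seed=0):
    """Write `rows` random lines to `path`, encoded as UTF-8."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for _ in range(rows):
            f.write(random_line(rng, length) + "\n")


if __name__ == "__main__":
    # Hypothetical output name; load it with COPY as in the quoted session.
    write_test_file("zzz1m.utf8")
```

For the SQL_ASCII and LATIN1 runs the same data would need to be supplied in a form those databases accept; how that was done here is not stated.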

The additional load from the test isn't 50% (I think you have added the
cost of going from ASCII to UTF8 to the cost of the test to get that
50%), but it is nevertheless appreciable.

I agree that we should look at skipping the test when the client and
server encodings are the same, so we can reduce the difference.

cheers

andrew

                   Run   SQL_ASCII     LATIN1       UTF8

Without test         1     4659.38    4766.07    9134.53
                     2     7999.64    4003.13    6231.41
                     3     4178.46    6178.89    7266.39
                     4      4201.7    3930.84   10154.38
                     5     4092.44    4444.52    9438.24
                     6     3977.34    4197.09    8866.56
               Average     4851.49    4586.76    8515.25

With test            1    11993.86    12625.8   10109.89
                     2     4647.16    9192.53   11251.27
                     3     4211.02    9903.77   10097.37
                     4     9203.62    7045.06   10372.25
                     5     4121.39    4138.78   10386.92
                     6     3722.73    4552.09    7432.56
               Average     6316.63    7909.67    9941.71
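
To put a number on "appreciable", the relative cost of the test can be worked out from the two Average rows. This is only an illustrative calculation on the figures above; the dictionary and function names are made up for the example.

```python
# Average rows copied from the table above.
without_test = {"SQL_ASCII": 4851.49, "LATIN1": 4586.76, "UTF8": 8515.25}
with_test = {"SQL_ASCII": 6316.63, "LATIN1": 7909.67, "UTF8": 9941.71}


def overhead_percent(base, tested):
    """Extra time with the test, as a percentage of the time without it."""
    return (tested - base) / base * 100.0


# Per-encoding overhead of running the encoding test.
overheads = {
    enc: overhead_percent(without_test[enc], with_test[enc])
    for enc in without_test
}

for enc, pct in overheads.items():
    print(f"{enc:10s} {pct:6.1f}%")
```

Individual runs vary widely (SQL_ASCII without the test ranges from 3977.34 to 7999.64), so the per-encoding percentages should be read loosely.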
