Re: Perform COPY FROM encoding conversions in larger chunks

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Perform COPY FROM encoding conversions in larger chunks
Date: 2021-01-28 13:05:39
Message-ID: 02da25ef-b579-2236-d3cd-0d07819cce98@iki.fi
Lists: pgsql-hackers

On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.
>
> 0004 is also pretty straightforward. The check you proposed upthread for
> pg_upgrade seems like the best solution to make that workable. I'll take
> a look at 0005 soon.
>
> I measured the conversions that were rewritten in 0003, and there is
> indeed a noticeable speedup:
>
> Big5 to EUC-TW:
>
> head    196ms
> 0001-3  152ms
>
> EUC-TW to Big5:
>
> head    190ms
> 0001-3  144ms
>
> I've attached the driver function for reference. Example use:
>
> select drive_conversion(
>   1000, 'euc_tw'::name, 'big5'::name,
>   convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
> );

Thanks! I have committed patches 0001 and 0003 in this series, with
minor comment fixes. Next I'm going to write the pg_upgrade check for
patch 0004, to get that into a committable state too.

> I took a look at the test suite also, and the only thing to note is a
> couple places where the comment doesn't match the code:
>
> +  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
> +  byte1 = hex('0e');
> +  for byte2 in hex('a1')..hex('df') loop
> +    return next b(byte1, byte2);
> +  end loop;
> +
> +  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
> +  byte1 = hex('0f');
> +  for byte2 in hex('a1')..hex('fe') loop
> +    for byte3 in hex('a1')..hex('fe') loop
> +      return next b(byte1, byte2, byte3);
> +    end loop;
> +  end loop;
>
> Not sure if it matters, but thought I'd mention it anyway.

Good catch! The comments were correct and the tests were wrong: they did
not test those 2- and 3-byte encoded characters as intended. It doesn't
matter for testing this patch; I only included those euc_jis_2004 tests
for the sake of completeness. But if someone finds this test suite in
the archives and wants to use it for something real, make sure you fix
that first.
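
For anyone picking this up later, here is a rough sketch of the byte
sequences the comments describe, with the lead bytes set to 0x8e (SS2)
and 0x8f (SS3) instead of 0x0e and 0x0f. It is written in Python rather
than PL/pgSQL because the test suite's hex() and b() helpers are not
shown in this message. The function names are made up for illustration,
and the character-set labels are copied from the quoted test comments.
Treat it as an illustration of the intended ranges, not as the fix to
the test suite.

```python
# Lead bytes named in the quoted test comments (not 0x0e / 0x0f).
SS2 = 0x8E  # single shift 2: introduces a 2-byte sequence
SS3 = 0x8F  # single shift 3: introduces a 3-byte sequence


def jisx0201_sequences():
    """Yield each 2-byte sequence: SS2 followed by a byte in 0xa1..0xdf."""
    for byte2 in range(0xA1, 0xDF + 1):
        yield bytes([SS2, byte2])


def jisx0212_sequences():
    """Yield each 3-byte sequence: SS3 followed by two bytes in 0xa1..0xfe."""
    for byte2 in range(0xA1, 0xFE + 1):
        for byte3 in range(0xA1, 0xFE + 1):
            yield bytes([SS3, byte2, byte3])


if __name__ == "__main__":
    two_byte = list(jisx0201_sequences())
    three_byte = list(jisx0212_sequences())
    # Print the counts and the first sequence of each kind as hex.
    print(len(two_byte), two_byte[0].hex())
    print(len(three_byte), three_byte[0].hex())
```

The loop bounds are the ones in the quoted test code; only the lead
bytes differ.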

- Heikki
