Pg_trgm and "invalid invalid byte sequence for encoding UTF8"

From: alexandros_e <alexandros(dot)ef(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Pg_trgm and "invalid invalid byte sequence for encoding UTF8"
Date: 2014-02-12 20:20:57
Message-ID: 1392236457460-5791681.post@n5.nabble.com
Lists: pgsql-general

Hello experts,

I want to compare integer arrays using methods based on string
similarity (e.g., Levenshtein, trigrams). To do that, I hacked together
a custom function that converts each integer array to a string, where each
integer is converted to a character by CHR(my_array1[i]+64) (so
that 1->A, 2->B, etc.). For large integers (I have integers up to
300,000), this hack probably creates invalid UTF-8 characters.
Levenshtein (from the fuzzystrmatch module) does not seem to have a problem
with that and works perfectly, since it is based on just comparing UTF8
codes. On the other hand, when I try the pg_trgm operator
array1<->array1, it works in some cases (I think it works for all integers
up to 4096), but for some larger values I get invalid byte sequence for
encoding "UTF8" errors:

Example integer sequence

"8527,63586,8526,63585,63584,63583,63582,8525,8760,63820,63821,63822,860,57610,861,57611,862,57612,57613,863,57614,57615,57616,39850,39851,39852,39853,39854,39855,95275,39856,39857,95276,95277,39858,95278,95279,39859,95280,39860,95281,95282,39861,39862,39863,95283,95284,27095,27096,82406,82407,27097,27098,27099,27100,82408,27101,27102,27103,25702,80837,25703,25704,80838,25705,25706,25707,25708,30011,85343,30012,85344,30013,30014,51019,48260,48261,56809,56810,56811,56812,113829,31762,87568,31763,45925,41778,41779,41780,31778,31779,87571}";

Error message:

invalid byte sequence for encoding "UTF8": 0xed 0xb8 0xa9
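
The bytes in the error message can be decoded by hand to see which integer triggers it. What follows is a minimal Python sketch, separate from the PostgreSQL function itself. The variable names are illustrative, `sample` is only a subset of the sequence above, and the sketch assumes the standard 3-byte UTF-8 layout and the +64 offset described earlier.

```python
# Decode the three bytes from the error message by hand (3-byte UTF-8 form:
# 1110xxxx 10xxxxxx 10xxxxxx).
raw = bytes([0xED, 0xB8, 0xA9])
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)

# Undo the +64 offset used in CHR(my_array1[i] + 64).
source_integer = code_point - 64

# U+D800..U+DFFF is the UTF-16 surrogate range; UTF-8 does not allow
# these code points to be encoded.
is_surrogate = 0xD800 <= code_point <= 0xDFFF

# A subset of the example sequence (not the full list).
sample = [48260, 48261, 56809, 56810, 56811, 56812, 113829]
offending = [n for n in sample if 0xD800 <= n + 64 <= 0xDFFF]

print(hex(code_point), source_integer, is_surrogate, offending)
```

Under these assumptions the bytes correspond to code point U+DE29, that is, integer 56809 from the sequence, which lands in the surrogate range after the +64 offset. This would explain why only some larger integers fail rather than all of them.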

Is there a way to suppress these errors, so that pg_trgm behaves like
levenshtein, which does not care about the validity of UTF-8 characters?

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Pg-trgm-and-invalid-invalid-byte-sequence-for-encoding-UTF8-tp5791681.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
