Re: [GENERAL] workaround for lack of REPLACE() function

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: mail(at)joeconway(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, lockhart(at)fourpalms(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [GENERAL] workaround for lack of REPLACE() function
Date: 2002-08-12 01:07:20
Message-ID: 20020812.100720.91313063.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers pgsql-patches

> Now, in the multibyte case, again in textlen(), I see:
>
> /* optimization for single byte encoding */
> if (pg_database_encoding_max_length() <= 1)
> PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ);
>
> PG_RETURN_INT32(
> pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ));
>
> Three questions here.
> 1) In the case of encoding max length == 1, can we treat it the same as
> the non-multibyte case (I presume they are exactly the same)?

Yes.

> 2) Can encoding max length ever be < 1? Doesn't make sense to me.

No. It seems just a defensive coding.

> 3) In the case of encoding max length > 1, if I understand correctly,
> each encoded character can be one *or more* bytes, up to and encluding
> encoding max length bytes.

Right.

> So the *only* way presently to get the length
> of the original character string is to loop through the entire string
> checking the length of each individual character (that's what
> pg_mbstrlen_with_len() does it seems)?

Yes.

> Finally, if 3) is true, then there is no way to avoid the retrieval and
> decompression of the datum just to find out its length. For large
> datums, detoasting plus the looping through each character would add a
> huge amount of overhead just to get at the length of the original
> string. I don't know if we need to be able to get *just* the length
> often enough to really care, but if we do, I had an idea for some future
> release (I wouldn't propose doing this for 7.3):
>
> - add a new EXTENDED state to va_external for MULTIBYTE
> - any string with max encoding length > 1 would be EXTENDED even if it
> is not EXTERNAL and not COMPRESSED.
> - to each of the structs in the union, add va_strlen
> - populate va_strlen on INSERT and maintain it on UPDATE.
>
> Now a new function similar to toast_raw_datum_size(), maybe
> toast_raw_datum_strlen() could be used to get the original string
> length, whether MB or not, without needing to retrieve and decompress
> the entire datum.
>
> I understand we would either: have to steal another bit from the VARHDR
> which would reduce the effective size of a valena from 1GB down to .5GB;
> or we would need to add a byte or two to the VARHDR which is extra
> per-datum overhead. I'm not sure we would want to do either. But I
> wanted to toss out the idea while it was fresh on my mind.

Interesting idea. I also was thinking about adding some extra
infomation to text data types such as character set, collation
etc. for 7.4 or later.
--
Tatsuo Ishii

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Curt Sampson 2002-08-12 03:25:06 Re: oid's and primary keys on insert
Previous Message Justin Clift 2002-08-12 00:40:27 Another database web poll

Browse pgsql-hackers by date

  From Date Subject
Next Message Christopher Kings-Lynne 2002-08-12 02:15:16 Re: python patch
Previous Message Curt Sampson 2002-08-12 00:18:16 Re: Table Inheritance Discussion

Browse pgsql-patches by date

  From Date Subject
Next Message Rod Taylor 2002-08-12 01:19:26 Dump serials as serial -- not a sequence
Previous Message Alvaro Herrera 2002-08-11 23:44:25 Re: PGPASSWORDFILE env var for libpq