From: | Jacob Champion <pchampion(at)vmware(dot)com> |
---|---|
To: | "john(dot)naylor(at)enterprisedb(dot)com" <john(dot)naylor(at)enterprisedb(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz> |
Cc: | "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com> |
Subject: | Re: badly calculated width of emoji in psql |
Date: | 2021-08-20 00:05:19 |
Message-ID: | 39db2e88a680c5c0c113eb1977db90fa19f97155.camel@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, 2021-08-19 at 13:49 -0400, John Naylor wrote:
> I had a couple further thoughts:
>
> 1. The coding historically used normal comparison and branching for
> everything, but recently it only does that to detect control
> characters, and then goes through a binary search (and with this
> patch, two binary searches) for everything else. Although the
> performance regression of the current patch seems negligible,
If I'm reading the code correctly, ASCII characters don't go through
the binary searches; they're already short-circuited at the beginning
of mbbisearch(). On my machine that's enough for the patch to be a
performance _improvement_ for ASCII, not a regression.
Does adding another short-circuit at the top improve the
microbenchmarks noticeably? I assumed the compiler had pretty well
optimized all that already.
> we could use almost the same branches to fast-path printable ascii
> text, like this:
>
> + /* fast path for printable ASCII characters */
> + if (ucs >= 0x20 || ucs < 0x7f)
> + return 1;
Should be && instead of ||, I think.
> +
> /* test for 8-bit control characters */
> if (ucs == 0)
> return 0;
>
> - if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
> + if (ucs < 0xa0 || ucs > 0x0010ffff)
> return -1;
>
> 2. As written, the patch adds a script that's very close to an
> existing one, and emits a new file that has the same type of contents
> as an existing one, both of which are #included in one place. I
> wonder if we should consider having just one script that ingests both
> files and emits one file. All we need is for mbinterval to encode the
> character width, but we can probably do that with a bitfield like the
> normprops table to save space. Then, we only do one binary search.
> Opinions?
I guess it just depends on what the end result looks/performs like.
We'd save seven hops or so in the worst case?
--Jacob
From | Date | Subject | |
---|---|---|---|
Next Message | Sadhuprasad Patro | 2021-08-20 01:02:14 | Re: Support reset of Shared objects statistics in "pg_stat_reset" function |
Previous Message | Michael Paquier | 2021-08-19 23:51:17 | Re: Allow declaration after statement and reformat code to use it |