Re: Better locale-specific-character-class handling for regexps

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Bruno Wolff III <bruno(at)wolff(dot)to>
Subject: Re: Better locale-specific-character-class handling for regexps
Date: 2016-09-04 13:56:32
Message-ID: e604d9ec-0c09-5e1d-14a2-196dd66fd462@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/23/2016 03:54 AM, Tom Lane wrote:
> ! That's still not quite enough, though, because of locale-dependent
> ! character classes such as [[:alpha:]]. In Unicode locales these classes
> ! may have thousands of entries that are above MAX_SIMPLE_CHR, and we
> ! certainly don't want to be searching large colormaprange arrays at runtime.
> ! Nor do we even want to spend the time to initialize cvec structures that
> ! exhaustively describe all of those characters. Our solution is to compute
> ! exact per-character colors at compile time only up to MAX_SIMPLE_CHR.
> ! For characters above that, we apply the <ctype.h> or <wctype.h> lookup
> ! functions at runtime for each locale-dependent character class used in the
> ! regex pattern, constructing a bitmap that describes which classes the
> ! runtime character belongs to. The per-character-range data structure
> ! mentioned above actually holds, for each range, a separate color entry
> ! for each possible combination of character class properties. That is,
> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
> ! whose rows correspond to character ranges that are explicitly mentioned
> ! in the input, and whose columns correspond to sets of relevant locale
> ! character classes.

I think that last sentence should say "whose rows correspond to
character ranges that are explicitly mentioned in the *regex pattern*",
rather than "in the input".

An example would be very helpful here. I had to read through this many
times, until I understood it. I can easily come up with examples for
character classes, but not for "high" character-ranges. The best I could
come up with is to check if a characters belongs to some special group
of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical
symbol characters. In practice, I guess you will only see single
characters in the colormaprange array, although we must of course cope
with ranges too.

> + /* this relies on WHITE being zero: */
> + memset(cm->locolormap, WHITE,
> + (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
> +
> + memset(cm->classbits, 0, sizeof(cm->classbits));
> + cm->numcmranges = 0;
> + cm->cmranges = NULL;
> + cm->maxarrayrows = 4; /* arbitrary initial allocation */
> + cm->hiarrayrows = 1;
> + cm->hiarraycols = 1;
> + cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
> + if (cm->hicolormap == NULL)
> + {
> + CERR(REG_ESPACE);
> + return;
> + }
> + /* initialize the "all other characters" row to WHITE */
> + cm->hicolormap[0] = WHITE;

Is the comment correct? I don't see why this wouldn't work with "WHITE
!= 0".

> ! /* Duplicate existing columns to the right, and increase ref counts */
> ! /* Must work downwards in the array because we realloc'd in place */
> ! for (r = cm->hiarrayrows - 1; r >= 0; r--)
> ! {
> ! color *oldrowptr = &newarray[r * cm->hiarraycols];
> ! color *newrowptr = &newarray[r * cm->hiarraycols * 2];
> ! color *newrowptr2 = newrowptr + cm->hiarraycols;
>
> ! for (c = 0; c < cm->hiarraycols; c++)
> ! {
> ! color co = oldrowptr[c];
> !
> ! newrowptr[c] = newrowptr2[c] = co;
> ! cm->cd[co].nuchrs++;
> ! }
> ! }

Perhaps "backwards" would be clearer than "downwards"? At least in my
mental model, index 0 is the top row of an array, so "downwards" means
0, 1, 2. I guess you meant downwards numerically, rather than visually,
but it took me a moment to process that.

+1 for this patch in general. Some regression test cases would be nice.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2016-09-04 14:30:35 Re: LSN as a recovery target
Previous Message Abhijit Menon-Sen 2016-09-04 13:32:17 Re: LSN as a recovery target