Re: trgm regex index peculiarity

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trgm regex index peculiarity
Date: 2014-02-09 19:28:54
Message-ID: CAPpHfdtca-Jh5FX=HGG7D0Q9nf0SBK2FSfGr4mo1vKfSyGGgCw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > Revised version of patch with necessary comments.
>
> I looked at this patch a bit. It seems like this:
>
> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
> + * other character in average
> + #define BLANK_COLOR_SIZE 32
>
> is a drastic overestimate. Using the program Erik provided to generate
> some sample data, I find that blanks in trigrams are maybe about three
> times more common than other characters. Possibly Erik's data isn't
> very representative of real text, but if that's the case we'd better
> get a more representative sample case before we start optimizing ...
>
> Now maybe the above number is okay for this purpose anyway, but if so
> it needs some different justification; maybe call it a "penalty"?
> But nonetheless it seems like a darn large penalty, particularly for " x"
> type trigrams where the value would effectively be squared. Compared to
> the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
> you might as well forget the "sizing" approach and just flat out reject
> trigrams containing COLOR_BLANK, because they'll never get through the
> size filter.

It seems to be right decision to me when we have other trigrams can reject
them. But there are cases when we can't.

> I'm inclined to think you need a larger MAX_TRGM_SIZE;
> and WISH_TRGM_SIZE seems mighty small as well, compared to what the
> code would have done before.
>

OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values
for both languages.
Does such reasoning make sense?

Also, surely this bit:
>
> ! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
>
> is flat out wrong? expandColorTrigrams() expects that
> trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
> abstraction. The fact that the patch hasn't failed on you completely
> demonstrates that you've not tested any cases where blank-containing
> trigrams got through the filter, reinforcing my feeling that it's
> probably too strict now.
>

Oh, I wonder how can I leave such weird bug in patch :-(

> I wonder if it would be more appropriate to continue to measure the limit
> MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
> penalized "size" as the sort key while determining which color trigrams
> to discard first.
>

Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
> penalty to blanks in all three positions. Because of the padding
> settings, any one word will normally lead to padded strings " a",
> " ab", "yz ", so that spaces in the first position are about twice
> as common as spaces in the other positions. (It's a little less
> than that because single-letter words produce only " a" and " a ";
> but I'd think that such words aren't very common in text that trigrams
> are effective for.) So I'm thinking the penalty ought to take that
> into account.
>
> I'm also inclined to think that it might be worth adding a size
> field to ColorTrgmInfo rather than repeatedly calculating the
> size estimate. We only allow a thousand and some ColorTrgmInfos
> at most, so the extra space wouldn't be that large, and I'm concerned
> by the expense the current patch adds to the sort comparator.
>

Agree.

>
> Another point is that this comment:
>
> * Note: per-color-trigram counts cannot overflow an int so long as
> * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie
> about
> * 1290. However, the grand total totalTrgmCount might conceivably
> * overflow an int, so we use int64 for that within this routine.
>
> is no longer valid, or at least it fails to demonstrate that the result
> of getColorTrigramSize() can't overflow an int; indeed I rather fear it
> can.
> The patch failed to even update the variable name in this comment, let
> alone address the substantive question.
>
> There are some other cosmetic things I didn't like, for instance
> colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
> rename it. That's about it for substantive comments though.
>

Thanks, will be fixed.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-02-09 21:01:59 Re: trgm regex index peculiarity
Previous Message Tom Lane 2014-02-09 18:19:55 Re: narwhal and PGDLLIMPORT