Re: trgm regex index peculiarity

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trgm regex index peculiarity
Date: 2014-02-09 21:01:59
Message-ID: 27899.1391979719@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I looked at this patch a bit. It seems like this:
>> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
>> + * other character in average
>> + #define BLANK_COLOR_SIZE 32
>> is a drastic overestimate.

> OK, I would like to make more reasoning for penalty.
> Let's consider word of length n.
> It contains n+1 trigrams including:
> 1 __x trigram
> 1 _xx trigram
> 1 xx_ trigram
> n - 2 xxx trigrams

> Assume alphabet of size m those trigrams have following average frequencies:
> __x 1/((n+1)*m)
> _xx 1/((n+1)*m^2)
> xx_ 1/((n+1)*m^2)
> xxx (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.

I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.

So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2014-02-09 21:13:37 Re: proposal, patch: allow multiple plpgsql plugins
Previous Message Alexander Korotkov 2014-02-09 19:28:54 Re: trgm regex index peculiarity