Re: trgm regex index peculiarity

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trgm regex index peculiarity
Date: 2014-01-15 23:34:16
Message-ID: 4963.1389828856@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit. It seems like this:

+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32

is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

Also, surely this bit:

! trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Another point is that this comment:

* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Chinner 2014-01-15 23:57:35 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Previous Message Alvaro Herrera 2014-01-15 23:31:21 Re: Backup throttling