Re: [HACKERS] Index greater than 8k

From: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 18:49:37
Message-ID: 200610311049.37817.darcyb@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > a self contained test case directly to Teodor which shows the error.
> >
> > 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or set
> expression. But all big texts (on the same language) will be similar :(.
> So, I didn't take care about guarantee that index tuple's size limitation.
> In principle, it's possible to modify pg_trgm to have such guarantee, but
> index becomes lossy - all tuples gotten from index should be checked by
> table's tuple evaluation.

The problem is some of the data we are working with is not strictly "text" but
bytea that we've run through encode(bytea, 'escape'), and we've had to resort
to trigrams in an attempt to mimic LIKE for searches. From our findings
tsearch2 does not match partial words, in the same way that a LIKE would. ie
col LIKE 'go%' would match good, gopher. pg_tgrm will return those with the
limit set appropriately, but tsearch2 does not.

>
> If you want to search similar documents I can recommend to have a look to
> fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
> close to trigrams and metrics of similarity is the same, but uses another
> signature calculations. And, there are some tips and trics: removing HTML
> marking,removing punctuation, lowercasing text and so on - it's interesting
> and complex task.

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Geoffrey 2006-10-31 19:00:12 Re: updating to 7.4.13 helped it appears
Previous Message Alvaro Herrera 2006-10-31 18:31:55 Re: updating to 7.4.13 helped it appears

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2006-10-31 19:29:00 Re: [HACKERS] Index greater than 8k
Previous Message Chuck McDevitt 2006-10-31 18:32:42 Re: [HACKERS] Case Preservation disregarding case