Re: [HACKERS] Index greater than 8k

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 16:53:07
Message-ID: 45477F73.8050904@sigaev.ru
Lists: pgsql-general pgsql-hackers

> The problem as I remember it is pg_trgm, not tsearch2 directly. I've sent a
> self-contained test case directly to Teodor which shows the error.
>
> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised: why do you use pg_trgm on big text? pg_trgm
is designed to find similar words and uses a technique known as trigrams. This
works well on small pieces of text such as words or set expressions. But all big
texts (in the same language) will be similar :(. So I didn't take care to
guarantee that the index tuple stays within the size limit. In principle, it's
possible to modify pg_trgm to give such a guarantee, but then the index becomes
lossy: every tuple returned by the index has to be rechecked against the
table's tuple.
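
To illustrate the trigram idea, here is a minimal sketch in Python. It is not
the pg_trgm code (which is written in C); the function names are made up for
this example, and the word splitting is a simplification that only handles
ASCII letters and digits. It pads each word with two spaces in front and one
behind, collects all three-character substrings, and measures similarity as
the number of shared trigrams divided by the number of distinct trigrams in
both texts.

```python
import re


def trigrams(text):
    """Return the set of trigrams in text, padding each word pg_trgm-style."""
    result = set()
    # Simplified word splitting: lowercase, ASCII letters and digits only.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both texts."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With this sketch, "word" and "words" share four of seven distinct trigrams.
The trigram set of a long document grows with the amount of distinct text in
it, which is presumably how an index row built from such a set can exceed the
page limit quoted in the error above.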

If you want to search for similar documents, I can recommend having a look at
the fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
close to trigrams and the similarity metric is the same, but it uses a different
signature calculation. And there are some tips and tricks: removing HTML
markup, removing punctuation, lowercasing the text and so on. It's an
interesting and complex task.
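
A loose sketch of the fingerprint idea in Python, under stated assumptions: it
is not the algorithm from the paper, only the general shape of one. The window
length, the use of CRC32 as the hash, the "keep hashes divisible by modulus"
selection rule and all function names are choices made for this example. The
normalization step covers the tips mentioned above (HTML markup, punctuation,
lowercasing) in a deliberately crude way.

```python
import re
import zlib


def normalize(text):
    """Crude cleanup: strip HTML tags and punctuation, lowercase, squeeze spaces."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop anything that looks like a tag
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    return " ".join(text.lower().split())


def fingerprints(text, window=8, modulus=4):
    """Hash every window-length substring and keep only a subset of the hashes."""
    norm = normalize(text)
    selected = set()
    for i in range(len(norm) - window + 1):
        h = zlib.crc32(norm[i:i + window].encode("utf-8"))
        if h % modulus == 0:  # keeps roughly 1/modulus of the hashes
            selected.add(h)
    return selected


def resemblance(a, b):
    """Same metric as for trigrams: shared fingerprints over all fingerprints."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only a fraction of the hashes is kept, the signature of a document is
much smaller than its full set of substrings, at the cost of a less precise
similarity estimate.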
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
