Re: [HACKERS] Index greater than 8k

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-03 08:54:05
Message-ID: Pine.GSO.4.63.0611031151070.8413@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Gregory,

All you described you do with fti is possible with tsearch2. Just need
some think, of course. If you don't need stemming, just don't use it,
if you need something like %txt%, just write simple dictionary, which
produce any substrings from input word.

Oleg
On Tue, 31 Oct 2006, Gregory S. Williamson wrote:

> I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scale well for what you need etc., etc.
>
> But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't follow stemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal grouings (about 90 total).
>
> HTH (but doubt it for reasons that undoubtedly be made clear ;-)
>
> Greg Williamson
> DBA
> GlobeXplorer LLC
>
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org on behalf of Joshua D. Drake
> Sent: Tue 10/31/2006 7:46 PM
> To: Teodor Sigaev
> Cc: Darcy Buskermolen; PgSQL General; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] Index greater than 8k
>
> Teodor Sigaev wrote:
>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>>> sent a self contained test case directly to Teodor which shows the
>>> error.
>>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>> pg_trgm is designed to find similar words and use technique known as
>> trigrams. This will work good on small pieces of text such as words or
>> set expression. But all big texts (on the same language) will be similar
>> :(. So, I didn't take care about guarantee that index tuple's size
>> limitation. In principle, it's possible to modify pg_trgm to have such
>> guarantee, but index becomes lossy - all tuples gotten from index
>> should be checked by table's tuple evaluation.
>
> We are trying to get something faster than ~ '%foo%';
>
> Which Tsearch2 does not give us :)
>
> Joshua D. Drake
>
>
>
>>
>> If you want to search similar documents I can recommend to have a look
>> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
>> pretty close to trigrams and metrics of similarity is the same, but uses
>> another signature calculations. And, there are some tips and trics:
>> removing HTML marking,removing punctuation, lowercasing text and so on -
>> it's interesting and complex task.
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter 2006-11-03 09:51:46 Re: Weird double single quote issue
Previous Message Albe Laurenz 2006-11-03 08:26:10 Re: Does anyone have a compiled dblink_tds for Redhat EL 4.x?

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2006-11-03 09:38:32 Re: Design Considerations for New Authentication Methods
Previous Message Simon Riggs 2006-11-03 08:06:55 Re: "recovering prepared transaction" after serverrestart message