Re: Why not keeping positions in GIN?

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Hitoshi Harada <hitoshi_harada(at)forcia(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why not keeping positions in GIN?
Date: 2007-05-28 13:30:44
Message-ID: Pine.LNX.4.64.0705281722520.12152@sn.sai.msu.ru
Lists: pgsql-hackers

Hitoshi,

There is no problem writing an n-gram dictionary for tsearch2! The problem
is how to define word boundaries.

Oleg

On Sat, 26 May 2007, Hitoshi Harada wrote:

>> FYI, Tatsuo uses tsearch2 for indexing Japanese documents. But I agree,
>> an n-gram index would be more universal for Asian languages.
> Yeah, I know, but in the tsearch2 sample for Japanese you must use external
> morphological analysis libraries to separate words. That is powerful, but I
> need a more "lightweight" approach. Also, especially when you search
> non-document data (such as titles, names, or patterns in a genome), the
> approach above is not so useful.
>
> As I mentioned, GIN is also powerful for array data type search, so I
> very much hope it will carry this additional information.
>
> Anyway, thanks a lot for all the information. I will try to read it.
>
> Regards,
>
> Hitoshi Harada
>
>> -----Original Message-----
>> From: Oleg Bartunov [mailto:oleg(at)sai(dot)msu(dot)su]
>> Sent: Saturday, May 26, 2007 10:12 PM
>> To: Hitoshi Harada
>> Cc: pgsql-hackers(at)postgresql(dot)org
>> Subject: Re: [HACKERS] Why not keeping positions in GIN?
>>
>> On Fri, 25 May 2007, Hitoshi Harada wrote:
>>
>>> Hi,
>>>
>>> I was walking through the GIN AM source code these days, and found that
>>> it has only posting lists, with no positions associated with them.
>>>
>>> The reason I was doing that is to try to implement an n-gram text search
>>> index on GIN myself. As you know, Japanese is not like English or other
>>> European languages. If you index Japanese (or other 'not separated') text
>>> by n-gram, the index should keep entry positions as well as the posting
>>> lists, because you must know whether the split query keys are adjacent to
>>> each other in the data. To know this, the positions must be there.
>>
>> FYI, Tatsuo uses tsearch2 for indexing Japanese documents. But I agree,
>> an n-gram index would be more universal for Asian languages.
>>
>>>
>>> It's not only about Japanese. When you search for a "phrase" in English
>>> text, the same logic is needed. I haven't looked into tsearch2; is there
>>> any problem with it? Also, in some cases an int-array inverted index needs
>>> the entry positions as well, I guess. Keeping positions with the posting
>>> lists is "general" enough for GIN, isn't it?
>>>
>>> Is there any future plan around it?
>>
>> Yes, we do have plans. See our todo,
>> http://www.sai.msu.su/~megera/wiki/todo
>> You may also read FTSBOOK, http://www.sai.msu.su/~megera/postgres/fts/doc
>> and slides from PGCon2007,
>> http://www.sai.msu.su/~megera/postgres/talks/fts-pgcon2007.pdf
>>>
>>>
>>> Regards,
>>>
>>> Hitoshi Harada
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
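
For readers following the thread, here is a minimal sketch of the point made in the quoted message: with n-grams, posting lists alone can only say that every query gram occurs somewhere in a document, while positions are needed to confirm the grams are adjacent. It is written in Python purely for illustration. The names `ngrams`, `build_index`, `candidates` and `phrase_match` are hypothetical and do not correspond to anything in GIN or tsearch2, and the in-memory dictionary is an assumption, not how GIN stores entries.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Yield (position, gram) for every n-character window in text."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]


def build_index(docs, n=2):
    """Map gram -> {doc_id: [positions]}.

    Keeping only the doc_id keys would give a plain posting list;
    the position lists are the extra information discussed above.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in ngrams(text, n):
            index[gram][doc_id].append(pos)
    return index


def candidates(index, query, n=2):
    """Docs containing every query gram, ignoring positions."""
    grams = [g for _, g in ngrams(query, n)]
    if not grams:
        return set()
    return set.intersection(*(set(index.get(g, {})) for g in grams))


def phrase_match(index, query, n=2):
    """Docs where the query grams occur at consecutive positions."""
    grams = list(ngrams(query, n))
    hits = set()
    for doc_id in candidates(index, query, n):
        # Try each occurrence of the first gram as a possible start.
        for start in index[grams[0][1]][doc_id]:
            if all(start + off in index[g][doc_id] for off, g in grams):
                hits.add(doc_id)
                break
    return hits


docs = {1: "東京都", 2: "京都東京"}
index = build_index(docs)
print(candidates(index, "東京都"))    # both documents contain both bigrams
print(phrase_match(index, "東京都"))  # only document 1 has them adjacent
```

Without the position lists, the second document would have to be fetched and rechecked against the original text to rule it out.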
