Re: Patch: add conversion from pg_wchar to multibyte

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: add conversion from pg_wchar to multibyte
Date: 2012-05-01 22:02:23
Message-ID: CAPpHfdsfg7vcanUBRPJBzPJ5jETVw2sH5LBwpeac=R_C74QTag@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 30, 2012 at 10:07 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Apr 29, 2012 at 8:12 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> > Perhaps I'm too early with these tests, but FWIW I reran my earlier test
> > program against three instances. (the patches compiled fine, and make
> > check was without problem).
>
> These tests results seem to be more about the pg_trgm changes than the
> patch actually on this thread, unless I'm missing something. But the
> executive summary seems to be that pg_trgm might need to be a bit
> smarter about costing the trigram-based search, because when the
> number of trigrams is really big, using the index is
> counterproductive. Hopefully that's not too hard to fix; the basic
> approach seems quite promising.

Right. When the number of trigrams is big, it is slow to scan the posting
lists of all of them. The solution in this case is to exclude the most
frequent trigrams from the index scan. But that requires some kind of
statistics on trigram frequencies, which we don't have. We could estimate
frequencies using some hard-coded assumptions about natural languages. Or we
could exclude arbitrary trigrams. But I don't like either of these ideas.
This problem is also relevant for LIKE/ILIKE search using trigram indexes.
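
As a rough illustration of the frequency-based exclusion described above, here
is a minimal Python sketch. It is not pg_trgm code. The function name
choose_trigrams_to_scan, the frequency table, and the max_fraction threshold
are all hypothetical, and the sketch assumes a query in which every trigram
must be present, so that dropping a trigram from the scan only widens the
candidate set and a recheck restores correctness.

```python
def choose_trigrams_to_scan(query_trigrams, frequency, max_fraction=0.1):
    """Pick the trigrams whose posting lists are worth scanning.

    query_trigrams: iterable of trigrams that must all match (assumption).
    frequency: hypothetical statistics, trigram -> fraction of indexed rows
               containing it. Trigrams missing from the table count as rare.
    max_fraction: hypothetical cutoff above which a trigram is "too frequent".
    """
    trigrams = set(query_trigrams)
    if not trigrams:
        return set()

    # Keep only trigrams at or below the cutoff; the rest are left to recheck.
    rare = {t for t in trigrams if frequency.get(t, 0.0) <= max_fraction}
    if rare:
        return rare

    # Every trigram is frequent: scan just the least frequent one
    # (ties broken alphabetically so the result is deterministic).
    return {min(trigrams, key=lambda t: (frequency.get(t, 0.0), t))}


# Made-up frequencies, for illustration only.
freq = {"the": 0.40, "ing": 0.25, "xyz": 0.001, "qzx": 0.0005}
print(sorted(choose_trigrams_to_scan({"the", "ing", "xyz", "qzx"}, freq)))
# prints ['qzx', 'xyz']
```

The hard part raised in the message is not this selection step but obtaining
the frequency table in the first place, which the sketch simply takes as given.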

Something similar could occur in tsearch when we search for "frequent_term
& rare_term". In some situations (depending on the term frequencies) it's
better to exclude frequent_term from the index scan and do a recheck. We have
the relevant statistics to make such a decision, but it doesn't seem feasible
to get them through the current GIN interface.
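
The trade-off for "frequent_term & rare_term" can be sketched as a comparison
of two plans. This is a toy model, not PostgreSQL's cost model: the function
name exclude_frequent_term and the per-entry scan_cost and per-row
recheck_cost constants are hypothetical, and the document counts are assumed
to come from statistics that, as noted above, the current GIN interface does
not make available.

```python
def exclude_frequent_term(n_frequent, n_rare, scan_cost=1.0, recheck_cost=4.0):
    """Return True if scanning only rare_term and rechecking looks cheaper.

    n_frequent, n_rare: estimated posting-list lengths of the two terms.
    scan_cost: hypothetical cost of reading one posting-list entry.
    recheck_cost: hypothetical cost of rechecking one candidate row.
    """
    # Plan A: read both posting lists and intersect them; no recheck needed.
    cost_scan_both = scan_cost * (n_frequent + n_rare)

    # Plan B: read only the rare term's list, then recheck every candidate.
    cost_rare_then_recheck = scan_cost * n_rare + recheck_cost * n_rare

    return cost_rare_then_recheck < cost_scan_both


print(exclude_frequent_term(1_000_000, 100))  # prints True
print(exclude_frequent_term(500, 400))        # prints False
```

With these made-up constants, excluding frequent_term pays off whenever
recheck_cost * n_rare is smaller than scan_cost * n_frequent.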

Do you have any comments on the idea of conversion from pg_wchar to
multibyte? Is it acceptable at all?

------
With best regards,
Alexander Korotkov.
