Re: Patch: add conversion from pg_wchar to multibyte

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: add conversion from pg_wchar to multibyte
Date: 2012-05-02 13:35:03
Message-ID: CAPpHfduGy+ubDZNumLe9b_z_AFmrNYUbfNR-_A1gkqYPwASKWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 2, 2012 at 4:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > Right. When number of trigrams is big, it is slow to scan posting list of
> > all of them. The solution is this case is to exclude most frequent
> trigrams
> > from index scan. But, it require some kind of statistics of trigrams
> > frequencies which we don't have. We could estimate frequencies using some
> > hard-coded assumptions about natural languages. Or we could exclude
> > arbitrary trigrams. But I don't like both these ideas. This problem is
> also
> > relevant for LIKE/ILIKE search using trigram indexes.
>
> I was thinking you could perhaps do it just based on the *number* of
> trigrams, not necessarily their frequency.
>

Imagine we've two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';

The first query require reading posting lists of trigrams "abc" and "bcd".
The second query require reading posting lists of trigrams "abc", "bcd",
"cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use index scan for first query and sequential scan for
second query because number of posting list to read is high. But it is
unreasonable because actually second query is narrower than the first one.
We can use same index scan for it, recheck will remove all false positives.
When number of trigrams is high we can just exclude some of them from index
scan. It would be better than just decide to do sequential scan. But the
question is what trigrams to exclude? Ideally we would leave most rare
trigrams to make index scan cheaper.

> > Probably you have some comments on idea of conversion from pg_wchar to
> > multibyte? Is it acceptable at all?
>
> Well, I'm not an expert on encodings, but it seems like a logical
> extension of what we're doing right now, so I don't really see why
> not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
> though. Presumably either the old code is right (in which case, don't
> change it) or the new code is right (in which case, there's a bug fix
> needed here that ought to be discussed and committed separately from
> the rest of the patch). Maybe I am missing something.

Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
I just did proposal about how it could be. I hope somebody more familiar
with this code would clarify this situation.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-05-02 13:48:33 Re: Patch: add conversion from pg_wchar to multibyte
Previous Message Peter Geoghegan 2012-05-02 13:33:56 Re: Have we out-grown Flex?