Re: Full Text Search ideas

From: Steve Grey <stevegrey78(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Full Text Search ideas
Date: 2010-07-19 09:20:45
Message-ID: AANLkTimNSKpZFkXs3z5GXSlzT887MuB4fYLs0uGjRAYL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 19 July 2010 01:46, Howard Rogers <hjr(at)diznix(dot)com> wrote:

> On Mon, Jul 19, 2010 at 6:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Howard Rogers <hjr(at)diznix(dot)com> writes:
> >> ims=# select count(*) from search_rm
> >> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat &
> sb12n');
> >> count
> >> -------
> >> 3849
> >> (1 row)
> >
> >> Time: 408.962 ms
> >
> >> ims=# select count(*) from search_rm
> >> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat &
> !sb12y');
> >> count
> >> -------
> >> 3849
> >> (1 row)
> >
> >> Time: 11.533 ms
> >
> > Yeah, I imagine that the first of these will involve examining all the
> > index entries for sb12n. There's not a lot of smarts about that inside
> > the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs
> > for both terms and then AND them.
> >
> >> I'm wondering firstly if there's any way I can configure PostgreSQL
> >> FTS so that it produces the sort of results we've gotten used to from
> >> Oracle, i.e., where search speeds do not go up wildly when a 'search
> >> term' is applied that happens to be used by the vast majority of
> >> document records.
> >
> > If you're willing to split out the search terms that are like this,
> > you could probably get better results with something like
> >
> > select count(*) from search_rm
> > where to_tsvector('english', textsearch) @@ to_tsquery('english','bat')
> AND
> > to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');
> >
> > That will put it on the optimizer's head as to whether to use the index
> > for one term or both terms.
> >
> > It might be worth noting that the optimizer will of course not get this
> > right unless it has decent statistics about both search terms --- and
> > there is an as-yet-unreleased patch about tsvector stats gathering:
> > http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
> > I am not sure that the situation addressed by that patch applies in
> > your case, but it might.
> >
> > regards, tom lane
> >
>
> Thanks, Tom.
>
> The breaking out into separate search terms does make a difference,
> but not much:
>
> ims=# select count(*) from search_rm
> where to_tsvector('english',textsearch) @@ to_tsquery('english','bat &
> sb12n');
> count
> -------
> 3849
> (1 row)
>
> Time: 413.329 ms
>
> ims=# select count(*) from search_rm
> ims-# where to_tsvector('english',textsearch) @@
> to_tsquery('english','bat') AND
> ims-# to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
> count
> -------
> 3849
> (1 row)
>
> Time: 352.583 ms
>
> So it's shaving about a sixth of the time off, which isn't bad, but
> not spectacularly good either!
>
> I'd also thought of trying something like this:
>
> ims=# select count(*) from
> (
> select * from search_rm where
> to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
> ) as core
> where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
>
> count
> -------
> 3849
> (1 row)
>
> Time: 357.248 ms
>
> ...in the hope that the sb12n test would only be applied to the set of
> 'bat' records acquired by the inner query. But as you can tell from
> the time, that's not particularly better or worse than your suggestion
> (bearing mind that 'bat' on its own is a 12ms search).
>
> I'm currently constructing a separate column containing a single
> bitmask value for about 15 of the 45 attributes, just to see if
> evaluating the bits with a bitand test for the bat records is faster
> than trying to FTS them in the first place. Something like
>
> select count (*) from
> (
> select * from search_rm where
> to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
> ) as core
> where bitand(searchbits,4096)>0;
>
> But it's taking a while to get that extra column constructed in the
> original table!
>
> Fingers crossed, because if not, it's all a bit of a show-stopper for
> our migration effort, I think. :-(
>
> Regards & thanks
> HJR
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Hi Howard,

As well as trying the bitand test, have you tried a plain %like% or a regex
match/extraction on the results of performing the fts search purely on the
search terms? I'm guessing here it would involve more calculation in the
search than the bitand approach, but might require less maintenance and,
along the direction you are already heading, place more emphasis on the
refinement of candidate matches rather than the retrieval of better matches
in the first instance, and perhaps more so for non-exhaustive searching.

Regards,

Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message A. Kretschmer 2010-07-19 09:24:43 Re: Rescue data after power off
Previous Message tuanhoanganh 2010-07-19 09:15:59 Rescue data after power off