Re: Updated tsearch documentation

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Updated tsearch documentation
Date: 2007-06-20 22:19:44
Message-ID: 200706202219.l5KMJiH05570@momjian.us
Lists: pgsql-advocacy pgsql-hackers

Oleg Bartunov wrote:
> On Wed, 20 Jun 2007, Bruce Momjian wrote:
> >> Comments to editorial work of Bruce Momjian.
> >>
> >> fulltext-intro.sgml:
> >>
> >> it is useful to have a predefined list of lexemes.
> >>
> >> Bruce, here should be list of types of lexemes !
> >
> > Agreed. Is the list of lexemes parser-specific?
> >
>
> yes, it is the parser which defines the types of lexemes.

OK, how will users get a list of supported lexemes? Do we need a list
per supported parser?

> >> fulltext-opfunc.sgml:
> >>
> >> All of the following functions that accept a configuration argument can
> >> use either an integer <!-- why an integer --> or a textual configuration
> >> name to select a configuration.
> >>
> >> originally it was integer id, probably better use <type>oid</type>
> >
> > Uh, my question is why are you allowing specification as an integer/oid
> > when the name works just fine. I don't see the value in allowing
> > numbers here.
>
> For compatibility reasons. Hmm, indeed, I don't recall where OIDs could be
> important.

Well, if neither of us sees a reason for it, let's remove it. We don't
need to support a feature that has no usefulness.

> >> This returns the query used for searching an index. It can be used to test
> >> for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
> >> <!-- lowercase? --> which corresponds to an empty query since GIN indexes
> >> do not support negate queries (a full index scan is inefficient):
> >>
> >>> capital case. This looks cumbersome, probably querytree() should
> >>> just return NULL.
> >
> > Agreed.
> >
> >> The integer option controls several behaviors which is done using bit-wise
> >> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
> >> <!-- why so complex? -->
> >>
> >>> to avoid 2 arguments
> >
> > But I don't see why you would want to set two of those values --- they
> > seem mutually exclusive, e.g.
> >
> > 1 divides the rank by the 1 + logarithm of the document length
> > 2 divides the rank by the length itself
> >
> > I assume you do either one, not both.
>
> But what about the other variants?

OK, here is the full list:

0 (the default) ignores document length
1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself
4 divides the rank by the mean harmonic distance between extents
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + logarithm of the number of unique words in document

So which of these would be enabled together?
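
To make the question concrete, here is a minimal Python sketch of how such
a bit-wise option would behave if every set bit simply applied its divisor.
The flag values are taken from the list above; the constant names, the
normalize() function, and its parameters are hypothetical, and the real
ranking code may combine the options differently.

```python
import math

# Flag values copied from the list above; the names are hypothetical labels.
NORM_LOG_LENGTH = 1         # divide by 1 + log(document length)
NORM_LENGTH = 2             # divide by document length
NORM_HARMONIC_DISTANCE = 4  # divide by mean harmonic distance between extents
NORM_UNIQUE_WORDS = 8       # divide by number of unique words
NORM_LOG_UNIQUE_WORDS = 16  # divide by 1 + log(number of unique words)


def normalize(rank, flags, doc_length, unique_words, mean_harmonic_distance):
    """Apply every divisor whose bit is set in flags; flags == 0 changes nothing.

    Assumption: set bits are applied independently, one after another.
    """
    if flags & NORM_LOG_LENGTH:
        rank /= 1 + math.log(doc_length)
    if flags & NORM_LENGTH:
        rank /= doc_length
    if flags & NORM_HARMONIC_DISTANCE:
        rank /= mean_harmonic_distance
    if flags & NORM_UNIQUE_WORDS:
        rank /= unique_words
    if flags & NORM_LOG_UNIQUE_WORDS:
        rank /= 1 + math.log(unique_words)
    return rank
```

Under that assumption, 2|4 divides by the length and then by the harmonic
distance, which is a plausible pairing. Setting 1|2 would divide by both
length-based terms, which is the combination that looks redundant.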

>
> What I missed is the definition of extent.
>
> From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
> An extent is a shortest, non-nested sequence of words that satisfies a query.

I don't understand how that relates to this.
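
For reference, the quoted definition can be read as follows, assuming the
simplest case of a query whose terms must all be present. This is only a
sketch of the definition, not the ranking code: the extents() function is
hypothetical, and real queries with OR or NOT would need more than this.

```python
def extents(words, query_terms):
    """Return (start, end) index pairs of the minimal windows in words
    that contain every query term (assumes a plain AND query)."""
    needed = set(query_terms)
    last_seen = {}  # most recent position of each query term seen so far
    result = []
    for end, word in enumerate(words):
        if word not in needed:
            continue
        last_seen[word] = end
        if len(last_seen) == len(needed):
            # Shortest window ending at this position.
            start = min(last_seen.values())
            # Keep it only if it does not contain the previous window,
            # so that no reported extent is nested inside another.
            if not result or result[-1][0] < start:
                result.append((start, end))
    return result
```

With the words "a b x a y b a" and the query terms a and b, each reported
window is as short as possible and contains no other reported window. Option
4 in the list above divides the rank by the mean harmonic distance between
such extents.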

> >
> >> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
> >> if none is specified then the current configuration is used.
> >>
> >>> I don't understand this question
> >
> > Same issue as above --- why allow a number here when the name works just
> > fine. We don't allow tables to be specified by number, so why
> > configurations?
> >
> >> <para>
> >> <!-- why? -->
> >> Note that the cascade dropping of the <function>headline</function> function
> >> cause dropping of the <literal>parser</literal> used in fulltext configuration
> >> <replaceable>tsname</replaceable>.
> >> </para>
> >>
> >>> hmm, probably it should be reversed - cascade dropping of the parser causes
> >>> dropping of the headline function.
> >
> > Agreed.
> >
> >>
> >> In example below, <literal>fulltext_idx</literal> is
> >> a GIN index:<!-- why isn't this automatic -->
> >>
> >>> It's explained above. The problem is that current index api doesn't allow
> >>> to say if search was lossy or exact, so to preserve performance of
> >>> GIN index we had to introduce @@@ operator, which is the same as @@, but
> >>> lossy.
> >
> > Well, then we have to fix the API. Telling users to use a different
> > operator based on what index is defined is just bad style.
>
> This was raised by Heikki and we discussed it a bit in Ottawa, but it's
> unclear if it's doable for 8.3. @@@ operator is in rare use, so we could
> say it will be improved in future versions.

Uh, I am wondering if we just have to force heap access in all cases
until it is fixed.

> >> only the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
> >> definition like ' one 1:11' will not work since lexeme type
> >> <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
> >> <!-- what do these numbers mean? -->
> >> </para>
> >
> > OK, I changed it to be clearer.
> >
> >>> nothing special, just numbers for example.
> >>
> >> <function>ts_debug</> displays information about every token of
> >> <replaceable class="PARAMETER">document</replaceable> as produced by the
> >> parser and processed by the configured dictionaries using the configuration
> >> specified by <replaceable class="PARAMETER">cfgname</replaceable> or
> >> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid
> >>
> >>> don't understand this comment. ts_debug accepts cfgname or its oid
> >
> > Again, no need for oid.
>
> We need to decide whether we need OIDs as a user-visible argument. I don't see
> any value, though Teodor probably thinks otherwise.

This is a good time to clean up the API because there are going to be
user-visible changes anyway.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +