Re: Updated tsearch documentation

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Updated tsearch documentation
Date: 2007-06-20 20:44:53
Message-ID: Pine.LNX.4.64.0706210029410.1881@sn.sai.msu.ru
Lists: pgsql-advocacy pgsql-hackers

On Wed, 20 Jun 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Sun, 17 Jun 2007, Bruce Momjian wrote:
>>
>>> I have completed my first pass over the tsearch documentation:
>>>
>>> http://momjian.us/expire/fulltext/HTML/sql.html
>>>
>>> They are from section 14 and following.
>>>
>>> I have come up with a number of questions that I placed in SGML comments
>>> in these files:
>>>
>>> http://momjian.us/expire/fulltext/SGML/
>>>
>>> Teodor/Oleg, let me know when you want to go over my questions.
>>
>> Below are my answers (marked as )
>
> OK.
>>
>> Comments to editorial work of Bruce Momjian.
>>
>> fulltext-intro.sgml:
>>
>> it is useful to have a predefined list of lexemes.
>>
>> Bruce, there should be a list of lexeme types here!
>
> Agreed. Is the list of lexemes parser-specific?
>

Yes, it is the parser that defines the types of lexemes.

>> fulltext-opfunc.sgml:
>>
>> All of the following functions that accept a configuration argument can
>> use either an integer <!-- why an integer --> or a textual configuration
>> name to select a configuration.
>>
>> Originally it was an integer id; it is probably better to use <type>oid</type>.
>
> Uh, my question is why are you allowing specification as an integer/oid
> when the name works just fine. I don't see the value in allowing
> numbers here.

For compatibility reasons. Hmm, indeed, I don't recall where OIDs could be
important.

>
>> This returns the query used for searching an index. It can be used to test
>> for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
>> <!-- lowercase? --> which corresponds to an empty query since GIN indexes
>> do not support negate queries (a full index scan is inefficient):
>>
>>> capital case. This looks cumbersome, probably querytree() should
>>> just return NULL.
>
> Agreed.
>
>> The integer option controls several behaviors which is done using bit-wise
>> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
>> <!-- why so complex? -->
>>
>>> to avoid 2 arguments
>
> But I don't see why you would want to set two of those values --- they
> seem mutually exclusive, e.g.
>
> 1 divides the rank by the 1 + logarithm of the document length
> 2 divides the rank by the length itself
>
> I assume you do either one, not both.

But what about the other variants?
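
To make the bit-wise option concrete, here is a minimal Python sketch, not the tsearch implementation. It models only the two flags quoted above (1 and 2); the constant names, the function name, and the choice of the natural logarithm are assumptions made for illustration. It shows that combining flags with | simply applies each selected normalization in turn, which is what a single integer argument buys over two separate arguments.

```python
import math

# Hypothetical names for the two flag values quoted in the thread.
NORM_LOG_LENGTH = 1  # divide the rank by 1 + log(document length)
NORM_LENGTH = 2      # divide the rank by the document length itself


def normalize_rank(rank, doc_length, flags):
    """Apply every normalization whose bit is set in `flags`.

    Assumes doc_length > 0 and uses the natural logarithm; the thread
    does not say which base the real code uses.
    """
    if flags & NORM_LOG_LENGTH:
        rank /= 1.0 + math.log(doc_length)
    if flags & NORM_LENGTH:
        rank /= doc_length
    return rank


if __name__ == "__main__":
    # Either flag alone, or both combined with the | operator.
    print(normalize_rank(8.0, 4, NORM_LENGTH))
    print(normalize_rank(8.0, 4, NORM_LOG_LENGTH | NORM_LENGTH))
```

With only these two flags the combination looks redundant, which is Bruce's point; the bit-field design pays off only if some of the other variants are meaningful together.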

What is missing is the definition of an extent.

From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking:
An extent is the shortest non-nested sequence of words that satisfies a query.
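
As an illustration of that definition, the following Python sketch finds extents for the simplest case, a query that requires all of its terms (a plain AND). The function name and the token-list input are hypothetical, and this is not the ranking code itself, which also has to deal with other query operators.

```python
def find_extents(tokens, query_terms):
    """Return (start, end) index pairs of the shortest, non-nested
    windows of `tokens` that contain every term in `query_terms`."""
    terms = set(query_terms)
    last_seen = {}      # term -> most recent position where it occurred
    extents = []
    prev_start = None
    for end, tok in enumerate(tokens):
        if tok not in terms:
            continue
        last_seen[tok] = end
        if len(last_seen) < len(terms):
            continue    # some query term has not appeared yet
        # Latest possible start such that the window still covers all terms.
        start = min(last_seen.values())
        # The same start with a later end would contain the earlier window,
        # so keep only the first window found for each start.
        if start != prev_start:
            extents.append((start, end))
            prev_start = start
    return extents


if __name__ == "__main__":
    words = "a b x a y b a".split()
    print(find_extents(words, ["a", "b"]))
```

Each returned pair is a window that cannot be shortened from either side without losing a query term, so no extent contains another.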

>
>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
>> if none is specified that the current configuration is used.
>>
>>> I don't understand this question
>
> Same issue as above --- why allow a number here when the name works just
> fine. We don't allow tables to be specified by number, so why
> configurations?
>
>> <para>
>> <!-- why? -->
>> Note that the cascade dropping of the <function>headline</function> function
>> cause dropping of the <literal>parser</literal> used in fulltext configuration
>> <replaceable>tsname</replaceable>.
>> </para>
>>
>>> Hmm, it should probably be reversed: cascade dropping of the parser causes
>>> dropping of the headline function.
>
> Agreed.
>
>>
>> In example below, <literal>fulltext_idx</literal> is
>> a GIN index:<!-- why isn't this automatic -->
>>
>>> It's explained above. The problem is that the current index API doesn't
>>> let an index say whether a search was lossy or exact, so to preserve the
>>> performance of the GIN index we had to introduce the @@@ operator, which
>>> is the same as @@, but lossy.
>
> Well, then we have to fix the API. Telling users to use a different
> operator based on what index is defined is just bad style.

This was raised by Heikki and we discussed it a bit in Ottawa, but it's
unclear whether it's doable for 8.3. The @@@ operator is rarely used, so we
could say it will be improved in a future version.

>
>> only the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
>> definition like ' one 1:11' will not work since lexeme type
>> <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
>> <!-- what do these numbers mean? -->
>> </para>
>
> OK, I changed it to be clearer.
>
>>> nothing special, just numbers for example.
>>
>> <function>ts_debug</> displays information about every token of
>> <replaceable class="PARAMETER">document</replaceable> as produced by the
>> parser and processed by the configured dictionaries using the configuration
>> specified by <replaceable class="PARAMETER">cfgname</replaceable> or
>> <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid
>>
>>> don't understand this comment. ts_debug accepts cfgname or its oid
>
> Again, no need for oid.

We need to decide whether we need OIDs as a user-visible argument. I don't
see any value in it; Teodor probably thinks otherwise.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
