Re: tsearch patch and namespace pollution

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: tsearch patch and namespace pollution
Date: 2007-08-17 02:18:50
Message-ID: 200708170218.l7H2Io715461@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> I find the following additions to pg_proc in the current tsearch2 patch:

It seems a lot of these are useless and just bloat. I will mark a few:

> proc | prorettype
> ------------------------------------------+------------
> pg_ts_parser_is_visible(oid) | boolean
> pg_ts_dict_is_visible(oid) | boolean
> pg_ts_template_is_visible(oid) | boolean
> pg_ts_config_is_visible(oid) | boolean

Why would anyone look these up via OID rather than name?

> tsvectorin(cstring) | tsvector
> tsvectorout(tsvector) | cstring
> tsvectorsend(tsvector) | bytea
> tsqueryin(cstring) | tsquery
> tsqueryout(tsquery) | cstring
> tsquerysend(tsquery) | bytea
> gtsvectorin(cstring) | gtsvector
> gtsvectorout(gtsvector) | cstring
> tsvector_lt(tsvector,tsvector) | boolean
> tsvector_le(tsvector,tsvector) | boolean
> tsvector_eq(tsvector,tsvector) | boolean
> tsvector_ne(tsvector,tsvector) | boolean
> tsvector_ge(tsvector,tsvector) | boolean
> tsvector_gt(tsvector,tsvector) | boolean
> tsvector_cmp(tsvector,tsvector) | integer
> length(tsvector) | integer
> strip(tsvector) | tsvector
> setweight(tsvector,"char") | tsvector
> tsvector_concat(tsvector,tsvector) | tsvector
> vq_exec(tsvector,tsquery) | boolean
> qv_exec(tsquery,tsvector) | boolean
> tt_exec(text,text) | boolean
> ct_exec(character varying,text) | boolean
> tq_exec(text,tsquery) | boolean
> cq_exec(character varying,tsquery) | boolean
> tsquery_lt(tsquery,tsquery) | boolean
> tsquery_le(tsquery,tsquery) | boolean
> tsquery_eq(tsquery,tsquery) | boolean
> tsquery_ne(tsquery,tsquery) | boolean
> tsquery_ge(tsquery,tsquery) | boolean
> tsquery_gt(tsquery,tsquery) | boolean
> tsquery_cmp(tsquery,tsquery) | integer
> tsquery_and(tsquery,tsquery) | tsquery
> tsquery_or(tsquery,tsquery) | tsquery
> tsquery_not(tsquery) | tsquery
> tsq_mcontains(tsquery,tsquery) | boolean
> tsq_mcontained(tsquery,tsquery) | boolean
> numnode(tsquery) | integer
> querytree(tsquery) | text
> rewrite(tsquery,tsquery,tsquery) | tsquery
> rewrite(tsquery,text) | tsquery
> rewrite_accum(tsquery,tsquery[]) | tsquery
> rewrite_finish(tsquery) | tsquery
> rewrite(tsquery[]) | tsquery
> stat(text) | record
> stat(text,text) | record

> rank(real[],tsvector,tsquery,integer) | real
> rank(real[],tsvector,tsquery) | real
> rank(tsvector,tsquery,integer) | real
> rank(tsvector,tsquery) | real
> rank_cd(real[],tsvector,tsquery,integer) | real
> rank_cd(real[],tsvector,tsquery) | real
> rank_cd(tsvector,tsquery,integer) | real
> rank_cd(tsvector,tsquery) | real

Do we realy need this many ranking functions?

> token_type(oid) | record

Again, why by OID?

> token_type(text) | record
> parse(oid,text) | record
> parse(text,text) | record
> lexize(oid,text) | text[]
> lexize(text,text) | text[]
> headline(oid,text,tsquery,text) | text
> headline(oid,text,tsquery) | text
> headline(text,text,tsquery,text) | text
> headline(text,text,tsquery) | text
> headline(text,tsquery,text) | text
> headline(text,tsquery) | text
> to_tsvector(oid,text) | tsvector
> to_tsvector(text,text) | tsvector
> to_tsquery(oid,text) | tsquery

Why OID again for the configuration? I just don't see the use case and
it is bloat and causes confusion.

> to_tsquery(text,text) | tsquery
> plainto_tsquery(oid,text) | tsquery
> plainto_tsquery(text,text) | tsquery

Again, OID. I asked Oleg about this and he said:

> Bruce, just remove oid argument specification from documentation.

so I think we can go ahead and remove cases where the configuration name
or object is specified by oid. I have already removed them from the
documentation and I though the patch had them removed too, but I guess
not. Admittedly this API has been in flux.

> to_tsvector(text) | tsvector
> to_tsquery(text) | tsquery
> plainto_tsquery(text) | tsquery
> tsvector_update_trigger() | trigger
> get_ts_config_oid(text) | oid
> get_current_ts_config() | oid
> (82 rows)
>
> (This list omits functions with INTERNAL arguments, as those are of
> no particular concern to users.)

Also @@ accepts TEXT @@ TEXT, at least according to the documentation.
Is it clear to anyone which is tsquery and which tsvector? Is this
something we want to support?

> While most of these are probably OK, I'm disturbed by the prospect
> that we are commandeering names as generic as "parse" or "stat"
> with argument types as generic as "text". I think we need to put
> a "ts_" prefix on some of these. Specifically, I find these names
> totally unacceptable without a ts_ prefix:
>
> stat(text) | record
> stat(text,text) | record
>
> token_type(oid) | record
> token_type(text) | record
>
> parse(oid,text) | record
> parse(text,text) | record
>
> lexize(oid,text) | text[]
> lexize(text,text) | text[]
>
> These guys might be all right given that some of their arguments are
> tsvector or tsquery, but it's not completely convincing --- think about
> the case where an argument is given as an undecorated literal string.
> It's also not all that clear that they are related to text searching.
> I'm for putting a ts_ prefix on them too:
>
> rank(real[],tsvector,tsquery,integer) | real
> rank(real[],tsvector,tsquery) | real
> rank(tsvector,tsquery,integer) | real
> rank(tsvector,tsquery) | real
> rank_cd(real[],tsvector,tsquery,integer) | real
> rank_cd(real[],tsvector,tsquery) | real
> rank_cd(tsvector,tsquery,integer) | real
> rank_cd(tsvector,tsquery) | real
>
> rewrite(tsquery,tsquery,tsquery) | tsquery
> rewrite(tsquery,text) | tsquery
> rewrite_accum(tsquery,tsquery[]) | tsquery
> rewrite_finish(tsquery) | tsquery
> rewrite(tsquery[]) | tsquery
>
> headline(oid,text,tsquery,text) | text
> headline(oid,text,tsquery) | text
> headline(text,text,tsquery,text) | text
> headline(text,text,tsquery) | text
> headline(text,tsquery,text) | text
> headline(text,tsquery) | text
>
> These guys are just plain badly named, as it's completely unobvious that
> they have anything to do with tsearch (or what they do at all, actually).
> Furthermore the "varchar" variants seem entirely redundant with the
> "text" ones:
>
> vq_exec(tsvector,tsquery) | boolean
> qv_exec(tsquery,tsvector) | boolean
> tt_exec(text,text) | boolean
> ct_exec(character varying,text) | boolean
> tq_exec(text,tsquery) | boolean
> cq_exec(character varying,tsquery) | boolean
>
> Comments, suggestions?

I would be happy if all text search functions began with 'ts', 'ts_', or
'to_ts', and if we don't clean this up now, it is going to be harder in
the future. I think users can expect some migration for text search in
8.3 as a benefit of getting into core and be dump-able.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2007-08-17 02:21:44 Re: GIT patch
Previous Message Joshua D. Drake 2007-08-17 02:17:24 Re: GIT patch