wparser misbehavior(?) for corner cases with hyphenated words

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: wparser misbehavior(?) for corner cases with hyphenated words
Date: 2007-10-24 00:00:58
Message-ID: 6269.1193184058@sss.pgh.pa.us
Lists: pgsql-hackers

This does not seem right:

regression=# select alias,description,token from ts_debug('foo-8.3beta');
      alias      |             description             |  token
-----------------+-------------------------------------+---------
 numhword        | Hyphenated word, letters and digits | foo-8.3
 hword_asciipart | Hyphenated word part, all ASCII     | foo
 blank           | Space symbols                       | -
 float           | Decimal notation                    | 8.3
 hword_asciipart | Hyphenated word part, all ASCII     | beta
(5 rows)

(Code from just before my last commit behaves the same, modulo names of
token types, so I didn't break it just now.)

Surely, if "beta" is an hword part here, it should have been reported as
part of the overall hword. However, this is all pretty inconsistent,
because if "8.3" had been in the first chunk of text then we'd not have
considered it part of an hword at all:

regression=# select alias,description,token from ts_debug('8.3beta-foo');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 float           | Decimal notation                | 8.3
 asciihword      | Hyphenated word, all ASCII      | beta-foo
 hword_asciipart | Hyphenated word part, all ASCII | beta
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | foo
(5 rows)

regression=# select alias,description,token from ts_debug('beta8.3-foo');
 alias |    description    |    token
-------+-------------------+-------------
 file  | File or path name | beta8.3-foo
(1 row)

regression=# select alias,description,token from ts_debug('foo-beta8.3-foo');
      alias      |               description                |   token
-----------------+------------------------------------------+-----------
 numhword        | Hyphenated word, letters and digits      | foo-beta8
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta8
 blank           | Space symbols                            | .
 uint            | Unsigned integer                         | 3
 blank           | Space symbols                            | -
 asciiword       | Word, all ASCII                          | foo
(8 rows)

I'm of the opinion that in no circumstance should "." be considered part
of an hword: the definition of word should not be allowed to stretch
beyond letters and digits. So I think the second and fourth examples
I showed above are correct. The third (where it concludes it's a
filename) is maybe a bit odd, but in any case it's not an hword so I won't
complain. I think the first example ought to parse as

    asciiword  foo
    blank      -
    float      8.3
    asciiword  beta

(Or maybe the '-' should fold into the float? Don't care much...)

This is all a little bit tricky, since this behavior seems reasonable:

regression=# select alias,description,token from ts_debug('foo-83beta');
      alias      |               description                |   token
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | foo-83beta
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
(4 rows)

regression=# select alias,description,token from ts_debug('83beta-foo');
      alias      |               description                |   token
-----------------+------------------------------------------+------------
 numhword        | Hyphenated word, letters and digits      | 83beta-foo
 hword_numpart   | Hyphenated word part, letters and digits | 83beta
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | foo
(4 rows)

Basically I'm arguing that a string should be considered valid as a
second or subsequent component of an hword if and only if it would be
considered valid as the first component.
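
To make the proposed rule concrete, here is a minimal Python sketch. It is
not the wparser state machine, only an illustration of the symmetry being
argued for: one predicate decides whether a chunk is a word (letters and
digits only, so "." never qualifies), and that same predicate is applied to
every hyphen-separated component regardless of position. The names
is_word_component and is_hword are hypothetical, and the sketch tests a
whole string only; it does not model float, file or partial matches.

```python
def is_word_component(chunk: str) -> bool:
    """True if chunk is non-empty and contains only letters and digits.

    Assumption: Python's str.isalnum() stands in for the parser's own
    notion of "letters and digits"; the real parser may classify some
    characters differently.
    """
    return chunk.isalnum()


def is_hword(text: str) -> bool:
    """True if text is two or more valid components joined by hyphens.

    The same predicate is used for the first and for every later
    component, which is the "if and only if" rule from the paragraph above.
    """
    parts = text.split("-")
    return len(parts) >= 2 and all(is_word_component(p) for p in parts)


if __name__ == "__main__":
    for sample in ("foo-83beta", "83beta-foo", "foo-8.3beta", "beta8.3-foo"):
        print(sample, is_hword(sample))
```

Under this rule 'foo-83beta' and '83beta-foo' both qualify as hwords, while
'foo-8.3beta' and 'beta8.3-foo' do not, because "8.3beta" and "beta8.3"
would not be accepted as a first component either.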

Comments?

regards, tom lane
