
Re: [HACKERS] fulltext parser strange behave

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 13:31:43
Message-ID: 4741903F.50006@dunslane.net
Lists: pgsql-hackers, pgsql-patches

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>  
>>> I've just been looking at the state machine in wparser_def.c. I 
>>> think the processing for entities is also a few bob short in the 
>>> pound. It recognises decimal numeric character references, but not
>>> hexadecimal numeric character references. That's fairly silly since 
>>> the HTML spec specifically says the latter are "particularly 
>>> useful". The rules for named entities are also deficient w.r.t. 
>>> digits, just like the case of tags that Tom noticed. This isn't 
>>> academic: HTML features a number of named entities with digits in 
>>> the name (sup2, frac14 for example).
>>>     
>>
>>  
>>> In XML at least, legal names are defined by the following rules from 
>>> the spec:
>>> ...
>>> [A-Za-z:_][A-Za-z0-9:_.-]*
>>>     
>>
>>  
>>> I suggest we use that or something very close to it as the rule for 
>>> names in these patterns.
>>>     
>>
>> No objections here.  Who wants to patch wparser_def?
>>
>>            
>>   
>
>
> I can get to it some time in the next week; I'm rather snowed under
> right now.
>
> BTW, I'm also suspicious of the clause that allows <?xml ... . It
> appears that it will also allow <?xfoo and <?XFOO, which seems quite
> odd, especially the latter.
>

Here's a patch that fixes the patterns for numeric entities and tag names,
and removes the upper-case 'X' alternative from the special case for an XML
prolog. There are still some oddities, but I decided against making
heroic efforts to fix them. It's probably less important if the patterns
are slightly too liberal (e.g. accepting <a href="qwe<qwe>"> ) than if
they fail to recognize what they are supposed to recognize.
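
As a rough illustration only (this is not the attached patch, which changes the C state machine in wparser_def.c), the name rule quoted above and the two numeric character reference forms can be written as regular expressions. The Python names below (NAME, ENTITY_RE, TAG_NAME_RE, is_entity, tag_name) are made up for this sketch, and accepting both "x" and "X" as the hexadecimal marker is an assumption taken from HTML rather than from the patch.

```python
import re

# Name rule quoted in the thread: [A-Za-z:_][A-Za-z0-9:_.-]*
NAME = r"[A-Za-z:_][A-Za-z0-9:_.-]*"

# Entity forms covered by this sketch (an assumption, not the patch itself):
#   named        &frac14;
#   decimal      &#169;
#   hexadecimal  &#xA9;  (or &#XA9;)
ENTITY_RE = re.compile(r"&(?:%s|#[0-9]+|#[xX][0-9A-Fa-f]+);" % NAME)

# Name at the start of an opening tag, e.g. "h1" in "<h1>".
TAG_NAME_RE = re.compile(r"<(%s)" % NAME)


def is_entity(text):
    """Return True if text is exactly one entity under the rules above."""
    return ENTITY_RE.fullmatch(text) is not None


def tag_name(text):
    """Return the tag name if text starts with '<' plus a legal name, else None."""
    match = TAG_NAME_RE.match(text)
    return match.group(1) if match else None
```

With these rules, names containing digits such as sup2, frac14 or h1 are accepted, while a name that starts with a digit is not.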


cheers

andrew



Attachment: tsfix.patch
Description: text/x-patch (8.7 KB)
