Re: Fix XML handling with DOCTYPE

From:	Chapman Flack <chap(at)anastigmatix(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Ryan Lambert <ryan(at)rustprooflabs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fix XML handling with DOCTYPE
Date:	2019-03-16 21:11:29
Message-ID:	5C8D6681.4070504@anastigmatix.net
Lists:	pgsql-hackers

On 03/16/19 16:55, Tom Lane wrote:
> What do you think of the idea I just posted about parsing off the DOCTYPE
> thing for ourselves, and not letting libxml see it?

The principled way of doing that would be to pre-parse to find a DOCTYPE,
and if there is one, leave it there and parse the input as we do for
'document'. Per XML, if there is a DOCTYPE, the document must satisfy
the 'document' syntax requirements, and per SQL/XML:2006-and-later,
'content' is a proper superset of 'document', so if we were asked for
'content' and can successfully parse it as 'document', we're good,
and if we see a DOCTYPE and yet it incurs a parse error as 'document',
well, that's what needed to happen.

The DOCTYPE can appear arbitrarily far in, but the only things that
can precede it are the XML declaration, whitespace, XML comments, and XML
processing instructions. None of those things nest, so the preceding
stuff makes a regular language, and a regular expression that matches
any amount of that stuff ending in <!DOCTYPE would be enough to detect
that the parse should be shunted off to get 'document' treatment.
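
To make that concrete, here is a rough sketch of such a pattern, written in
Python only for brevity (the backend would of course do this in C). The names
DOCTYPE_PROLOG, has_leading_doctype and effective_parse_mode are made up for
illustration, and the pattern assumes the input has already been decoded to
text. It is meant only to detect the DOCTYPE, not to validate anything; the
real parse is still left to libxml.

```python
import re

# Matches any run of whitespace, comments, and processing instructions
# (the XML declaration has the same shape as a PI) followed by <!DOCTYPE.
DOCTYPE_PROLOG = re.compile(
    r"""
    \A
    \ufeff?                          # optional byte order mark
    (?:
        [ \t\r\n]                    # one XML whitespace character
      | <!--(?:[^-]|-[^-])*-->       # comment: no double hyphen allowed inside
      | <\?(?:[^?]|\?(?!>))*\?>      # XML declaration or processing instruction
    )*
    <!DOCTYPE
    """,
    re.VERBOSE,
)


def has_leading_doctype(text: str) -> bool:
    """Return True if a DOCTYPE follows only prolog material."""
    return DOCTYPE_PROLOG.match(text) is not None


def effective_parse_mode(requested: str, text: str) -> str:
    """Shunt the parse to 'document' when a DOCTYPE is detected."""
    return "document" if has_leading_doctype(text) else requested


if __name__ == "__main__":
    sample = '<?xml version="1.0"?>\n<!-- note -->\n<!DOCTYPE r><r/>'
    print(effective_parse_mode("content", sample))
    print(effective_parse_mode("content", "<a/><b/>"))
```

The comment and PI branches are written so they cannot run past their own
terminators, which keeps a DOCTYPE-looking string after the root element or
inside a comment from producing a false match. Each branch also starts with
different characters, so the matcher should not need to backtrack much on
input that has no DOCTYPE.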

The patch I submitted essentially relies on libxml to do that same
parsing up to that same point and detect the error, so it would add
no upfront cost in the majority of cases that aren't this corner one.

But keeping a little compiled regex around and testing the input with that
would hardly break the bank, either.

Regards,
-Chap

In response to

Re: Fix XML handling with DOCTYPE at 2019-03-16 20:55:38 from Tom Lane

Responses

Re: Fix XML handling with DOCTYPE at 2019-03-16 21:21:12 from Tom Lane
