Fix XML handling with DOCTYPE

From: Ryan Lambert <ryan(at)rustprooflabs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fix XML handling with DOCTYPE
Date: 2019-03-16 20:10:56
Message-ID: CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

I'm investigating the issue I reported here:
https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org

As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML. At this time, XML with a DOCTYPE declaration
is not accepted with this setting even though it is considered valid XML.
I'd like to work on a patch to address this issue and make it work as
advertised.

I traced the source of the error to line ~1500 in
/src/backend/utils/adt/xml.c

res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string +
count, NULL);

It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't
work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE
element makes the XML not well-balanced. From:

http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory

This function returns:

> 0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise

I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed. That function returns:

0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise In case recover is set to 1, the nodelist will not be
> empty even if the parsed chunk is not well balanced, assuming the parsing
> succeeded to some extent.

I haven't tested yet to see if this parses the data w/ DOCTYPE successfully
yet. If it does, I don't think it would be difficult to update the check
on res_code to not fail. I'm making another assumption that there is a
distinct code from libxml to differentiate from other errors, but I
couldn't find those codes quickly. The current check is this:

if (res_code != 0 || xmlerrcxt->err_occurred)

Does this sound reasonable? Have I missed some major aspect? If this is
on the right track I can work on creating a patch to move this forward.

Thanks,

*Ryan Lambert*
RustProof Labs
www.rustprooflabs.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2019-03-16 20:31:18 Re: Fix XML handling with DOCTYPE
Previous Message Heikki Linnakangas 2019-03-16 20:05:08 Re: Making all nbtree entries unique by having heap TIDs participate in comparisons