Re: Another issue with invalid XML values

From: Noah Misch <noah(at)leadboat(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Another issue with invalid XML values
Date: 2011-06-20 20:45:48
Message-ID: 20110620204548.GC17037@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 20, 2011 at 09:15:51PM +0200, Florian Pflug wrote:
> On Jun20, 2011, at 19:57 , Noah Misch wrote:
> > I tested this patch using two systems, a CentOS 5 box with
> > libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2
> > 2.6.31.dfsg-2ubuntu1.6. Both failed to build with this error:
> >
> > xml.c: In function `pg_xml_init':
> > xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function)
> > xml.c:934: error: (Each undeclared identifier is reported only once
> > xml.c:934: error: for each function it appears in.)
> > make[4]: *** [xml.o] Error 1
> >
> > It seems `xmlStructuredErrorContext' was added rather recently. It's not
> > obvious which version added it, but 2.7.8 has it and 2.7.2 does not. Looking
> > at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for
> > both structured and generic handler functions. Based on that, I tried with a
> > change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our
> > xml.c. I ran the test suite and hit some failures in the xml test; see
> > attached regression.diffs. I received warnings where the tests expected
> > nothing or expected errors.
>
> Great :-(. I wonder if maybe I should simply remove the part of the patch
> which try to restore the error handler and context in pg_xml_done(). I've
> added that because I feared that some 3rd-party libraries setup their libxml
> error handle just once, not before every library class. The code previously
> didn't care about this either, but since we're using structured instead of
> generic error reporting now, we'd now collide with 3-rd party libs which
> also use structured error handling, whereas previously that would have worked.
> OTOH, once there's more than one such library...
>
> Whats your stand on this?

Assuming it's sufficient to save/restore xmlStructuredErrorContext when it's
available and xmlGenericErrorContext otherwise, I would just add an Autoconf
check to identify which one to use.

> > I couldn't reconcile this description with the code. I do see that
> > xpath_internal() uses XML_PURPOSE_OTHER, but so does xmlelement(). Why is
> > xmlelement() also ready for the strictest error reporting?
>
> xmlelement() doesn't use libxml's parser, only the xml writer. I didn't
> want to suppress any errors the writer might raise (I'm not sure it raises
> error at all, though).

Makes sense.

> > Also note that we may be able to be more strict when xmloption = document.
>
> Yeah, probably. I think that better done in a separate patch, though - I've
> tried not to change existing behaviour with this patch except where absolutely
> necessary (i.e. for XPATH)

Likewise.

> > Ideally, an invalid namespace URI should always be an error. While namespace
> > prefixes could become valid as you assemble a larger document, a namespace
> > URI's validity is context-free.
>
> I agree. That, however, would require xmlelement() to check all xmlns*
> attributes, and I didn't find a way to do that with libxml() (Well,
> other than to parse the element after creating it, but that's a huge
> kludge). So I left that issue for a later time...

Likewise.

> > Remembering to put a pg_xml_done() in every relevant elog(ERROR, ...) path
> > seems fragile. How about using RegisterXactCallback/RegisterSubXactCallback
> > in pgxml_parser_init() to handle those cases? You can also use it to Assert
> > at non-abort transaction shutdown that pg_xml_done() was called as needed.
>
> Hm, isn't that a bit fragile too? It seems that PG_TRY/PG_CATCH blocks don't
> necessarily create sub-transactions, do they?

PG_TRY never creates a subtransaction. However, elog(ERROR, ...) aborts
either the subtransaction, or if none, the top-level transaction. Therefore,
registering the same callback with both RegisterXactCallback and
RegisterSubXactCallback ensures that it gets called for any elog/ereport
non-local exit. Then you only explicitly call pg_xml_done() in the non-error
path.

However, I see now that in the core xml.c, every error-condition pg_xml_done()
appears in a PG_CATCH block. That's plenty reliable ...

> Also, most elog() paths already contained xmlFree() calls, because libxml
> seemingly cannot be made to use context-based memory management. So one
> already needs to be pretty careful when touching these...

... and this is a good reason to keep doing it that way. So, scratch that idea.

> > Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR,
> > emphasizing that it wraps a conditional.
>
> Hm, I think it was named XML_CHECK_ERROR at one point, and I changed it
> because I felt it didn't convey the fact that it's actually ereport()
> and not just return false on error. But I agree that XML_REPORT_ERROR isn't
> very self-explanatory, either. What about XML_REPORT_PENDING_ERROR()
> or XML_CHECK_AND_REPORT_ERROR()?

XML_CHECK_AND_REPORT_ERROR() seems fine. Or XML_CHECK_AND_EREPORT()?

> > On Thu, Jun 09, 2011 at 06:11:46PM +0200, Florian Pflug wrote:
> >> *************** xml_parse(text *data, XmlOptionType xmlo
> >> *** 1175,1182 ****
> >> int32 len;
> >> xmlChar *string;
> >> xmlChar *utf8string;
> >> ! xmlParserCtxtPtr ctxt;
> >> ! xmlDocPtr doc;
> >>
> >> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
> >> string = xml_text2xmlChar(data);
> >> --- 1244,1251 ----
> >> int32 len;
> >> xmlChar *string;
> >> xmlChar *utf8string;
> >> ! xmlParserCtxtPtr ctxt = NULL;
> >> ! xmlDocPtr doc = NULL;
> >>
> >> len = VARSIZE(data) - VARHDRSZ; /* will be useful later */
> >> string = xml_text2xmlChar(data);
> >
> > Is this change needed?
>
> For "doc" it definitely is. Since we can no report an error even if
> the parser returned a valid doc, there's now a "if (doc) xmlFree(doc)"
> in the PG_CATCH patch.
>
> For "ctxt", strictly, speaking no. But only because "ctxt = xmlNewParserCtxt()"
> is the first line in the PG_TRY block, and therefore made the "xmlFree(ctxt)"
> unconditional. But doing it this way seems more consistent and less error-prone.

I see it now; thanks.

> > Is there something about this patch's other changes that make it necessary to
> > strip multiple newlines, vs. one newline as we did before, and to do it in a
> > different place in the code?
>
> There previously wasn't a separate function for that. I made it strip off
> multiple newlines because that allowed me to keep appendStringInfoLineSeparator()
> simple while still making sure that the end result is exactly one newline
> at the end of the string. Previously, the coded depended on libxml always
> adding a newline at the end of messages, which I'm not even sure it true
> for messages reported via the structured error handler.

Fair enough.

> > It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually
> > use. How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2
> > functions that can't error? Also, I think it would be better to require a
> > call to pg_xml_done() in all cases, just to keep things simple. Actually, I
> > wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof. It
> > doesn't save much.
>
> Yeah, that name really is a perfect example of horrible naming ;-)
>
> The pg_xml_done()-not-required semantics are necessary though. The problem is
> that parse_xml_decl() is sometimes called after the caller has already called
> pg_xml_init() himself. Not always though, so one cannot simply remove the
> pg_xml_init() call from parse_xml_decl() - it might then use libxml without
> it being initialized. So I made pg_xml_init() skip the error-handler setup
> when purpose == XML_PURPOSE_NONE. This requires then, however, that pg_xml_done()
> is *not* called, because it'd restore settings that were never saved (well,
> actually it'd tripe an assert that protects against that...).
>
> But yeah, this is all quite contorted. Maybe pg_xml_init() should simply be
> split into two functions, one which initialized the library and one which
> sets up error handling. That means that two calls instead of one are required,
> but it makes the purpose of the individual functions much clearer...

Oh, tricky. That's an option, and the error-handler-initialization function
could just call the more-primitive one, so most callers would not need to
know. Another way is to make xml_error_initialized a counter. If it's called
a second time, increment and do nothing more. Only unregister the handler
when it's decremented to zero. No strong preference here.

> >> ! XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */,
> >
> > It's fine to set the flag for legacy users, considering they could just not
> > read it. The important distinction seems to be that we don't emit warnings or
> > notices in this case. Is that correct? If so, the comment should reflect
> > that emphasis. Then, consider updating the flag unconditionally.
>
> I took me a while to remember while I did it that way, but I think I have now.
>
> I initialled wanted to add an Assert(!xml_error_occurred) to catch any
> missing XML_REPORT_ERROR() calls. I seems to have forgotten to do that,
> though...
>
> So I guess I should either refrain from setting the flag as you suggested,
> or add such an Assert(). I think I very slightly prefer the latter, what
> do you think?

I do like the idea of that assert. How about setting the flag anyway, but
making it "Assert(xml_purpose == XML_PURPOSE_LEGACY || !xml_error_occurred)"?

> >> *** a/src/test/regress/expected/xml.out
> >> --- b/src/test/regress/expected/xml.out
> >> *************** INSERT INTO xmltest VALUES (3, '<wrong')
> >> *** 8,14 ****
> >> ERROR: invalid XML content
> >> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> >> ^
> >> ! DETAIL: Entity: line 1: parser error : Couldn't find end of Start Tag wrong line 1
> >> <wrong
> >> ^
> >> SELECT * FROM xmltest;
> >> --- 8,14 ----
> >> ERROR: invalid XML content
> >> LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
> >> ^
> >> ! DETAIL: line 1: Couldn't find end of Start Tag wrong line 1
> >
> > Any reason to change the error text this way?
>
> The "Entity:" prefix is added by libxml only for messages reported
> to the generic error handler. It *doesn't* refer to entities as defined
> in xml, but instead used in place of the file name if libxml
> doesn't have that at hand (because it's parsing from memory).
>
> So that "Entity:" prefix really doesn't have any meaning whatsoever.

So xmlParserPrintFileContext() sends different content to the generic error
handler from what "natural" calls to the handler would send?

> I believe that the compatibility risk is pretty low here, and removing
> the meaningless prefix makes the error message are bit nicer to read.
> But if you are concerned that there maybe applications out there who
> parse the error text, we could of course add it back. I must say that
> I don't really know what our policy regarding error message stability is...

If the libxml2 API makes it a major pain to preserve exact messages, I
wouldn't worry.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeroen Vermeulen 2011-06-20 20:57:51 Re: [BUG] Denormal float values break backup/restore
Previous Message Tom Lane 2011-06-20 20:44:20 Re: Re: [COMMITTERS] pgsql: Fixed string in German translation that causes segfault.