Re: [PATCH] Add pretty-printed XML output option

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Borodin <amborodin86(at)gmail(dot)com>
Subject: Re: [PATCH] Add pretty-printed XML output option
Date: 2023-03-09 20:21:36
Message-ID: 532259.1678393296@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Peter Smith <smithpb2250(at)gmail(dot)com> writes:
> The patch v19 LGTM.

I've looked through this now, and have some minor complaints and a major
one. The major one is that it doesn't work for XML that doesn't satisfy
IS DOCUMENT. For example,

regression=# select '<bar><val x="y">42</val></bar><foo></foo>'::xml is document;
?column?
----------
f
(1 row)

regression=# select xmlserialize (content '<bar><val x="y">42</val></bar><foo></foo>' as text);
xmlserialize
-------------------------------------------
<bar><val x="y">42</val></bar><foo></foo>
(1 row)

regression=# select xmlserialize (content '<bar><val x="y">42</val></bar><foo></foo>' as text indent);
ERROR: invalid XML document
DETAIL: line 1: Extra content at the end of the document
<bar><val x="y">42</val></bar><foo></foo>
^

This is not what the documentation promises, and I don't think it's
good enough --- the SQL spec has no restriction saying you can't
use INDENT with CONTENT. I tried adjusting things so that we call
xml_parse() with the appropriate DOCUMENT or CONTENT xmloption flag,
but all that got me was empty output (except for a document header).
It seems like xmlDocDumpFormatMemory is not the thing to use, at least
not in the CONTENT case. But libxml2 has a few other "dump"
functions, so maybe we can use a different one? I see we are using
xmlNodeDump elsewhere, and that has a format option, so maybe there's
a way forward there.

A lesser issue is that INDENT tacks on a document header (XML declaration)
whether there was one or not. I'm not sure whether that's an appropriate
thing to do in the DOCUMENT case, but it sure seems weird in the CONTENT
case. We have code that can strip off the header again, but we
need to figure out exactly when to apply it.

I also suspect that it's outright broken to attach a header claiming
the data is now in UTF8 encoding. If the database encoding isn't
UTF8, then either that's a lie or we now have an encoding violation.

Another thing that's mildly irking me is that the current
factorization of this code will result in xml_parse'ing the data
twice, if you have both DOCUMENT and INDENT specified. We could
consider avoiding that if we merged the indentation functionality
into xmltotext_with_xmloption, but it's probably premature to do so
when we haven't figured out how to get the output right --- we might
end up needing two xml_parse calls anyway with different parameters,
perhaps.

I also had a bunch of cosmetic complaints (mostly around this having
a bad case of add-at-the-end-itis), which I've cleaned up in the
attached v20. This doesn't address any of the above, however.

regards, tom lane

Attachment Content-Type Size
v20-0001-Add-pretty-printed-XML-output-option.patch text/x-diff 24.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-03-09 20:25:00 Re: buildfarm + meson
Previous Message Jeff Davis 2023-03-09 20:17:53 Re: ICU locale validation / canonicalization