Re: [PATCH] Add pretty-printed XML output option

From: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Borodin <amborodin86(at)gmail(dot)com>
Subject: Re: [PATCH] Add pretty-printed XML output option
Date: 2023-03-10 13:30:07
Message-ID: 906cfd1d-63a0-3cf5-7291-2dadb2e149ff@uni-muenster.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks a lot for the review!

On 09.03.23 21:21, Tom Lane wrote:
> I've looked through this now, and have some minor complaints and a major
> one. The major one is that it doesn't work for XML that doesn't satisfy
> IS DOCUMENT. For example,
>
> regression=# select '<bar><val x="y">42</val></bar><foo></foo>'::xml is document;
> ?column?
> ----------
> f
> (1 row)
>
> regression=# select xmlserialize (content '<bar><val x="y">42</val></bar><foo></foo>' as text);
> xmlserialize
> -------------------------------------------
> <bar><val x="y">42</val></bar><foo></foo>
> (1 row)
>
> regression=# select xmlserialize (content '<bar><val x="y">42</val></bar><foo></foo>' as text indent);
> ERROR: invalid XML document
> DETAIL: line 1: Extra content at the end of the document
> <bar><val x="y">42</val></bar><foo></foo>
> ^

I assumed it should fail because the XML string doesn't have a
singly-rooted XML. Oracle has this feature implemented and it does not
seem to allow non singly-rooted strings either[1]. Also, some the tools
I use also fail in this case[2,3]

How do you suggest the output should look like? Does the SQL spec also
define it? I can't find it online :(

> This is not what the documentation promises, and I don't think it's
> good enough --- the SQL spec has no restriction saying you can't
> use INDENT with CONTENT. I tried adjusting things so that we call
> xml_parse() with the appropriate DOCUMENT or CONTENT xmloption flag,
> but all that got me was empty output (except for a document header).
> It seems like xmlDocDumpFormatMemory is not the thing to use, at least
> not in the CONTENT case. But libxml2 has a few other "dump"
> functions, so maybe we can use a different one? I see we are using
> xmlNodeDump elsewhere, and that has a format option, so maybe there's
> a way forward there.
>
> A lesser issue is that INDENT tacks on a document header (XML declaration)
> whether there was one or not. I'm not sure whether that's an appropriate
> thing to do in the DOCUMENT case, but it sure seems weird in the CONTENT
> case. We have code that can strip off the header again, but we
> need to figure out exactly when to apply it.
I replaced xmlDocDumpFormatMemory with xmlSaveToBuffer and used to
option XML_SAVE_NO_DECL for input docs with XML declaration. It no
longer returns a XML declaration if the input doc does not contain it.
> I also suspect that it's outright broken to attach a header claiming
> the data is now in UTF8 encoding. If the database encoding isn't
> UTF8, then either that's a lie or we now have an encoding violation.
I was mistakenly calling xml_parse with GetDatabaseEncoding(). It now
uses the encoding of the given doc and UTF8 if not provided.
> Another thing that's mildly irking me is that the current
> factorization of this code will result in xml_parse'ing the data
> twice, if you have both DOCUMENT and INDENT specified. We could
> consider avoiding that if we merged the indentation functionality
> into xmltotext_with_xmloption, but it's probably premature to do so
> when we haven't figured out how to get the output right --- we might
> end up needing two xml_parse calls anyway with different parameters,
> perhaps.
>
> I also had a bunch of cosmetic complaints (mostly around this having
> a bad case of add-at-the-end-itis), which I've cleaned up in the
> attached v20. This doesn't address any of the above, however.
I swear to god I have no idea what "add-at-the-end-itis" means :)
> regards, tom lane

Thanks a lot!

Best, Jim

1 - https://dbfiddle.uk/WUOWtjBU

2 - https://www.samltool.com/prettyprint.php

3 - https://xmlpretty.com/xmlpretty

Attachment Content-Type Size
v21-0001-Add-pretty-printed-XML-output-option.patch text/x-patch 29.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Önder Kalacı 2023-03-10 14:28:33 Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Previous Message Justin Pryzby 2023-03-10 13:05:49 Re: Add LZ4 compression in pg_dump