[PATCH] Add CANONICAL option to xmlserialize

From: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [PATCH] Add CANONICAL option to xmlserialize
Date: 2023-03-05 18:44:01
Message-ID: 2f04079e-b237-1a2f-0682-9f60ca59805e@uni-muenster.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27.02.23 14:16, I wrote:
> Hi,
>
> In order to compare pairs of XML documents for equivalence it is
> necessary to convert them first to their canonical form, as described
> at W3C Canonical XML 1.1.[1] This spec basically defines a standard
> physical representation of xml documents that have more then one
> possible representation, so that it is possible to compare them, e.g.
> forcing UTF-8 encoding, entity reference replacement, attributes
> normalization, etc.
>
> Although it is not part of the XML/SQL standard, it would be nice to
> have the option CANONICAL in xmlserialize. Additionally, we could also
> add the attribute WITH [NO] COMMENTS to keep or remove xml comments
> from the documents.
>
> Something like this:
>
> WITH t(col) AS (
>  VALUES
>   ('<?xml version="1.0" encoding="ISO-8859-1"?>
>   <!DOCTYPE doc SYSTEM "doc.dtd" [
>   <!ENTITY val "42">
>   <!ATTLIST xyz attr CDATA "default">
>   ]>
>
>   <!-- ordering of attributes -->
>   <foo ns:c = "3" ns:b = "2" ns:a = "1"
>     xmlns:ns="http://postgresql.org">
>
>     <!-- Normalization of whitespace in start and end tags -->
>     <!-- Elimination of superfluous namespace declarations,
>          as already declared in <foo> -->
>  <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >
>
>     <!-- Empty element conversion to start-end tag pair -->
>     <empty/>
>
>     <!-- Effect of transcoding from a sample encoding to UTF-8 -->
>     <iso8859>&#169;</iso8859>
>
>     <!-- Addition of default attribute -->
>     <!-- Whitespace inside tag preserved -->
>     <xyz> 321 </xyz>
>   </foo>
>   <!-- comment outside doc -->'::xml)
> )
> SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
> xmlserialize
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2"
> ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz
> attr="default"> 321 </xyz></foo>
> (1 row)
>
> -- using WITH COMMENTS
>
> WITH t(col) AS (
>  VALUES
>   (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
>     xmlns:ns="http://postgresql.org">
>     <!-- very important comment -->
>     <xyz> 321 </xyz>
>   </foo>'::xml)
> )
> SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
> xmlserialize
> ------------------------------------------------------------------------------------------------------------------------
>
>  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!--
> very important comment --><xyz> 321 </xyz></foo>
> (1 row)
>
>
> Another option would be to simply create a new function, e.g.
> xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this
> would be the right approach.
>
> Attached a very short draft. What do you think?
>
> Best, Jim
>
> 1- https://www.w3.org/TR/xml-c14n11/

The attached version includes documentation and tests to the patch.

I hope things are clearer now :)

Best, Jim

Attachment Content-Type Size
v1-0001-Add-CANONICAL-format-to-xmlserialize.patch text/x-patch 31.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2023-03-05 20:26:12 Re: Should vacuum process config file reload more often
Previous Message Ankit Kumar Pandey 2023-03-05 18:17:57 Re: [Question] Similar Cost but variable execution time in sort