[PoC] Add CANONICAL option to xmlserialize

From: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [PoC] Add CANONICAL option to xmlserialize
Date: 2023-02-27 13:16:30
Message-ID: 67fa8560-8d61-5d06-8178-fc9c7684db90@uni-muenster.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In order to compare pairs of XML documents for equivalence it is
necessary to convert them first to their canonical form, as described at
W3C Canonical XML 1.1.[1] This spec basically defines a standard
physical representation of xml documents that have more then one
possible representation, so that it is possible to compare them, e.g.
forcing UTF-8 encoding, entity reference replacement, attributes
normalization, etc.

Although it is not part of the XML/SQL standard, it would be nice to
have the option CANONICAL in xmlserialize. Additionally, we could also
add the attribute WITH [NO] COMMENTS to keep or remove xml comments from
the documents.

Something like this:

WITH t(col) AS (
 VALUES
  ('<?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE doc SYSTEM "doc.dtd" [
  <!ENTITY val "42">
  <!ATTLIST xyz attr CDATA "default">
  ]>

  <!-- ordering of attributes -->
  <foo ns:c = "3" ns:b = "2" ns:a = "1"
    xmlns:ns="http://postgresql.org">

    <!-- Normalization of whitespace in start and end tags -->
    <!-- Elimination of superfluous namespace declarations,
         as already declared in <foo> -->
 <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >

    <!-- Empty element conversion to start-end tag pair -->
    <empty/>

    <!-- Effect of transcoding from a sample encoding to UTF-8 -->
    <iso8859>&#169;</iso8859>

    <!-- Addition of default attribute -->
    <!-- Whitespace inside tag preserved -->
    <xyz> 321 </xyz>
  </foo>
  <!-- comment outside doc -->'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
xmlserialize
--------------------------------------------------------------------------------------------------------------------------------------------------------
 <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2"
ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz
attr="default"> 321 </xyz></foo>
(1 row)

-- using WITH COMMENTS

WITH t(col) AS (
 VALUES
  (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
    xmlns:ns="http://postgresql.org">
    <!-- very important comment -->
    <xyz> 321 </xyz>
  </foo>'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
xmlserialize
------------------------------------------------------------------------------------------------------------------------
 <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!--
very important comment --><xyz> 321 </xyz></foo>
(1 row)

Another option would be to simply create a new function, e.g.
xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this
would be the right approach.

Attached a very short draft. What do you think?

Best, Jim

1- https://www.w3.org/TR/xml-c14n11/

Attachment Content-Type Size
Add-CANONICAL-Option-to-xmlserialize-draft.patch text/x-patch 8.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari Mannsåker 2023-02-27 13:22:53 Adding argument names to aggregate functions
Previous Message Juan José Santamaría Flecha 2023-02-27 13:11:34 Re: buildfarm + meson