Re: WIP - xmlvalidate implementation from TODO list

From: Marcos Magueta <maguetamarcos(at)gmail(dot)com>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: WIP - xmlvalidate implementation from TODO list
Date: 2025-12-07 19:43:40
Message-ID: CAN3aFCesNDiL-iZg4imC0n+NgT3JywqZYkuGH83u8ssLjJ-p5Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for your kind review!

Before I continue with the implementation, I would like to address
your concerns and discuss it further and see if it's worth carrying
on.

1) Will do!

2) The issue is the following:
```
./../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=. --bindir= --dlpath=. --max-concurrent-tests=20
--schedule=./parallel_schedule
# +++ regress check in src/test/regress +++
# initializing database system by copying initdb template
# could not exec "sh": No such file or directory
Bail out!# postmaster failed, examine
```

This is likely due to my setup on Nix. If any command assumes paths on
conventional *nix I am often in trouble. I am checking that with a
friend, but any insights are welcome. The out file I generated was
while being completely blindfolded.

By the way, the diff you sent is assuming global paths for some
reason, so I couldn't apply it without manually changing them.

3) From what I understand, that refers to ISO/IEC 9075-14:2016, chapter and
section 11.6, page 245:
- Parts of the grammar that reference an URI: <XML valid according to
what> ::= <XML valid according to URI> | <XML valid according to identifier>
- NO NAMESPACE, which unspecifies the qualification but can access
something through a LOCATI ON
- ID which should allow access to a registered schema
So they amount to:
ACCORDING TO XMLSCHEMA URI <uri> [LOCATION <uri>]
ACCORDING TO XMLSCHEMA NO NAMESPACE [LOCATION <uri>]
ACCORDING TO XMLSCHEMA ID <registered_schema_name>

What I did is rely on the protection mechanisms that are already
implemented to just side-step the issue of arbitrary retrieval.

`PgXmlErrorContext * pg_xml_init(PgXmlStrictness strictness)` starts
the xml error context preventing any attempt by libxml2 to load an
external entity (DTD, XSD from URL, local file, etc.) returns an empty
string instead. Check around the line 1420 of xml.c:

```
/*
* Also, install an entity loader to prevent unwanted fetches of external
* files and URLs.
*/
errcxt->saved_entityfunc = xmlGetExternalEntityLoader();
xmlSetExternalEntityLoader(xmlPgEntityLoader);
```

So since I am relying on a TEXT for the schema, there should be no
issues of that sort. It does however cut part of the grammar that
handles locations, which is part of the standard, and would require
this feature to be much bigger in scope...

4) I suppose you refer to doc/src/sgml/func/func-xml.sgml. Will do

5) Hmm my intent was to simply handle TEXT on the xmlschema portion,
so the expr rule on that side is indeed an oversight. Now about the
first argument, that is just following the pattern already specified
in other xml functions, like XMLPARSE and XMLSERIALIZE, which have the
same <XML Value Expression> specified in the grammar. So that might
have alraedy diverged from the standard a while back...

This is currently grammatically valid, for example:
```
select xmlparse(DOCUMENT (select oid from pg_class limit 1));
ERROR: invalid XML document
DETAIL: line 1: Start tag expected, '<' not found
2619
^
```

As a summary, it does not fully implement schemas as first-class
objects, as that would require extra parts of the grammar specified in
the standard, so I capitulated to use schemas as provided text. That's
already safe in my understanding given the shielding in place. If we
are to implement the rest, I think other serious concerns would arise,
like role management, how to store schemas, etc. And when it comes
to that, is it worth all the trouble just for xml? I would like this
feature and I think the solution of relying on text is decent, since
the cost of complying 100% seems very high for low returns.

Em dom., 7 de dez. de 2025 às 04:34, Kirill Reshke <reshkekirill(at)gmail(dot)com>
escreveu:

> On Sun, 7 Dec 2025 at 04:38, Marcos Magueta <maguetamarcos(at)gmail(dot)com>
> wrote:
> >
> > Hello!
> >
> > I am likely one of the few people still using XML, but I noticed XSD
> schema validation is still a TODO on postgresql, which I have some personal
> use cases for.
> >
> > In this patch I attempt to implement XMLVALIDATE with the already in use
> libxml following a version I got my hands on of the SQL/XML standard of
> 2016.
> >
> > In short, I had to add the ACCORDING word to comply with it and
> completely ignored the version that was already in the src that fetches
> arbitrary schemas (it refers to validations of dtds, which is another more
> troublesome TODO).
> >
> > I had problems running the regression tests on my machine, so I could
> only test the feature by spawning a modified instance of postgresql and
> issuing queries through psql, therefore I am marking it as WIP. If anyone
> can assert the tests pass, I would be glad.
> >
> > Also, this is my first patch, so I might have not followed standard
> practices as best as I could, so please pay particular attention to that on
> review.
> >
> > Cheers,
> > Marcos Magueta.
>
> HI!
>
> 1)
> > + // Default case since nothing got returned
> > + // out of the normal path for validation calls to libxml
>
> PostgreSQL uses /**/ comments style.
>
> 2)
> XML regression test suite fails, see attached. By the way, what are
> your issues with running `make check` ?
>
> 3)
> By the way, in [0] we have this
>
> `
> The function in PostgreSQL produces an “unimplemented” error, because
> PostgreSQL does not have any implementation of the mechanism assumed
> in the standard for registering schemas in advance, which is necessary
> to address the security implications of a function that could refer to
> schemas from arbitrary locations.
> `
>
> How does your patch resolve this? I did not find any change in this area
>
> 4)
> Also I want to mention that without a doc, the patch is not in a
> commitable shape.
>
> 5) I am a bit surprised by this grammar rule:
>
> > XMLVALIDATE '(' document_or_content a_expr ACCORDING TO XMLSCHEMA
> a_expr ')'
>
> this allow a wide class of expressions accepted by parser, like
>
> `SELECT xmlvalidate(DOCUMENT (select oid from pg_class) ACCORDING TO
> XMLSCHEMA (select 32)) AS is_valid FROM xml_validation_test;`
>
> Is this expected? a_expr is way more than string constants and column
> references.. If yes, the regression test that you added, does not
> test this..
>
>
> p.s. I failed to find in google SQL/XML standard of 2016. So, I cannot
> double-check if this feature is compliant with it...
>
> [0] https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards
>
> --
> Best regards,
> Kirill Reshke
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-12-07 19:45:07 Re: Make copyObject work in C++
Previous Message David Geier 2025-12-07 19:10:24 Re: commented out code