Re: Regression with large XML data input

From: Robert Treat <rob(at)xzilla(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Erik Wienhold <ewie(at)ewie(dot)name>
Subject: Re: Regression with large XML data input
Date: 2025-07-25 18:02:47
Message-ID: CABV9wwOY3pH+pA0R1hSq5g_DXqeDaGWRuoBEE4QwWLfTiw+nKw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Fri, Jul 25, 2025 at 01:25:48AM +0200, Jim Jones wrote:
> > On 24.07.25 21:23, Tom Lane wrote:
> >> However, when testing on RHEL8 with libxml2 2.9.7, indeed
> >> I get "Huge input lookup" with our current code but no
> >> failure with f68d6aabb7e2^.
> >>
> >> The way I interpret these results is that in older libxml2 versions,
> >> xmlParseBalancedChunkMemory is missing an XML_ERR_RESOURCE_LIMIT check
> >> that does exist in newer versions. So even if we were to do some kind
> >> of reversion, it would only prevent the error in libxml2 versions that
> >> lack that check. And in those versions we'd probably be exposing
> >> ourselves to resource-exhaustion problems.
>
> Linux distributions don't seem very eager to add this check, though.
> The top of Debian sid uses a version of libxml2 where the difference
> shows up, which means we have years ahead of us with the old code.
>
> If we were discussing things from the perspective of this new code
> being added after a major version bump of Postgres, I would not argue
> much about it: breakages happen every year and users adapt their
> applications to them. Here, however, we are talking about a change in
> a stable branch, across a minor version, which should be rather more
> seamless from a user perspective. I may be influenced by having seen
> a customer impacted by this, of course; there is no denying that. The
> point is that this enforces a behavior that's part of 2.13 and
> onwards. Versions of PG before f68d6aabb7e2 were still OK with such
> input, and the new code of Postgres closes the door completely. Even
> if the behavior Postgres had when linking with libxml2 2.12 or older
> was somewhat "accidental", because xmlParseBalancedChunkMemory()
> lacked the XML_ERR_RESOURCE_LIMIT check, it was there, and users
> relied on it.
>
> One possibility that I could see here for stable branches would be to
> make the code a bit smarter depending on LIBXML_VERSION, where we
> could keep the new code for 2.13 onwards, but keep a compatible
> behavior with 2.12 and older, based on xmlParseBalancedChunkMemory().
>

While I am pretty sympathetic to the idea that we hang our hats on
"Postgres doesn't break things in minor version updates", and this
seems to betray that, one scenario where we would break things is when
it's the only reasonable option with respect to a bug or security fix,
which this seems potentially close to. If we can come up with a
workaround like the one above, though, it would certainly be nice to
give people a path forward, even if it ends up as a breaking change in
a major version. That at least eliminates the "surprise!" factor.
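
To make that concrete, here's a minimal sketch of the LIBXML_VERSION
dispatch being described. The helper name is made up, and the 2.13+
branch is just a placeholder for whatever the current in-tree code
does; the real change would obviously look different in detail:

    #include <libxml/parser.h>

    /*
     * Hypothetical helper: parse a balanced XML fragment, keeping the
     * pre-f68d6aabb7e2 behavior when built against libxml2 2.12 or
     * older.
     */
    static int
    parse_fragment_compat(xmlDocPtr doc, const char *data)
    {
    #if LIBXML_VERSION >= 21300
        /*
         * 2.13 and newer: use the current code path (elided here),
         * where the XML_ERR_RESOURCE_LIMIT check applies either way.
         */
        return -1;              /* placeholder for the new code path */
    #else
        /*
         * 2.12 and older: xmlParseBalancedChunkMemory() lacks the
         * XML_ERR_RESOURCE_LIMIT check, so large text nodes keep
         * working as they did before the change.
         */
        xmlNodePtr  list = NULL;
        int         res;

        res = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                          (const xmlChar *) data, &list);
        if (res == 0 && list != NULL)
            xmlFreeNodeList(list);  /* the real code would attach these */
        return res;
    #endif
    }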

> >> On the whole I'm thinking more and more that we don't want to
> >> touch this. Our recommendation for processing multi-megabyte
> >> chunks of XML should be "don't". Unless we want to find or
> >> write a replacement for libxml2 ... which we have discussed,
> >> but so far nothing's happened.
> >
> > I also believe that addressing this limitation may not be worth the
> > associated risks. Moreover, a 10MB text node is rather large and
> > probably exceeds the needs of most users.
>
> Yeah, but some people do use it, so while I am OK with accepting
> this as the conclusion and reporting back to this thread that
> applications need workarounds to split their inputs, it was really
> surprising (i.e. from the point of view of the customer whose
> application suddenly fails after what should have been a "simple"
> minor update).

There are a lot of public data sets that provide XML dumps as a
generic interchange format for "non-commercial databases", and those
can often be quite large. I suspect we don't see those use cases much
because, historically, users have been forced to resort to
Perl/Python/etc. scripts to convert the data before ingesting it.
Which is to say, I think these use cases are more common than we
realize, and if there were ever a stable implementation that supported
such large inputs, we'd start to see more of them.
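
As an aside, for anyone stuck splitting such dumps today, libxml2's
streaming xmlReader interface can carve a big file into per-record
chunks without materializing the whole document. A rough sketch,
assuming a made-up <record> element and file name:

    #include <stdio.h>
    #include <libxml/xmlreader.h>

    int
    main(void)
    {
        xmlTextReaderPtr reader = xmlReaderForFile("dump.xml", NULL, 0);
        int         ret;

        if (reader == NULL)
            return 1;

        ret = xmlTextReaderRead(reader);
        while (ret == 1)
        {
            /* Each <record> element becomes one standalone chunk. */
            if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT &&
                xmlStrEqual(xmlTextReaderConstName(reader),
                            BAD_CAST "record"))
            {
                xmlChar    *chunk = xmlTextReaderReadOuterXml(reader);

                if (chunk != NULL)
                {
                    printf("%s\n", (const char *) chunk);   /* INSERT here */
                    xmlFree(chunk);
                }
                ret = xmlTextReaderNext(reader);    /* skip the subtree */
            }
            else
                ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        return (ret == 0) ? 0 : 1;
    }

Each chunk can then be inserted as its own row, keeping any single
value well below the limit being discussed here.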

Robert Treat
https://xzilla.net
