Re: Regression with large XML data input

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Treat <rob(at)xzilla(dot)net>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Erik Wienhold <ewie(at)ewie(dot)name>
Subject: Re: Regression with large XML data input
Date: 2025-07-25 18:21:26
Message-ID: 1944118.1753467686@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Treat <rob(at)xzilla(dot)net> writes:
> On Thu, Jul 24, 2025 at 8:08 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> If we were discussing things from the perspective where this new code
>> was added after a major version bump of Postgres, I would not argue
>> much about that: breakages happen every year and users adapt their
>> applications to it. Here, however, we are talking about a change in a
>> stable branch, across a minor version, which should be a bit more
>> flawless from a user perspective?

> While I am pretty sympathetic to the idea that we hang our hats on
> "Postgres doesn't break things in minor version updates", and this
> seems to betray that, one scenario where we would break things is if
> it were the only reasonable option wrt a bug / security fix, which
> this seems potentially close to.

I'll be the first to say that I'm not too pleased with it either.
However, from Jim Jones' result upthread, a "minor update" of libxml2
could also have caused this problem: 2.9.7 and 2.9.14 behave
differently. So we don't have sole control --- or sole responsibility
--- here.

I'd be more excited about trying to avoid the failure if I were not
afraid that "avoid the failure" really means "re-expose a security
hazard". Why should we believe that if libxml2 throws a
resource-limit error (for identical inputs) in one code path and not
another, that's anything but a missed error check in the second path?
(Maybe this is the same thing Robert is saying, not quite sure.)

> There are a lot of public data sets that provide xml dumps as a
> generic format for "non-commercial databases", and those can often be
> quite large. I suspect we don't see those use cases a lot because
> historically users have been forced to resort to perl/python/etc
> scripts to convert the data prior to ingesting. Which is to say, I
> think these use cases are more common than we think, and if there were
> ever a stable implementation that supported these large use cases,
> we'd start to see more of them.

Yeah, it's a real shame that we don't have more-reliable
infrastructure for XML. I'm not volunteering to fix it though...

regards, tom lane
