From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Wienhold <ewie(at)ewie(dot)name> |
Subject: | Regression with large XML data input |
Date: | 2025-07-24 03:12:28 |
Message-ID: | aIGknLuc8b8ega2X@paquier.xyz |
Lists: | pgsql-hackers |
Hi all,
(Adding in CC Tom and Erik, as committer and author.)
A customer has reported a regression with the parsing of rather large
XML data, introduced by the set of backpatches done with f68d6aabb7e2
& friends.
The problem comes from the switch in xml_parse() from
xmlParseBalancedChunkMemory() to xmlNewNode() +
xmlParseInNodeContext(), where we use a temporary root node for the
case where parse_as_document is false. That switch was done to avoid
an issue with xmlParseBalancedChunkMemory() in libxml2 2.13.0-2.13.2,
a bug that has already been fixed upstream.
If the input XML data is large enough, parsing now fails at the top
of the latest branches, while it worked properly before. Here is a
short test case (courtesy of a colleague, which I have modified
slightly):
CREATE TABLE xmldata (id BIGINT PRIMARY KEY, message XML);
DO $$
DECLARE
  size_40mb TEXT := repeat('X', 40000000);
BEGIN
  BEGIN
    INSERT INTO xmldata (id, message) VALUES
      (1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml));
    RAISE NOTICE 'insert 40MB successful';
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 'Error insert 40MB: %', SQLERRM;
  END;
END $$;
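To narrow down the input size at which the failure starts, a loop
along these lines could be used. This is only a sketch: the probe
sizes are arbitrary values chosen for illustration and are not part
of the original report, and no particular threshold is assumed.

```sql
DO $$
DECLARE
  -- Payload sizes in bytes. Arbitrary probe values (assumption),
  -- not taken from the original report.
  sizes INT[] := ARRAY[1000000, 5000000, 10000000, 20000000, 40000000];
  sz INT;
  doc XML;
BEGIN
  FOREACH sz IN ARRAY sizes LOOP
    BEGIN
      -- Same cast as in the test case above, without the table.
      doc := ('<Root><Item><Content>' || repeat('X', sz)
              || '</Content></Item></Root>')::xml;
      RAISE NOTICE 'size %: parsed OK', sz;
    EXCEPTION WHEN OTHERS THEN
      RAISE NOTICE 'size %: failed: %', sz, SQLERRM;
    END;
  END LOOP;
END $$;
```

Each size is parsed in its own sub-block, so one failure does not
stop the remaining probes.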
Switching back to the previous code, where we rely on
xmlParseBalancedChunkMemory(), fixes the issue. A quick POC is
attached. It fails one case in check-world with SERIALIZE because I
am not sure it is possible to pass down some options through
xmlParseBalancedChunkMemory(). Still, the regression is gone, and I
am wondering if there is a better solution that dodges the original
problem while still accepting this case. One good thing is that
xmlParseBalancedChunkMemory() is able to return a list of nodes,
which we need for this code path of xml_parse(). So perhaps one
solution would be the addition of a code path with
xmlParseBalancedChunkMemory() depending on the options we want to
process, keeping the code path with the fake "content-root" for the
XML SERIALIZE case.
The patch in question was first applied as 6082b3d5d3d1 on HEAD,
impacting v18~, and was then backpatched down to all stable branches,
like f68d6aabb7e2, introducing the regression in all the stable
branches since the minor releases done in August 2024, namely:
12.20, 13.16, 14.13, 15.8, 16.4.
Thoughts or comments?
--
Michael
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-xml2-regression.patch | text/x-diff | 2.0 KB |