Re: Cleaning up array_in()

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nikhil Benesch <nikhil(dot)benesch(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Cleaning up array_in()
Date: 2023-07-08 19:38:31
Message-ID: 5859ce4e-2be4-92b0-c85c-e1e24eab57c6@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/07/2023 19:08, Tom Lane wrote:
> I wrote:
>> So I end up with the attached. I went ahead and dropped
>> ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
>> where the new 0002 avoids re-indenting any existing code in order
>> to ease review, and then 0003 is just a mechanical application
>> of pgindent.
>
> That got sideswiped by ae6d06f09, so here's a trivial rebase to
> pacify the cfbot.
>
> #text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch] /home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
> #text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
> #text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch] /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch

Something's wrong with your attachments.

Hmm, I wonder if ae6d06f09 had a negative performance impact. In an
unquoted array element, scanner_isspace() function is called for every
character, so it might be worth inlining.

On the patches: They are a clear improvement, thanks for that. That
said, I still find the logic very hard to follow, and there are some
obvious performance optimizations that could be made.

ArrayCount() interprets low-level quoting and escaping, and tracks the
dimensions at the same time. The state machine is pretty complicated.
And when you've finally finished reading and grokking that function, you
see that ReadArrayStr() repeats most of the same logic. Ugh.

I spent some time today refactoring it for readability and speed. I
introduced a separate helper function to tokenize the input. It deals
with whitespace, escapes, and backslashes. Then I merged ArrayCount()
and ReadArrayStr() into one function that parses the elements and
determines the dimensions in one pass. That speeds up parsing large
arrays. With the tokenizer function, the logic in ReadArrayStr() is
still quite readable, even though it's now checking the dimensions at
the same time.

I also noticed that we used atoi() to parse the integers in the
dimensions, which doesn't do much error checking. Some funny cases were
accepted because of that, for example:

postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
text
-----------
{foo,bar}
(1 row)

I tightened that up in the passing.

Attached are your patches, rebased to fix the conflicts with ae6d06f09
like you intended. On top of that, my patches. My patches need more
testing, benchmarking, and review, so if we want to sneak something into
v16, better go with just your patches. If we're tightening up the
accepted inputs, maybe fix that atoi() sloppiness, though.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v3-0001-Simplify-and-speed-up-ReadArrayStr.patch text/x-patch 8.1 KB
v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch text/x-patch 20.3 KB
v3-0003-Re-indent-ArrayCount.patch text/x-patch 10.7 KB
v3-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patch text/x-patch 8.2 KB
v3-0005-Determine-array-dimensions-and-parse-the-elements.patch text/x-patch 38.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gurjeet Singh 2023-07-08 19:47:32 Re: DecodeInterval fixes
Previous Message Tom Lane 2023-07-08 19:34:51 Re: Cleaning up array_in()