Re: [PATCH] avoid double scanning in function byteain

From: Stepan Neretin <slpmcf(at)gmail(dot)com>
To: Steven Niu <niushiji(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] avoid double scanning in function byteain
Date: 2025-05-09 10:57:35
Message-ID: CA+Yyo5R2FaYJUXsQJctt5kFHCrbnZNJ=H=b3brveGHREnhGpBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 9, 2025 at 5:37 PM Stepan Neretin <slpmcf(at)gmail(dot)com> wrote:

>
>
> On Fri, May 9, 2025 at 5:24 PM Stepan Neretin <slpmcf(at)gmail(dot)com> wrote:
>
>>
>>
>> On Wed, Mar 26, 2025 at 9:39 PM Steven Niu <niushiji(at)gmail(dot)com> wrote:
>>
>>>
>>> 在 2025/3/26 16:37, Kirill Reshke 写道:
>>> > On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji(at)gmail(dot)com> wrote:
>>> >>
>>> >> Hi,
>>> >
>>> > Hi!
>>> >
>>> >> This double scanning can be inefficient, especially for large inputs.
>>> >> So I optimized the function to eliminate the need for two scans,
>>> >> while preserving correctness and efficiency.
>>> >
>>> > While the argument that processing input once not twice is fast is
>>> > generally true, may we have some simple bench here just to have an
>>> > idea how valuable this patch is?
>>> >
>>> >
>>> > Patch:
>>> >
>>> >
>>> >> + /* Handle traditional escaped style in a single pass */
>>> >> + input_len = strlen(inputText);
>>> >> + result = palloc(input_len + VARHDRSZ); /* Allocate max possible
>>> size */
>>> >> rp = VARDATA(result);
>>> >> + tp = inputText;
>>> >> +
>>> >> while (*tp != '\0')
>>> >
>>> >
>>> > Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
>>> >
>>> >
>>> >
>>> > [0]
>>> https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
>>> >
>>>
>>>
>>>
>>> Hi, Kirill,
>>>
>>> Your deep insight suprised me!
>>>
>>> Yes, you are correct that strlen() actually performed a loop operation.
>>> So maybe the performance difference is not so obvious.
>>>
>>> However, there are some other reasons that drive me to make this change.
>>>
>>> 1. The author of original code left comment: "BUGS: The input is scanned
>>> twice." .
>>> You can find this line of code in my patch. This indicates a left work
>>> to be done.
>>>
>>> 2. If I were the author of this function, I would not be satisfied with
>>> myself that I used two loops to do something which actually can be done
>>> with one loop.
>>> I prefer to choose a way that would not add more burden to readers.
>>>
>>> 3. The while (*tp != '\0') loop has some unnecessary codes and I made
>>> some change.
>>>
>>> Thanks,
>>> Steven
>>>
>>>
>>>
>>
>>
>> Hi hackers,
>>
>> This is a revised version (v2) of the patch that optimizes the
>> `byteain()` function.
>>
>> The original implementation handled escaped input by scanning the string
>> twice — first to determine the output size and again to fill in the bytea.
>> This patch eliminates the double scan by using a single-pass approach with
>> `StringInfo`, simplifying the logic and improving maintainability.
>>
>> Changes since v1 (originally by Steven Niu):
>> - Use `StringInfo` instead of manual memory allocation.
>> - Remove redundant code and improve readability.
>> - Add regression tests for both hex and escaped formats.
>>
>> This version addresses performance and clarity while ensuring
>> compatibility with existing behavior. The patch also reflects discussion on
>> the original version, including feedback from Kirill Reshke.
>>
>> Looking forward to your review and comments.
>>
>> Best regards,
>> Stepan Neretin
>>
>
>
> Hi,
>
> I noticed that the previous version of the patch was authored with an
> incorrect email address due to a misconfigured git config.
>
> I've corrected the author information in this v2 and made sure it's
> consistent with my usual contributor identity. No other changes were
> introduced apart from that and the updates discussed earlier.
>
> Sorry for the confusion, and thanks for your understanding.
>
> Best regards,
>
> Stepan Neretin
>

Hi,

Sorry for the noise — I'm resending the patch because I noticed a compiler
warning related to mixed declarations and code, which I’ve now fixed.

Apologies for the oversight in the previous submission.

Regards,

Stepan Neretin

Attachment Content-Type Size
0001-Optimize-function-byteain-to-avoid-double-scanning.patch text/x-patch 3.1 KB
0002-Refactor-byteain-to-avoid-double-scanning-and-simpli.patch text/x-patch 7.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2025-05-09 12:03:55 Re: Vacuum statistics
Previous Message Yasir 2025-05-09 10:40:08 Re: Why our Valgrind reports suck