Re: [PATCH] Refactor bytea_sortsupport(), take two

From: Aleksander Alekseev <aleksander(at)tigerdata(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: John Naylor <johncnaylorls(at)gmail(dot)com>
Subject: Re: [PATCH] Refactor bytea_sortsupport(), take two
Date: 2025-11-17 10:56:54
Message-ID: CAJ7c6TNEcEDJ2_XckLBmbfNz+Tpeg=srP1hP8t6Oq1XkD31Fxg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,

> Looks mostly okay. The "See varstr_abbrev_abort() for more details."
> are repeated several times close together, so I would replace those
> with a header comment like "This is based on varstr_abbrev_abort(),
> but some comments have been elided for brevity. See there for more
> details."

OK, fixed.

> + * the worst case, where we do many string transformations for no saving
> + * in full strcoll()-based comparisons. These statistics are used by
> + * bytea_abbrev_abort().
>
> s/strcoll/memcmp/, and maybe s/transformations/abbreviations/ since
> that was talking about strxfrm().

Agree. Fixed.

> > > - * More generally, it's okay that bytea callers can have NUL bytes in
> > > - * strings because abbreviated cmp need not make a distinction between
> > > - * terminating NUL bytes, and NUL bytes representing actual NULs in the
> > > - * authoritative representation. Hopefully a comparison at or past one
>
> > > Why is this gone -- AFAICT it's still true for bytea, no matter what
> > > file it's located in?
> >
> > Actually for bytea_abbrev_convert() this comment makes no sense IMO.
> > Presumably the reader is aware that '\x0000...' is a valid bytea, and
> > there is no such thing as "terminating NUL bytes" for bytea.
>
> It makes perfect sense to me. The abbreviated datum has terminating
> NUL bytes if the source length is shorter than 8 bytes. The comment is
> explaining why comparing bytea abbreviated datums still works. It
> seems like everything starting with "abbreviated cmp need not ..." is
> still appropriate for bytea.c.

OK, I returned the comment to bytea.c, but replaced "string" with
"bytea". IMO mentioning "strings" is confusing in this context.

I still don't like the fact that the comment is reasoning about
"terminating NUL bytes" in the context of bytea, but I kept it as is
for now. It seems confusing to me. Do you think we should rewrite
this?

> Likewise, this part of varlena.c is still factually accurate as an aside:
>
> - * (Actually, even if there were NUL bytes in the blob it would be
> - * okay. See remarks on bytea case above.)
>
> ...although with the above, we'll have to point to the relevant place
> in bytea.c.

Not sure if I follow. Previously we agreed that we disallow NUL bytes
within strings, except for terminating ones. What we are doing here is
refactoring / simplifying the code. Why keep irrelevant and confusing
comments?

> I only have one real nitpick left (diff'ing between files):
>
> /* Hash abbreviated key */
> {
> - uint32 tmp;
> + uint32 lohalf,
> + hihalf;
>
> - tmp = DatumGetUInt32(res) ^ (uint32) (DatumGetUInt64(res) >> 32);
> - hash = DatumGetUInt32(hash_uint32(tmp));
> + lohalf = (uint32) res;
> + hihalf = (uint32) (res >> 32);
> + hash = DatumGetUInt32(hash_uint32(lohalf ^ hihalf));
> }
>
> If we're going to do anything different here at all, the logical thing
> would be to simplify with murmurhash64() since datums are now 8 bytes
> everywhere. That's material for a separate patch, though.

Good point. Fixed.

--
Best regards,
Aleksander Alekseev

Attachment Content-Type Size
v6-0001-Refactor-bytea_sortsupport.patch text/x-patch 13.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2025-11-17 10:58:26 Re: Fix build of llvmjit_types.bc with meson
Previous Message Heikki Linnakangas 2025-11-17 10:55:06 Re: Running a single test