Re: Can we get sha* function over text, that could be used in index?

From: Linus Heckemann <linus(at)schreibt(dot)jetzt>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: pgsql-general mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: Can we get sha* function over text, that could be used in index?
Date: 2026-02-21 09:54:53
Message-ID: ygaseaujuaa.fsf@localhost
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-general

"Daniel Verite" <daniel(at)manitou-mail(dot)org> writes:
>> - there is a byte-array representation of text columns, which appears to
>> be independent of database encoding
>
> Not sure what you're refering to. Both the on-disk and in-memory
> representations of text/varchar are encoding-dependent.

Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ÿ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!

>> The obvious (to a naive user, like I was) approach, casting to bytea,
>> has exceptionally surprising behaviour: for many text strings, it does
>> exactly what the naive user might hope for, giving back the UTF-8
>> representation. But multiple distinct text strings, like '\033' and
>> '\x1b', convert to the same byte string! And text strings containing a
>> backslash that doesn't fit the bytea hex format or the bytea escape
>> format will fail to convert completely!
>
> Yes. It seems a common mistake to forget or ignore that
> backslashes are special in the input text representation of bytea.
> It might be not obvious from reading the doc at [1]
> but we just need to quote backslashes by doubling them.
>
> AFAIK a working solution for the OP would be:
> sha256(replace(colname, '\', '\\')::bytea)

That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.

Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's

> guaranteed to return the same results given the same arguments forever

only valid within the context of a single database?

md5() and ::bytea aside, it seems like it would be good to have both

- a function for accessing the bytea representation of a text value
without any extra steps, and

- an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
because it's the only encoding with a character set encompassing all
other supported character sets) bytea representation of a text value
regardless of the database encoding?

Cheers
Linus

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Pawel Kudzia 2026-02-21 10:20:07 Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
Previous Message Daniel Verite 2026-02-20 21:40:37 Re: Can we get sha* function over text, that could be used in index?