| From: | Tim Anderson <postgresql(at)timando(dot)net> |
|---|---|
| To: | pgsql-novice(at)lists(dot)postgresql(dot)org |
| Subject: | Re: is there a way to automate deduplication of strings? |
| Date: | 2026-01-07 22:05:56 |
| Message-ID: | 75e0f4dd-cc89-4305-bb8b-4e19c0060ff4@timando.net |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-novice |
One thing you could do if you're OK with the space overhead of uuid vs
int, is use a hash of the user agent e.g. `md5(user_agent)::uuid` which
would reduce the need to lookup the value when inserting. Then when you
get a foreign key violation, add the user agent to the user_agent table.
On 1/1/26 10:25, Chris Papademetrious wrote:
>
> Hi Greg,
>
> Thanks for the reply! I tried to be vague to avoid getting distracted
> by the details, but I think I overdid it!
>
> Let’s say I have a table of transactions like this:
>
> CREATE TABLE transaction (
>
> id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
> user_agent TEXT NOT NULL,
>
> --
>
> -- ...more columns here...
>
> --
>
> );
>
> The table can contain millions of transactions. The *user_agent*field
> stores information about the application that performed each
> transaction. These user-agent values will be populated from a
> relatively small set of unique values. For example,
>
> MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop
> (BuildID 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile
> (BuildID 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID
> 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID
> 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop
> (BuildID 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile
> (BuildID 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID
> 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID
> 11111111-1111-1111-1111-111111111111)
>
> The values themselves will vary over time (as new versions appear and
> old versions age out) so the set cannot be hardcoded, but the column
> will always contain large numbers of duplicate values.
>
> I could store the user-agent values in a separate table and reference
> them by a UUID computed from their value:
>
> CREATE TABLE user_agent (
>
> id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
> user_agent TEXT NOT NULL UNIQUE,
>
> );
>
> CREATE TABLE transaction (
>
> id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
> user_agent_id UUID NOT NULL,
>
> CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id) REFERENCES
> user_agent(id) ON DELETE CASCADE
>
> );
>
> but this adds transactional complexity for storage and retrieval,
> along with cleanup of no-longer-referenced values over time.
>
> I’m wishing for a magic “sparsely stored texts” column in Postgres
> that performs this deduplication automatically, but I don’t think it
> exists. So I’m wondering, is there an extension or some other trick to
> get the space savings without the transactional complexity?
>
> * Chris
>
> *From:*Greg Sabino Mullane <htamfids(at)gmail(dot)com>
> *Sent:* Wednesday, December 31, 2025 10:12 AM
> *To:* Chris Papademetrious <chrispy(at)synopsys(dot)com>
> *Cc:* pgsql-novice(at)lists(dot)postgresql(dot)org
> *Subject:* Re: is there a way to automate deduplication of strings?
>
> It is not quite clear what you are trying to do. Can you provide a
> small test table to show what you want to achieve?
>
> Cheers,
>
> Greg
>
> --
>
> Crunchy Data - https://www.crunchydata.com
> <https://urldefense.com/v3/__https:/www.crunchydata.com__;!!A4F2R9G_pg!dkWhEcznp8l1toENuJsDgY1GABWqYLWZBHncXtqiCaZguLzkN0U0-3JBHLCm7wE8gN0L4ZIwyXqM5pGNLIs7ORVTFh7KnJI$>
>
> Enterprise Postgres Software Products & Tech Support
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chris Papademetrious | 2026-01-08 02:37:31 | RE: is there a way to automate deduplication of strings? |
| Previous Message | Greg Sabino Mullane | 2026-01-01 14:26:12 | Re: is there a way to automate deduplication of strings? |